Derek-Jones / ESEUR-book

Issue handling for Evidence-based Software Engineering: based on the publicly available data
http://www.knosof.co.uk/ESEUR/

Typo #11

Closed bosepchuk closed 4 years ago

bosepchuk commented 4 years ago

Analyses instead of analysis on page 11?

Derek-Jones commented 4 years ago

Yes, fixed.

If it makes your life easier, I'm happy to take a list of typos as one Github issue.

bosepchuk commented 4 years ago

I'm open to whatever. Did you mean for me to just add the typos to this issue as I find them instead of creating a new issue for each one?

One per comment?


I'm a fan, by the way. You've picked an important topic for this book.


On page 8. "Many low-cost processor have a very simple architecture with relatively few instructions and parameter passing..."

Processor should be processors?

Derek-Jones commented 4 years ago

I want to reduce the friction of reporting issues. It does not matter to me how they are reported. So I am happy to go with whatever is easiest for you.

Yes, processor should be processors.

bosepchuk commented 4 years ago

Page 22. "Early researchers, investigating human behavior, found that people do not always respond ways that are consistent with the mathematically optimal models of behavior that had been created."

Should be "respond in ways..."?

Derek-Jones commented 4 years ago

Yes.

Derek-Jones commented 4 years ago

Feel free to make comments on the material. It's a whirlwind tour intended to give a basic understanding and motivate an interest in learning more.

bosepchuk commented 4 years ago

Let me get through a little more of the material before I venture any editorial comments.


Page 25. "affect reading performance&emdash" Incorrect character escaping?

Derek-Jones commented 4 years ago

&emdash missing a ; (which the document processing system maps to a character of said name)

bosepchuk commented 4 years ago

I downloaded a copy of the document, highlighted and added comments to it, then uploaded it to Dropbox, which you can see at the following link:

https://www.dropbox.com/s/qbqr62mxxm2x0ow/ESEUR-draft.pdf?dl=0

My first comment is on page 27 and my last is on page 113.

Derek-Jones commented 4 years ago

Thanks. Got it.

I am working backwards through the chapters reviewing them, starting at page 205. I have reached page 130 (which you will soon hit).

I can create an update for you to work from, later today. In places this update will have poor alignment between text and margin figures (which might even overlay tables).

bosepchuk commented 4 years ago

That sounds great. My email is bosepchuk@gmail.com. I'm okay with continuing to communicate in this issue, but we can switch to email if you want a little more privacy. Your call.

Derek-Jones commented 4 years ago

I have fixed all your suggested missing words & typos.

For instance, a flying object with feathers, and a beak might be assigned to the category bird, which suggests the characteristics of laying eggs and being migratory.

There are many non-profit migratory birds. Perhaps replace migratory with nesting?

My words are a shortened version of what appears in the paper. Not sure about the term "non-profit". I have heard of the profit bird, but Google does not return anything related for "non-profit migratory" bird?

Right alignment of: Alfred North Whitehead: "It is a profoundly...

Either me being lazy and switching the alignment for this case, or thinking it makes the text stand out. But then it occurs later. Yes, need to get sidenotes to adjust itself to double page layout...

Without reliable techniques for measuring personality traits, it is not possible to isolate characteristics likely to be beneficial or detrimental to software development. Perhaps one of the most important traits is an ability to concentrate for large amounts of time on particular activities.

Do you have evidence for making this claim?

None, and looking around I cannot find any. I have seen this said in several places, and probably included it hoping to find data.

Changed to: For instance, how important is the ability to concentrate for large amounts of time on particular activities.

I have to disagree with you here. The worst programmers actually set the project back. Every hour they work requires someone else to spend more than an hour fixing everything they broke. So, the best programmers are infinitely more productive than the worst.

Yes, I have worked with such people.

Any analysis has to cover the common, or average case, and perhaps point out the outliers. I don't have any data, so cannot do any analysis.

The 28-to-1 claim has to be discussed because it is part of folklore, although I'm not sure how common it is these days.

bosepchuk commented 4 years ago

On Thu, Sep 17, 2020 at 20:12 Derek M. Jones notifications@github.com wrote:

I have fixed all your suggested missing words & typos.

Awesome.

There are many non-profit migratory birds. Perhaps replace migratory with nesting?

Sorry. That's an auto-complete error on my ipad. What I meant to write is "there are many non-migratory birds." (so that's poor example to use as a characteristic of all birds).

Do you have evidence for making this claim?

None, and looking around I cannot find any. I have seen this said in several places, and probably included it hoping to find data.

I have too. Lots of stories and theories, but no data.

The 28-to-1 claim has to be discussed because it is part of folklore, although I'm not sure how common it is these days.

There are frequent discussions of developer productivity on the blogs I read. And 10x is more common than 28x, but those ideas are definitely still circulating.

The way I most commonly see this topic framed is that the best programmers are X times (ie 10 times) more productive than the worst programmers. And if that is, indeed, the question, then the answer is that the best programmers are infinitely more productive than the worst programmers, as a matter of simple arithmetic (if the worst programmers have negative productivity).

Of course, you'd prefer data. But before you can design an experiment to prove or disprove this idea you need a reliable and valid way to measure productivity, which software engineering does not have.

So I'll leave it to you what you write there.

bosepchuk commented 4 years ago

Comments from page 121 to page 133 are here: https://www.dropbox.com/s/a0q1blkx3cg71dt/ESEUR-draft-preview_2020-09-17.pdf?dl=0

Derek-Jones commented 4 years ago

Response to comments up to the start of page 132:

I have fixed all your suggested extra/missing words & typos.

"The first go-around at it was about $750 million, so you figure that's not a bad cost ... investment analysis for a $750 million IT investment that turned into a billion dollars."1748

The highlighted text doesn't appear to fit with the previous paragraph. Maybe it needs an introduction?

Moved paragraph that followed it to before it, and added: "For instance:"

It can be unwise to ask clients why they want the software. Be thankful that somebody is willing to pay to have bespoke software written,732 creating employment for software developers. For instance:

A project may fail because the users of a system resist its introduction into their workflow1054 (e.g., they perceive its use as a threat to their authority).

Or it's slower, or awkward, or ...

Yes, there are other reasons. If the list starts to get too long it starts to look like evidence of the top items.

What do the blue lines in figure 5.7 mean?

Added: ", with 95% confidence intervals."
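
(For anyone wondering what such intervals look like in code, a minimal R sketch of a fitted line with its 95% confidence interval, using made-up data rather than the book's:)

```r
# Minimal sketch: fit a straight line and overlay the 95% confidence
# interval of the mean prediction (synthetic data, illustrative only).
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50, sd = 2)
mod <- lm(y ~ x)

new_x <- data.frame(x = seq(1, 50, by = 0.5))
pred <- predict(mod, new_x, interval = "confidence", level = 0.95)

plot(x, y, pch = 20)
lines(new_x$x, pred[, "fit"])                # fitted mean
lines(new_x$x, pred[, "lwr"], col = "blue")  # lower 95% bound
lines(new_x$x, pred[, "upr"], col = "blue")  # upper 95% bound
```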

the vendor will no longer support it), the vendor changes the functionality or configuration of the system resulting in unpleasant for one or more client,

something's not right here

Changed to: ... the vendor changes the functionality or configuration of the system resulting in unpleasant or unintended consequences for one or more clients,...

While the reasons of wanting a cost estimate cannot be disputed,

reasons for wanting...?

I did not think an example was worthwhile, otherwise it could be disputed. Would you dispute this desire for a cost estimate?

Cost overruns are often blamed on poor project management,1071 however, the estimates made be the product of rational thinking during the bidding process (e.g., a low value

Something's wrong with the highlighted text

"made be" -> "may be"

Software projects and public are awarded to the lowest bidder over budget. But is that because they are underestimated or because the vendors want the work and know they need to bid below cost to get the work and hope to make a profit on additional fees or follow-up work?

I think you mean "under budget". Yes to your questions. Covered earlier/later in chapter.

Figure 5.10 shows that as the size of a project increases, the percentage of project effort consumed by management time increases.

Does it? Another interpretation of that data might be that it tops out around 20-25%. It seems unreasonable to expect to see management effort completely overtake development effort on the world's largest software projects.

Yes, and the fitted quadratic equation is misleading. Something better is needed...
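
(One candidate for "something better" is a saturating, logistic-style fit. A minimal R sketch, using synthetic data rather than the data behind figure 5.10:)

```r
# Minimal sketch: fit a self-starting logistic curve to the percentage
# of effort consumed by management, against (log) project size.
# All numbers below are made up, purely to illustrate the technique.
set.seed(42)
size <- exp(runif(100, log(10), log(5000)))
mgmt_pct <- 15 / (1 + exp(-(log(size) - 4) / 0.8)) + rnorm(100, sd = 1.5)

fit <- nls(mgmt_pct ~ SSlogis(log(size), Asym, xmid, scal))
summary(fit)   # Asym estimates the plateau percentage

plot(size, mgmt_pct, log = "x", pch = 20,
     xlab = "Project size", ylab = "Management effort (%)")
lines(sort(size), predict(fit)[order(size)], col = "red")
```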

Derek-Jones commented 4 years ago

Response to comments up to the start of page 164:

I have fixed all your suggested extra/missing words & typos.

Figure 5.10 shows that as the size of a project increases, the percentage of project effort consumed by management time increases.

Does it? Another interpretation of that data might be that it tops out around 20-25%. It seems unreasonable to expect to see management effort completely overtake development effort on the world's largest software projects.

A logistic regression is a better fit, but not perfect; it tops out at 12 and 16%.

<<ahonen2015, Figure>> shows that as the size of a project increases, the percentage of project effort consumed by management time rapidly increases to a plateau, with fixed-price contracts involving a greater percentage of management time.

I'm confused by this data set. The team size more than doubled, but everything is expressed in work days. Couldn't that explain much of the variation over time?

The 7digital data is elapsed time to implement a feature (in days). No information is available on the number of people working on any feature.

In other words, there is no single process dominating implementation time; improving feature delivery time requires improving many different processes (the average elapsed duration to implement a feature has decreased over time).

Maybe. Donald Reinertsen wrote that the biggest contributor to feature delivery duration is how long work in progress spends waiting ... Is there any chance your dataset is amenable to that kind of analysis?

Not in the 7digital data. But the SiP data has this information, see Figure 5.44.

$E_{a1} D_r < (E_{a1} D_t - T_e) + (E_{a2} + E_n)(D_r - D_t)$,

I analyzed my time sheets from a recent year and discovered I spent 10% of my time on recruiting activities (screen resumes, interviews, meetings to discuss potential hires, etc.).

This time is in the equation, or rather it is not part of the equation. Ea1 is the total daily team effort, so recruiting is not included (it would need to be included if percentages were involved).

What are the blue lines...

Going through and adding "with 95% confidence intervals"

... bug supporting useful functionality.155

Something's not right here.

... but supporting useful functionality.155

This chapter discusses the kinds of mistakes made, where they occur in the development process, methods used to locate them and techniques for estimating how many fault experiences can potentially occur.

I would make this the first sentence in this chapter.

I want the chapters to start with a bang, not boring housekeeping stuff.

I don't understand "a replication"

Me trying to be too clever (removed).

All mistakes have the potential to have costly consequences, 37

This seems unlikely to be true.

In the vast majority of cases the probability is very low. It's all about context.

How robust is code to small changes to the correct value of a variable?

Or the code after the faulty assignment is modified in such a way that the faulty assignment can now trigger a fault experience.

Yes, the original assignment might not have been correct.

It's also possible that the faulty assignment is in unreachable code (aka dead code).

Yes.

Both are possible situations that introduce some uncertainty into the result.

The root cause of a mistake,

You make it sound like there are only two kinds of mistakes programmers can make, which clearly isn't true.

There are only three root causes (as defined by the cited author, who everybody else follows). I had forgotten one...

Changed to: The root cause of a mistake, made by a person, may be knowledge based (e.g., lack of knowledge about the semantics of the programming language used), rule based (e.g., failure to correctly apply known coding rules), or skill based (e.g., failing to copy the correct value of a numeric constant in an assignment).

There appear to be two distinct language groupings, each having similar successful compilation rates; one commonality of languages in each group is requiring, or not, variables to be declared before use.

Is this a round-about way of saying that mutations of programs written in compiled languages were much more strict about the source code they would compile compared to interpreted languages?

The problem is that people use a variety of definitions of compiled. Technically they are all compiled, in that they are converted to some internal form. Some get converted to machine code, which is a common definition of compiler. The roundabout wording gets me off the hook.

Figure 6.35

It's hard to tell which language is which in the blue-green range, and I'm not color blind.

Color selection for this plot uses the gentler hcl approach. Code not touched since 2015. Gone for the full rainbow, blow their eyes out plot :-)
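
(For the curious, the difference between the two approaches, sketched with made-up data; hcl() gives gentler, evenly spaced colors, rainbow() maximises separation:)

```r
# Illustrative only: compare an hcl-based palette with rainbow().
n <- 8
pal_hcl     <- hcl(h = seq(15, 375, length.out = n + 1)[1:n], c = 100, l = 65)
pal_rainbow <- rainbow(n)

y <- matrix(cumsum(rnorm(50 * n)), ncol = n)   # made-up lines to color
matplot(y, type = "l", lty = 1, col = pal_hcl,     main = "hcl palette")
matplot(y, type = "l", lty = 1, col = pal_rainbow, main = "rainbow palette")
```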

A metric that assigns a value to individual functions (i.e., its value is calculated from the contents of single functions) cannot be used as a control mechanism (i.e., require that

Just because a programmer can manipulate the McCabe complexity of her functions to stay below a threshold doesn't mean that it's not a useful practice for improving the understandability of a program.

McCabe's complexity measures no such thing, and I don't know of any metric that measures understandability.

Breaking a function into part1(), part2(), part3() may manipulate the McCabe metric and not add anything useful to the understandability or maintainability of the program but such manipulations are easy to catch in a code review.

There are plenty of code reviews that recommend splitting a function up for this very reason. I know consultants who demand such splitting up. It's all accounting fraud.

... can help programmers identify code that could be improved by reducing its complexity.

I don't know of any metric that reliably measures complexity. In fact I don't know what program complexity really is.

The fixes for most fault reports involve changing a few lines in a single function, and these changes occur within a single file. A study701 of over 1,000 projects for each of C,

This seems unlikely to me. I'd believe it if the study was only looking at defects introduced in the coding phase and reported by users after the software was tested and put into production.

But about 50% of defects are present in projects before coding begins, and fixing requirements and design errors involves much greater code changes than fixing defects introduced during coding.

Can I have a copy of your data :-)

"fault reports" -> "user fault reports"

Added: One study found that 36% of mistakes logged during development were made in phases that came before coding (Team Software Process was used and many of the mistakes may have been minor); see rexample[reliability/2018_005_defects.R].

against normalized

What does normalized mean in this context?

against normalized (i.e., each maximum is 100) number of commits made while making these changes
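
(Normalisation here just means rescaling each series so its maximum becomes 100; a one-liner in R, with made-up commit counts:)

```r
# Rescale a vector so its largest value becomes 100.
normalise_to_100 <- function(x) 100 * x / max(x, na.rm = TRUE)

commits <- c(3, 17, 42, 8)    # made-up commit counts
normalise_to_100(commits)     # approx. 7.1 40.5 100.0 19.0
```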

... found 1,784 undocumented error return codes

1784 isn't very meaningful without knowing how many functions were analyzed.

What percentage of functions had documented return codes that did not match the source?

A study by Rubio-Gonzalez and Liblit <book Rubio-Gonzalez_10> investigated the source code of 52 Linux file systems, which invoked 42 different system calls and returned 30 different system error codes. The 871 KLOC contained 1,784 instances of undocumented error return codes; see rexample[projects/err-code_mismatch.R]. A study by Ma, Liu and Forin

Figure 6.40 shows how the number of bit-flips increased over time (measured in Mega-bits per hour), for SRAM fabricated using 130 nm, 65 nm and 40 nm processes. The 130 nm and 65 nm measurements were made underground, and the lower rate of bit-flips for

How can we draw any conclusions when so many things are changing?

The rule for the book was to have data for what was discussed. I managed to get these researchers to send me this data. I'm guessing they might eventually publish more elsewhere.

Optimizing for reliability can be traded off against performance,1271 e.g., ordering register usage such that the average interval between load and last usage is reduced.1955

Has optimizing for reliability ever not imposed a performance hit?

Sometimes (often?) there are many ways of doing the same thing for the same performance, and a (random) choice gets made. There may be a reliability oriented choice available (I have never done this stuff in a real compiler, so I don't know).

So-called formal proofs of correctness are essentially a form of N-version programming,

I've never heard formal proofs described this way.

Formal methods researchers love to use fancy words to obscure what are often straightforward techniques.

I'm thinking of TLA+, Z, SPARK (the formal subset of Ada). I guess the Z spec could be considered the first program and the actual source code the second. Is that your meaning?

Yes. They have some great marketing terminology, like "discharging obligations".

Two mistakes per KLOC seems really high compared to the data I've seen from the Adacore presentations.

Presentations in general show products in the best light.

talking about 1-2 orders of magnitude fewer defects than that for their delivered software using all the quality methods they deem prudent, not just formal proofs.

If you spend the time and money this can be achieved. The projects they are referring to have spent the time and money.

The version of the report referenced in 1349 that I found online only contains 120 pages but the footnote references pages 69, 137, and 164.

My copy has 124 pages, and does not reference page 137.

1. I looked up the reference and this software was produced using TSP, which is incredibly focused on finding and removing errors as soon as possible in the development process. This kind of development is extremely different from that done by > 99% of developers. So I question the fairness of including this kind of data without any mention of the extreme rarity of the development process used.

Added wording saying that TSP was used.

It's project data, so I discuss it. The only people who measure this stuff in volume are doing specialist work.

2. I find it hard to believe that 72% of defects could be fixed in less than 10 minutes, unless there's some really interesting definition of 'fixed' in use.

I do as well. I think that most were trivial typos. The author is busy at the moment, as I am. But is interested in talking more later.

bosepchuk commented 4 years ago

Thanks for taking the time to reply.

Are my comments helpful or just adding to your workload? Is there anything you'd like me to do differently?

On Mon, Sep 21, 2020 at 20:32 Derek M. Jones notifications@github.com wrote:

I want the chapters to start with a bang, not boring housekeeping stuff.

Of course. Totally your call.

Can I have a copy of your data :-)

I've seen that in several places. Watts Humphrey was the most reliable person who I heard say it. He tracked practically everything when he was at IBM. But, he just shared the stat, not the data. Sorry.

Sometimes (often?) there are many ways of doing the same thing for the same performance, and a (random) choice gets made. There may be a reliability oriented choice available (I have never done this stuff in a real compiler, so I don't know).

Yeah, I was thinking more about multi-channel voting computers and spacecraft that write everything to RAM three times and then vote when they read back the values. But now that we are talking about it I recall hearing about a compiler someone modified to put guard regions around arrays so if you accidentally wrote past the end, you would be less likely to corrupt important data. Other than losing some RAM, I suppose that's not a performance hit.

Presentations in general show products in the best light.

Agreed. They were mostly defence projects, and I imagine no expense was spared.

Derek-Jones commented 4 years ago

Yes, your comments are helpful. I have been sloppy in places and you have helped me improve the wording.

Watts Humphrey was the most reliable person who I heard say it.

I emailed him fishing for data. He did not have any, or at least any he could share.

Derek-Jones commented 4 years ago

Figure ??

If the commands for a figure appear after the text that ends a page, the figure disappears, so there is no label for the Figure to refer to. A weird feature, dare I say bug, of LaTeX. This is one of the issues I have to fix up (by moving the commands to plot the figure) before a release.

Are the sharp declines in new fault experiences due to the winding down of investment in the closing weeks of testing?

... The defect discovery rate is also really low compared to what I'd find at my work. Are the projects unusual in some way? TSP? Safety critical?

(normalised) -> (each normalised to sum to 100)

Reworded last sentence:

"The sharp decline in new fault experiences may be due to there being few mistakes remaining, a winding down of investment in the closing weeks of testing (i.e., rerunning the same tests with the same input), or some other behavior."

finds that the counts are not statistically different across method of detection;

What does that mean for the use of fuzzers?

Oops, an important omission on my part: "... finds that the differences between the counts are not statistically different across method of detection; see"
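
(For what it's worth, one common way of comparing such counts is a chi-squared test; the sketch below uses made-up numbers and is not necessarily the analysis behind the book's figure.)

```r
# Made-up counts of fault experiences by detection method; the null
# hypothesis is that each method is equally likely to find a fault.
counts <- c(fuzzing = 41, unit_tests = 37, code_review = 45)
chisq.test(counts)
```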

I have no idea what this figure means.

What's a DFA node?

What are edge-fail, edge-match, node-fail, and node-match?

Wording is missing important details. Modified to:

"A regular expression can be represented as a deterministic finite state automata (DFA), with nodes denoting states and each edge denoting a basic subcomponent of the regular expression. Coverage testing of a regular expression involves counting the number of nodes and edges visited by the test input.

<<fse2018, Figure>> shows a violin plot of the percentage of regular expression components having a given coverage. The nodes and edges of the DFA representation of each of the 15,096 regular expressions are the components measured, using the corresponding test input strings for each regex; coverage is measured for both matching and failing inputs.

.Violin plots of percentage of regular expression components having a given coverage (measured using the nodes and edges of the DFA representation of the regular expression, broken down by the match failing/succeeding) for 15,096 regular expressions, when passed the corresponding project test input strings. Data kindly provided by Wang."

The generation of the change patterns used in combinatorial testing can be very similar to those used in the design of experiments

I'm happy to see you mention combinatoric testing but you didn't really explain why it's helpful/important.

Yes, I forgot to sell the benefits. Added: ", and the same techniques can be used to help minimise the number of test cases"
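
(To make the benefit concrete, a hedged R sketch with made-up configuration options: the full factorial design versus a hand-built pairwise covering array.)

```r
# Full factorial: every combination of three two-level options.
full <- expand.grid(cache = c("on", "off"),
                    tls   = c("on", "off"),
                    ipv6  = c("on", "off"))
nrow(full)       # 8 test cases

# Pairwise covering array: every pair of settings for any two options
# appears in at least one row, using only 4 test cases.
pairwise <- data.frame(cache = c("on", "on",  "off", "off"),
                       tls   = c("on", "off", "on",  "off"),
                       ipv6  = c("on", "off", "off", "on"))
nrow(pairwise)   # 4 test cases
```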

I'm not sure this is a weakness of MC/DC, as much as a fact about how it works.

A fact can describe a weakness.

I think the weakness of MC/DC is that:

  1. it is extremely tedious (and expensive) to write the number of tests that MC/DC requires

You pay for what you get.

Added: footnote:[Some tools track variables appearing in conditionals that have previously been assigned expressions whose evaluation involved equality or relational operators.]

Figure 6.51

The lines for the models are difficult to read for everything except the blue.

I have improved the spread of colors, but over the range fitted the colors are what they are.

positive than negative tests,

Error and non-error tests?

Google returns 288K hits for "negative test" software filetype:pdf and 37K hits for "error test" software filetype:pdf

Added a footnote: "footnote:[Sometimes known as error and non-error tests.]"

What did they find?

Had to get my brain into gear for this one. Updated text to:

A study by Do, Mirarab, Tahvildari and Rothermel investigated test case prioritization, and attempted to model the costs involved; multiple versions of five programs (from 3 to 15 versions) and their respective regression suites were used as the benchmark. The performance of six test prioritization algorithms was compared, based on the number of seeded coding mistakes detected when test resource usage was limited (reductions of 25, 50 and 75% were used). The one consistent finding was that the number of faults experienced decreased as the amount of test resource used decreased; there were interactions between program version and test prioritization algorithm; see rexample[reliability/fse-08.R].

I find axes expressed in scientific notation to be difficult to read (across all the figures in the book, not just this one figure).

Yes, R axis labels are stuck in the 1970s. I am slowly fixing things. The magicaxis package will be used to produce proper scientific notation for the stubborn cases.
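
(A sketch of the intended usage, with made-up data and labels; the exact parameters may differ from what ends up in the book's scripts:)

```r
# Illustrative only: log-log plot with 10^x style axis labels drawn by
# the magicaxis package instead of R's default exponent notation.
# install.packages("magicaxis")
library(magicaxis)

x <- 10^runif(200, 1, 6)
y <- x^1.1 * 10^rnorm(200, sd = 0.2)

plot(x, y, log = "xy", pch = 20, axes = FALSE,
     xlab = "Size", ylab = "Faults")
magaxis(side = 1:2)   # log-aware axes with proper scientific notation
box()
```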

developer mistakes (figure 6.9 suggests that most of the code containing mistakes is modified/deleted before a fault is reported).

I'm not sure that's what figure 6.9 shows.

A survival curve of a particular coding mistake being flagged; covered in 150 pages time.

Code flagged by static analysis isn't necessarily wrong (false positives).

The two cases flagged were selected such that it was unlikely they would be false positives.

Could figure 6.9 include code from the early life of those projects, when there was major code churn?

Samba is a mature project, don't know about pixie.

at the time of writing there is little if any evidence available showing that any construct is more/less likely to have some desirable characteristic, compared to another, e.g., less costly to

Little to no evidence? For any construct? That seems unlikely. Of course, we'd always like to see more (and better) research but I don't think we are starting from zero in 2020.

We are more or less starting with nothing in 2020, and probably for years to come, since this form of experimentation is not popular with researchers.

In "Code Complete" Steve McConnell lists several studies in support of his recommendation for limiting routines to 150-200 lines and against very short routines.

The existing 'magic' length effect studies have all been debunked. There might be an effect, but the existing studies don't find one.

But the whole book is full of references to research on comments, variable naming, complexity, information hiding, coupling, defensive programming, etc.

Full of references to papers describing shoddy experiments involving small numbers of students, short programs and stone-age statistics.

What am I missing?

That people take a citation on trust and don't go and check it out.

formatting problem?

Yes, one of the things I said occurs in unreleased versions.

Figure ??

Another page boundary issue.

Figure 7.26 shows kind of source changes ranked by percentage occurrence, and exponentials fitted over a range of ranks (red lines).

I don't understand the significance of the figure.

The decrease in percentage is not very steep. So lots of changes are common, i.e., this is not an 80/20 situation. The fitted regression lines are an odd pattern; no ideas there.

Given that most functions are only ever modified by the original author (see fig 7.15), the primary beneficiary of any investment in naming of local identifiers is likely to be the developer who created them.

I'm not so sure this is true.

For writing it's true.

In my experience code is read many more times than it is modified. And more lines of code are read than modified in any given coding session.

This may be true. But reading is not recorded in the data. How much reading of other people's code occurs? Don't know, but I suspect not as much as people think.

885 appears to be behind a paywall. I couldn't find a free version. "The {Linux} Kernel as a Case Study in Software Evolution" is available on Semantic Scholar.

Two source code metrics, proposed in the 1970s, have become established folklore within

"In summary, there is no strong linear correlation between CC and SLOC of Java methods, so we do not conclude that CC is redundant with SLOC."

Ok, it only explains 85%, not 100%, of the variance; see Figure 7.35. The correlation is not linear; the power is 1.1 rather than 1.
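
(A minimal R sketch of that kind of power-law fit, done on the log scale with synthetic data; the coefficients are made up and only illustrate the technique:)

```r
# Fit CC ~ SLOC^b by regressing on logs, then inspect the exponent and
# the fraction of variance explained (synthetic data, illustrative only).
set.seed(1)
sloc <- round(exp(runif(500, log(5), log(500))))
cc   <- round(0.2 * sloc^1.1 * exp(rnorm(500, sd = 0.3))) + 1

fit <- lm(log(cc) ~ log(sloc))
coef(fit)["log(sloc)"]    # exponent b, close to 1.1
summary(fit)$r.squared    # fraction of variance explained
```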

The report is on a Princeton library server that comes and goes. https://docs.lib.purdue.edu/cstech/index.14.html