reply to emse - Githubissues

DEAR AUTHORS-- THIS TEXT NOW MOVED TO PAPER

Dear authors,

Three knowledgable reviewers have scrutinized the manuscript
and found it interesting and relevant,  although in need for
certain improvements. Therefore, the authors are
 invited to submit a revised version for a new review, where
 the reviewers' comments
are adressed. Particularly, this holds for:

- the connection to activities of theory building in ESE
(reviewers 1).

[x ] TM

This new draft now maps our work into the SE literature on theory building.

- clarification about the data origin and
characteristics (reviewer 1).

[x ] TM

This new draft now maps not contains extensive notes on the origin and characteristics of our data.

- use of statistical analysis (reviewers 2 and 3)

[ x] TM

All our results are now augmented with statistical significance tests as well as effect size tests. Those tests greatly strength the overall message of this paper.

- writing style, that would benefit form being more
precise (reviewer 3)

[ ] TM >

Reviewer #1: PAPER SUMMARY:

One of the most widespread beliefs in software engineering
is that the sooner you identify and resolve  an issue the
less effort it requires (the authors call it DIE).  This is
also being taught in software engineering education; fix
something in the r equirements phase and it is a quick fix,
wait until testing and you will have a to debug source code,
modify design, update documentation, and then change the
requirement - a much more complex task. This rule-of-thumb
was expressed in the 80s, and much has changed in the
software business since, e.g., programming languages,
editors, agile development and stack overflow. To what
extent does DIE actually exist in modern software
engineering projects?

The authors investigate the DIE phenomenon quantitatively by
analyzing data from 171 completed projects. All included
projects used Team Software Process (TSP), a development
methodology developed by the Software Engineering Institute
at Carnegie Mellon University. A central component in TSP is
detailed time logging, i.e., a TSP developer reports how
much time he spends on various tasks including issue
resolution. Thanks to access to a large dataset of completed
TSP-driven projects, the authors have been able to analyze
effort spent on issues in different development phases.

The results show that DIE does not apply to all projects,
and the effect is not as strong as previous publications
suggest. The authors thus conclude that DIE cannot be
assumed to always hold.  Furthermore, they present five
explanations of the observed lack-of-effect: i) DIE is a
historic relic, ii) DIE is intermittent, iii) DIE applies
only to very large systems, iv) DIE has been mitigated by
modern development methodologies, and v) DIE has been
mitigated by modern development tools. Finally, the authors
state the relevant question: how could DIE become such a
widespread idea in the literature, despite limited empirical
evidence?

REVIEW SUMMARY: + The paper investigates whether a widely
accepted "rule" in software engineering holds by looking at
a large set of empirical data. The research topic is highly
relevant and the results are interesting.  + The discussion
on reassessing old truism fits the scope of the journal very
well.  + The language in the paper is excellent, and the
structure is (although non-standard) easy to follow.

[ ] TM Thank you for those comments.

- It is not easy for the reader to understand what parts
of the "life of an issue" the authors claim are affected by
DIE, and it is also not too easy to understand what data the
authors have available. I believe adding another figure
could address both these aspects.  ```

[X] @llayman: We have added an example of the DIE in the Introduction, and more precisely defined how we measure this effect Section 6.

- The different projects studied are all grouped together. I
believe studying clusters of development contexts would be
highly interesting, but unfortunately the current
characterization of projects is inadequate.

[ ] @williamNichols: need some way to group the projects

- The authors do not relate their reassessment of an
"old truism" to the growing set of papers on theory building
in software engineering. With access to such large amounts
of empirical evidence, it should be possible to take some
steps toward improved DIE theories.

[ ] @fshull need words on this

DETAILED REVIEW: The authors should expand on what they
include in the "delayed issue effect" (DIE), in particular
in relation to standard milestones in the life of an issue
and the corresponding timespans. I would suggest adding a
figure to point out important issue milestones on a
timeline, e.g., 1) injected, 2) found/reported, 3) work on
resolution started, 4) work on resolution ended, 5) V&V
completed, 6) resolution deployed/integrated. I believe the
authors refer to an increase in time between 3) and 4) as
the DIE, but I'm not really sure. I'm particularly
interested in whether all "indirect" issue resolution steps
are covered, e.g., change impact analysis, updating
documentation, and regression testing. The effort involved
in these steps clearly depends on the development context of
the specific project. The short a-c) list on page 14:50
suggest that the issues studied are restricted to minor
fixes, i.e., no updates to architecture, changed hardware,
recertification, updated user manuals etc.

A figure with issue milestones could also help the reader
understand what data are actually available for the
analysis. Section 6.3 describes the data, but some aspects
should be further explained. Logging interruption time
sounds very useful, but I wonder how carefully the
developers actually did this -

[X] @llayman: We not provide a more explicit definition of what constitutes a defect (our study of issues is limited to defects) and the "defect lifecycle" phases we measure in Section 6.2.2. "a defect is any change to a product, after its construction, that is necessary to make the product correct". This includes both minor issues (typography) and severe issues (architecture changes, requirements errors). Time on a defect includes a) time to investigate/analyze a defect once discovered, b) time to craft and implement a fix, and c) time to verify and close the defect.

 it must be really hard to keep up the discipline required.
 If you are interrupted as a developer (e.g., phone calls,
 urgent mail, someone asks a question) I don't think the
 first thing you do is to stop a timer on your computer.
 Moreover, developers often work on several defects in
 parallel, and might interweave bug fixing with new
 development. I don't think the "interruption time" captures
 all such multi-tasking, and it should at least be properly
 discussed in Section 7 "Threats to validity".

[X] @llayman Yes, there is undoubtedly some accuracy in the time-tracking. We provide some evidence on the TSP data set's integrity in Section 6.3, and a more extensive discussion on the issue of time-tracking and defect logging accuracy in Section 7.1 on Conclusion Validity.

The discussion on reassessing old truisms (Section 3) is
interesting, but it should be complemented by the
perspective of theory building in software engineering - An
active research topic lately, with a dedicated workshop
series (GTSE). I suggest looking into the following
references for a start: Sjøberg et al. (2008) "Building
Theories in Software Engineering", Smolander and Päivärinta
(2013) "Theorizing about software development practices",
and Stol and Fitzgerald (2015) "Theory-oriented software
engineering". Considering new empirical evidence is
obviously critical to theory building, and discussing your
new results in the light of an theory creation would be
valuable. The authors mention that DIE appears to occur
intermittently in certain kinds of projects - maybe the
authors could elaborate on this idea and present an improved
DIE theory based on what they now know? I believe the
authors have the best available data to do so, and I would
expect the paper to go beyond simply questioning the "old
truth".

[ ] @fshull: is this material you know? can you do a page or two casting this work in terms of t: Sjøberg et al. (2008) "Building Theories in Software Engineering", Smolander and Päivärinta (2013) "Theorizing about software development practices", and Stol and Fitzgerald (2015) "T

The authors study three claims in this paper, and the
third claim is the central one: "delayed issues are not
harder to resolve". To study whether issues require longer
resolution times in later phases, the authors analyze a
large set of issue data from historical projects. Thus the
manuscript reports from an observational study rather than
an experiment with a controlled delivery of treatments.
While I believe the authors' approach is practical, I would
like to see a critical discussion on threats to validity of
observational studies (e.g., Madigan et al., A Systematic
Statistical Approach to Evaluating Evidence from
Observational Studies, Annu. Rev. Stat. Appl. 2014. 1:11-39
and Carlson and Morison, Study Design, Precision, and
Validity in Observational Studies, J Palliat Med. 2009 Jan;
12(1): 77-82). An alternative study design (although
difficult to realize in a system of industrial size) would
be to let different developers resolve the same issues for
the same software system, i.e., one group resolves an issue
during design, another during implementation, and a third
during testing - some discussion along these lines would
strengthen the validity section.

[X] @llayman We have reviewed and cited the mentioned papers. We have incorporated many of the concerns raised in these papers into our expanded Section 7.4 on External Validity.

The first and second claims are studied with much less
rigor. "DIE is a commonly held belief" is studied using a
survey of software engineers, both practitioners and
researchers. According to Fig. 2 (actually a table), the
number of respondents is 16 and 30 for practitioners and
researchers, respectively. Sixteen respondents from industry
represent a tiny survey of a very general claim that any
software engineer could respond to. Why were not more
answers collected? The authors do not have much evidence to
defend the first claim, and there is no discussion of the
corresponding validity threats. The second claim, "DIE is
poorly documented", is studied using a literature review.
Unfortunately, the method underlying the literature review
is not presented. Although it doesn't need to be an SLR of
the most rigorous kind, the authors should report how the
papers were identified. I suspect the terminology used to
describe the DIE phenomenon is highly diverse, thus it would
strengthen the paper if the authors reported how they
reviewed the literature. According to Page 19:34 only eight
publications were identified.

[ ] @llayman need to show that these were the only empirical results for DIE in tow

The iterative fashion of modern software development,
with agile at the extreme end, is not fully discussed (the
short discussion section could be extended). The phases of
the linear development of the 80s, (such as in Fig. 1)
probably don't exist in many of the projects in the TSP
dataset, still the authors discuss the DIE-effect from the
perspective of the 80s. Page 16:31 states "DIE has been used
to justify many changes to early lifecycle software
engineering" - does this mean the agile movement
successfully mitigated DIE? This possibility is not fully
considered in the paper.
How many of the 171 projects were
(more or less) agile? This appears to be an important
characteristic of the included projects - very important to
describe!

[X] @llayman An astute observation. We provide more information on the types of projects included in our sample in Section 6.4. More to the point -- does agile help mitigate DIE? The answer is - we don't know. A strong argument can be made that all iterative&incremental methodolgies are meant to mitigate the DIE. Further, huge advances in software engineering technology and process have occurred since DIE was first observed, many of which have precisely targeted the DIE (static analysis, test-frist development, automated build). Perhaps our analysis provides evidence that DIE can be defeated, at a large scale across multiple varied projects, using modern technologies. We do not yet know the causes here, but certainly it appears the DIE should not be treated as a truism any more, but rather a variable that can be controlled. We discuss this in Section 8.
[ ] @WilliamNichols : need a quesstimate of how many of the projects were "agile"

Concerning characterization of the 171 projects, the
paper needs to report much more detail. I would expect to
see some descriptive statistics. On page 17:18 the authors
say "perhaps we should split our SE principles into two sets
/---/" - of course SE practices need to be adapted to the
development context, and also two sets of principles is a
too simple split. The paper does not report much
characterization of the 171 projects. I strongly suggest the
authors to dig deeper into the data, and analyzing for which
types of projects the findings hold. What patterns are there
to discover? I suggest Petersen and Wohlin, "Context in
industrial software engineering research" (ESEM2009) for
details on how to characterize development contexts.
Moreover, I would really like to see what practitioners from
the 171 projects think of your findings - an interesting
option for a qualitative angle on your study.

[ ]@WilliamNichols : can you look at p402 and 403 of http://www.cse.chalmers.se/~feldt/courses/sple/papers/petersen_2009_context_in_industrial_research.pdf. do you have any statistics on our projects broken out according to the terms of that paper?

Fig 1: "Operation" dwarfs everything in Boehm's diagram.
Since you do not study anything post release in this paper,
I think this figure skews the reader's mental picture. If
you remove the rightmost bar and rescale accordingly, the
plot better matches the findings you report in Fig. 10. You
still identify an interesting result though, but the
presentation turns fairer.

[ ] @timm will do

MINOR COMMENTS: Keywords: I believe the keywords could be
improved to help readers find this work.

[ ] @llayman

Several figures are copied from previous work. Are all copyrights properly
managed?

[ ] @llayman

Fig 10: Black and red is not a good pair for grayscale printouts. Please use black and gray instead.


- [ ] @timm

Some figures are actually tables, thus their captions should be replaced accordingly.


- [ ] @llayman

Some figures should be resized to better match the page width.


- [ ] @llayman

Page 3:14 - "The above argument" Which argument? Could be precisely specified.


- [X]

Page 3:22 - Please provide the full link to the dataset.


- [ ] @timm

Page 4:8 - "More difficult" but "harder" is used in claim3. Why this inconsistency?

- [ ]

Page 18:26 - First sentence: "Unexpected results such as this one". Ambiguous reference, please be specific.


- [ ]

Page 18:27 - "We also survey the state of SW dev. practice /---/": given the size of the survey,
this statement feels a bit bold.

- [ ]

TYPOS ETC.
Sec 1, §1: [3] and [30] have been swapped?
Page 14:46 - Spell out IV&V the first (and only) time it appears. Page 3:24 - missing verb (is) Page 16:43 - "Did this study failed" Page 18:44 - Appache Page 19:19 - Al Ref 26 - oo --> OO


- [X]
_____________________________________________

Reviewer #2: The topic of this paper is the empirical evaluation of a largely admitted claim in software engineering, i.e. the exponential cost of correcting errors according to the phase in which the errors are discovered. The claim is first confirmed by surveying and interviewing practitioners and experts. Next, the authors use a large set of data from projects in SEI database. The results show that, strictly speaking, the claim does not hold (although some effects of delayed corrections can be noticed).

Globally, the paper is well written and the authors develop a convincing demonstration. The topic of the paper is highly relevant for both practitioners and people from academia. Among many, I personally believed in this claim and had regularly taught it in my courses. Thus, reading this paper offers a refreshing perspective on our understandings of software engineering background and theoretical knowledge. I also like the suggested posture that insists on the necessary skepticism that we should have against such prevailing claims.

Beyond this positive global impression, I have some concerns about the study reporting and analysis: - It seems to me that more descriptive statistics about the sample would contribute to a better understanding of the scope of the study.


- [ ] @timm more  stats

I think in particular about the size (lines of codes and/or number of software components); a duration histogram would also be relevant. Accordingly, the formula for calculating variable "Total effort" in Fig. 7 could be explained, and Figure 9 made more readable (bigger size).


- [X] @llayman We have added more description of the projects to Section 6.4. We have also added more precise definitions of our measures in Section 6.2. We have adjusted the size of figures throughout the paper to be more readable.

If certain attributes for the issues and errors are available in the data (e.g. severity, priority, etc.), they should also be brought to the readers knowledge.


- [ ] @WilliamNichols: got any stats on these

Figure 10 is central to the paper's demonstration; its expressiveness could be enhanced: i) the reader tends to think that the "right hand side bars" are different from the BLACK and RED bars, ii) the formula for calculating the 50th and 95 ratio could be provided, iii) the unit of column "Percentile" could be mentioned.


- [ ] @timm split into two

I like the discussion section, in particular, the idea of software architectures that could be a contributing factor for enhancing software evolution and reducing the cost of issues and errors fixing. Is there any available knowledge in the data set concerning any architectural design choices made in each project? If so, would it make sense to seek any correlation between these architectural choices and how expensive it was to solve issues?


- [ ] @WilliamNichols: anything on architectural style? or should
we jsut say "many and varied and hard to get a precise
picture of it all

In the same trend of ideas, I think what has fundamentally changed since early days of SE is the development of requirements engineering. In actual SE practices, problem and the solution spaces are systematically explored thanks to RE techniques; together with architecture, this could also explain why things have changed since the 70'. It could be, for example, that people tend to make less severe errors in early software project phases (thanks to RE techniques); a similar phenomenon was observed when SE practices become more mature (see Harter et al. 2012).

References Harter D. E., Kemerer C. F., Slaughter S. (2012). Does Software Process Improvement Reduce the Severity of Defects? A Longitudinal Field Study. IEEE Transactions on Software Engineering, vol. 38, n° 4, p. 810-827.


- [ ] @timm reference harter
- [X] @llayman Absolutely - there are potentially many explanations for why the DIE was NOT observed in our dataset. We discuss some more of the potential causes in Section 8.

In the conclusion section, I feel uncomfortable with the assertion "That data held no trace of the delayed issued effect" (line 17). As mentioned elsewhere in the paper, the delayed issue is not absent; it is much less significant and systematic than what is usually claimed. Moreover, this reduction has been demonstrated for medium sized projects; we cannot generalize to larger projects.


- [ ] @timm

Minor remarks: - p.9, line 29 : " … reported by Shull [52], found that the cost to find certain non-critical classes of defects …" => is it the "cost to find" or the "cost to FIX"?


- [X] @llayman It should be "cost to fix". We have corrected this.

p.10, line 21: duplicate word "there" - p.12, line19: missing word "One of THE guiding …" - p.16, line 33: "distinguish" instead of "distinguished"? - p.18, line 42: explain acronym MEAN - p.20, line 20: useless enumeration A1?


- [X] 
_______________________

Reviewer #3: I like the idea of the paper and I certainly would like to see more empirical studies that periodically check if our beliefs about software engineering practices still hold (or even if they have ever held). The authors address one of the beliefs that are more entrenched in the population of researchers and practitioners, and a very important one too.

At the same time, I think that the paper needs to be improved before it can be published.

The empirical analysis (Section 6) is a bit of a let down. I would like to see some sort of more robust and rigorous statistical analysis, based on statistical tests. Instead, the paper only provides qualitative comparisons between the results obtained at the 50-th and 95-percentiles. This is not to say that the results or the discussions are incorrect, but only that they need to be better supported. The lack of this kind of statistical analysis should have least be mentioned in the Threats to Validity. I understand that you say that your "claims" (including Claim 3) should not be considered "hypotheses" in the statistical sense and I agree with you on Claim 1 and Claim 2. However, Claim 3 lends itself to a statistical analysis, without which your results become a bit more anecdotal, and this is what you rightly criticize in the previous claims that DIE actually exists. In addition, you may want to analyze some additional and somewhat unexpected results that can be derived from your data (see Detail Comments below).


- [ ] do the stats

The paper seems to be written in a somewhat casual style, which makes the paper more pleasant to read, but less clear in some parts (see Detail Comments below).

DETAIL COMMENTS

Section 2

I'm not really sure what the authors mean by "We say that a measure collected in phase 1, ., i, .. j is very much more when that measure at phase j is larger than the sum of that measure in earlier phases 1 <= i < j." You are defining a property of a measure, but, the way this definition is written, it seems as if you are defining the property of a measure of "being much more," without any further qualifications. I can guess that you mean that a measure $m$ (collected in phases, so you can denote by $m(i)$ the value of $m$ in phase $i$) has this property in phase $j$ (so, it's a property m has in a specific phase and not a general property of the measure) if $\sum{1 <= i < i} m(i) < m(j)$. Then, it's up to you to make this a property of the measure, for example by using an existential "policy" ($m$ has this property if there exists one phase $j$ where $\sum{1 <= i < i} m(i) < m(j)$) or a universal one ($m$ has this property if for all phases $1<j$, $\sum_{1 <= i < i} m(i) < m(j)$).


- [ ] @timm just do that

The real problem with this definition, however, is that it doesn't seem to be used anywhere in the paper. The only point where it might be used is in the discussion of Figure 1 (which precedes it anyway). So, you may want to remove this property altogether.


- [ ] @timm your right. the real thing here is that here is
statistically singicicat increase

Your definition of "difficult issue" is "Issues are more difficult when their resolution takes more time or costs more (e.g. needs expensive debugging tools or the skills of expensive developers)." That's not a very precise definition, though, because it has two different interpretations, one in terms of time and the other in terms of effort. The reasons why you introduce this definition become clearer only in Section 7.2, in the Threats to Validity, and that's too late. You should move some of the discussion of Section 7.2 here.


- [ ] @timm roger

Why would the term "delayed issue effect" be a generalization of the rule "requirements errors are hardest to fix"? "Delayed" seems to be quite specific as it appears to refer to time, while "hardest to fix" may refer to other variables, like effort or cost.


- [ ] @timm text

In Claim 3, "very much more harder" sounds like a little bit too much ... ("more harder"?)


- [ ]

Section 4

A very minor issue: why is Figure 2 a ... figure, instead of a table? Same for all other tables ...


- [ ]

Section 5.1

This is not a complete sentence "All the literature described above that reports the onset of DIE prior to delivery."


- [X]

Section 6.3

You should rephrase your definition "a defect is any change," since a defect is what existed before the change was made and not the change itself.


- [X] @llayman We now provide a more precise definition of defect in Section 6.2.

You do not explain that "QualTest" in Fig. 8 is.


- [ ]

The definition of "time per defect - The total # of defects found in a plan item during a removal phase divided by the total time spent on that plan item in that phase." is not correct as it is, as this would basically be the number of defects per unit time, instead. The roles of time and defects should be reversed.


- [X] We now provide a more precise definition of time measurement related to defects in Section 6.2.

Section 6.4

Besides adding a more rigorous statistical analysis, you need improve your explanations.

Figure 9. The caption says "Distribution of defects found and fixed by phase." Is that the distribution of defects found in some phase and fixed in that phase? The text says "The distribution of defects found and fixed per phase in our data is shown in Figure 9. A high percentage of defects (44%) were found and fixed in the early phases, i.e., requirements, high level design, and design reviews and inspections." which doesn't add much. If that's the distribution of defects introduced in a phase and fixed immediately in that phase, it is not clear why you introduce it, though.


- [ ]

Figure 10. The explanation of the data and histograms in the figure is, at best, confusing. The caption says "50th and 95th percentiles of issue resolution times" and the opening sentence of Section 6.4 says "Figure 10 shows the 50th and 95th percentile of the time spent resolving issues ..." However, these are not "times," because you write "expressed as ratios of resolution time in the phase where they were first injected" a couple of paragraphs below. This seems to be the right interpretation, but right after that you write "The BLACK and RED bars show the increases for the 50th (median) and 95th percentile values, respectively," which would be a third interpretation of the data in Figure 10.


- [ ]

You should provide the value of the mean, in addition to the 50-th percentile (i.e., the median).


- [ ]

You need to at least mention (or, better, discuss) some of your results, because they appear to be "counterintuitive," since they show that, for example, median resolution times can even decrease if issues are not fixed immediately, but in later phases. For example, looking at the Reqts section, most percentages are way below 1, which seems to indicate that some issues are actually much easier to fix at later stages (maybe because more or better information about the software system is available only later). That would be a very interesting result by itself, especially if you can provide some statistical support for it.


- [ ] @timm really, no statisticall difference

Section 7.1

Maybe you could analyze the projects developed in a "traditional" way and check if there is some sort of DIE there. For example, if you look at the projects developed with the waterfall life cycle, maybe you will find larger DIE than for the other life cycles, which might partially justify the claims of previous researchers about the existence of DIE. This would also show that newer kinds of life cycles (like Agile ones) help solve DIE. You do provide hints about this in Section 8 ("We also note that other development practices have changed in ways that could mitigate the delayed issued effect") and Section 9, but that's clearly not enough.


- [ ] @williamNichols: can we separate out the agile and the trad?

Section 8

I'm not sure what you are really trying to discuss in this section, because you are not discussing the results of the paper. It sounds like you are saying that maybe it's the newer software development approaches that made DIE disappear?


- [ ]

- [ ] Verify that the Section #'s in our responses are accurate prior to submission.

ai-se / phaseDelay

reply to emse #28