ai-se / process-product

Process Vs Product Metrics

Reviewer 3 #4

Open Suvodeep90 opened 3 years ago

Suvodeep90 commented 3 years ago
  1. However, the study features a number of unexpected design choices that somehow over-simplify it. In particular, relating to some extent to the authors' earlier work on local vs. global models, the methodology in this paper is extremely "global", basically comparing 700 projects head-to-head without distinction in terms of age, community size, activity, domain, #releases, etc. The high-level stats in Table 4, especially for process metrics, but also the data statistics (including the imbalanced nature of the defect data), suggest that a 700-project study is somehow comparing apples to oranges.
  2. I believe that clustering the projects, then performing analyses within the clusters, would still yield an "in-the-large" analysis, but a more meaningful one at that. In particular, such a cluster-based analysis could validate the extent to which the in-the-large results are really due to the scale of the data vs. the context of specific projects (e.g., domain, size, age, time between releases, etc.). If context is the decisive factor, this would mean that the in-the-small results were actually correct in the context of the studied projects. Along the same lines, the current in-the-large results could be dominated by specific types of projects, leading in turn to incorrect generalizations.
  3. Apart from clustering, the paper would improve substantially by manually checking some of the many outlier data points, since there could be very interesting stories or observations in those that might deepen the discussion of the RQs' results. Right now, the paper does not go very deep into the interpretation of the findings, often stopping at formulating a hypothesis for future work. For example, the time-aware models could lead to highly different results for projects with few releases compared to those with weekly releases.
  4. Furthermore, the paper is rather vague regarding the actual value of in-the-large analysis. First of all, the initial pages of the paper do not define "in-the-large" well. In fact, the concept could easily be misread as meaning "pooling the data of all projects together, then building one model" instead of "building models for many projects, then pooling their results". Second, the paper does not really discuss how companies should interpret the in-the-large findings. Do they need to care about the in-the-large results (since those are validated on a larger sample), or only about the in-the-small results (since they will be building models for only a few projects)? Do future papers from now on need to stick to in-the-large studies, or should they just make sure to use the in-the-large findings about metric importance and learning algorithms in their in-the-small studies? In other words, what are really the implications of this work?
  5. The literature review in Section 2.2 also raises a number of questions. It seems like the in-the-small papers (the larger population, not just the 2 reproduced papers) do not agree among themselves on issues like the most important complexity metrics. As such, how do the results of this paper that disagree with the 2 reproduced papers relate to those other papers? Are there in-the-small studies that actually obtained similar findings as this in-the-large study (e.g., recommending random forest models)?
  6. Regarding the choice of papers, what criteria were used by the authors to identify the "important or highly influential" papers that were added on top of the papers with sufficient citations? Similarly, what was the last paper added to the surveyed set? It would also be relevant to add the number of papers comparing or using process and product metrics that eventually were excluded, since this seems to be a much larger set than the paper suggests.
  7. Another design choice that was unexpected was the lack of hyper-parameter tuning in the models, especially since the authors' prior work has stressed the importance of such tuning. The paper states that "the performance of the default parameters was so promising that we left such optimization for future work", yet earlier studies have found unstable defect prediction results across hyper-parameter configurations, which could impact the results of the study.
  8. Furthermore, in RQ8, the in-the-small models use a logistic regression learner instead of random forest. While this is based on earlier findings that such models work better in-the-small, the fact that this paper found random forest to be better in-the-large could warrant the use of random forest in-the-small as well. I guess this goes back to my earlier comment on how the in-the-large results should be used by practitioners.
  9. Some other expressions, like "A defect in software is a failure or error" or "commits which have bugs in them" (bug-introducing or -fixing?), should be rephrased.
  10. Finally, it is not clear whether statistical tests are used on the boxplots in the earlier RQs and, if so, whether a Bonferroni correction is applied, since there are typically three groups to compare against each other (see the sketch after this list).
  11. Other Questions
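Regarding point 10 above: a minimal sketch of what a Bonferroni-corrected pairwise comparison could look like, assuming three groups of per-project scores and scipy's Mann-Whitney U test (the group names and values here are hypothetical, not taken from the paper):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Hypothetical per-project AUC scores for the three groups typically compared
groups = {
    "process":  [0.78, 0.81, 0.74, 0.80, 0.77],
    "product":  [0.66, 0.70, 0.69, 0.72, 0.68],
    "combined": [0.79, 0.82, 0.75, 0.81, 0.78],
}

pairs = list(combinations(groups, 2))   # 3 pairwise comparisons
alpha = 0.05 / len(pairs)               # Bonferroni-corrected significance level

for a, b in pairs:
    stat, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: U={stat:.1f}, p={p:.4f} -> {verdict} at alpha={alpha:.4f}")
```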

Page 2: Content: " unwise to trust metric importance results from analytics in-the-small studies since those change, dramatically when moving to analytics in-the-large" Comment: OK, but why? The current motivation is not that strong.

Page 3: Content: "Now we can access data on hundreds to thousands of projects. How does this change software analytics?" Comment: In what respect? Does this refer to pooling data of multiple projects together, or to building models for more projects? The motivation is rather vague at this point.

Page 3: Content: " For example, for 722,471 commits studied in this paper, data collected required 500 days of CPU (using five machines, 16 cores, 7 days)." Comment: OK, but actual companies building a model for themselves would not need to do this on thousands of projects, hence the effort would be lower for them?

Page 6: Content: " in both released based and JIT based setting. A" Comment: These study settings should be mentioned more explicitly before the RQs.

Page 6: Content: "process metrics have significantly lower correlation than product metrics in both released based and JIT based setting" Comment: Is this a good or a bad thing?

Table 2: Do metrics like "age" only apply to JIT models? What are "neighbors"? What is "recent"?

Table 3: It might be better to order the papers by year to prove the point about recent studies including relatively few projects in their data set.

Page 11: Content: "The papers in the intersection are [60, 48, 24, 6] explore both process and product metrics." Comment: Where is the 5th one?

Page 12: Content: " more than 8 issues." Comment: Why 8?

Page 12: Content: " eight contributors." Comment: Why 8?

Page 12: Content: " modified version of Commit Guru [65] " Comment: What modifications were made?

Table 4: Are these metrics calculated on the last version of each repo, for java files only, or across all commits of all repos? The latter could bias the results to older, larger projects?

Page 13: Content: " using a keyword based search." Comment: What keywords were used?

Page 14: Content: " uses SZZ algorithm" Comment: Which SZZ implementation? Is the bug report date-heuristic used?

Page 14: Content: " use the release number, release date information supplied from the API to group commits into releases and thus dividing each project into multiple releases for each of the metrics." Comment: Did all projects have releases?

Page 14: Content: " or was changed in a defective child commit." Comment: Why?

Page 16: Content: "But by reporting on results from both methods, it is more likely that other researchers will be able to compare their results against ours." Comment: Nice.

Page 20: Content: "see any significant benefit when accessing the performance in regards to the Popt20, which is another effort aware evaluation criteria used by Kamei et al. and this study." Comment: Somehow, even product metrics seem to perform equally well on this metric.

Page 21: Content: "With the exception of AUC" Comment: Popt20?

Page 23: Content: " evident from the results, that file level prediction shows statistically significant improvement " Comment: Supported by statistical test results?

Page 24: Content: " then check each of the 3 subsequent releases" Comment: In terms of what?

Page 24: Content: " see in both process based and product based models the Popt20 does significantly better in the third release" Comment: Perhaps many projects have only a few releases?

Page 24: Content: " This basically means if either process or product metrics can capture such differences, then the metric values for a file between release R and R+1 should not be highly correlated." Comment: Since process metrics capture the development process, would a low correlation imply changes in the process?

Page 24: Content: " Spearman correlation values for every file between two consecutive releases for all the projects explored as a violin plot for each type of metrics." Comment: Basically, for each file there is one Spearman correlation across all its metrics, then those correlations are aggregated across all files, all commits and all projects into one violin plot?
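If the computation is as paraphrased in the comment above, a minimal sketch of it could look like the following (purely illustrative; the data is synthetic and the one-row-per-file frame layout is an assumption, not the paper's actual pipeline):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical layout for one project: one row per file, one column per metric.
# release_r and release_r1 hold the metric values at releases R and R+1.
release_r = pd.DataFrame(rng.random((100, 10)))               # 100 files x 10 metrics
release_r1 = release_r + rng.normal(0, 0.1, release_r.shape)  # slightly perturbed

# One Spearman correlation per file, computed across its vector of metrics
per_file_corr = []
for i in range(len(release_r)):
    rho, _ = spearmanr(release_r.iloc[i], release_r1.iloc[i])
    per_file_corr.append(rho)

# Pooled over all consecutive release pairs and all projects, these values
# would form the distribution drawn as one violin per metric type.
print(f"median per-file correlation: {np.median(per_file_corr):.3f}")
```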

Page 27: Content: "indicate the models are probability learning to predict the same set of files defective and finding the same defect percentage in the test set as training set and it is not able to properly differentiate between defective and non-defective files." Comment: What is the difference with RQ5?

Page 27: Content: "Spearman rank correlation between the learned and predicted probability for models built using process and product metrics." Comment: Is this analysis per file, then aggregated across all files and all projects?

Page 27: Content: " part 3 only contains files which are defective in training and not in test set," Comment: The other way around?

Page 28: Content: " using both process and product metrics " Comment: "both" or "either"?

Page 30: Content: " sorted by the absolute value of their β-coefficients within the learned regression equation." Comment: Are coefficients comparable across the different metrics in a logistic regression model? Why not use odds ratios?
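To illustrate the odds-ratio suggestion: in a logistic regression, exp(β) is the odds ratio, and standardizing the metrics beforehand makes the coefficients more comparable across metrics. A minimal sketch with synthetic data and hypothetical metric names (la, ld, nf), not the paper's actual features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic training data: rows are files, columns are three (hypothetical) metrics
X = rng.random((200, 3))
y = rng.integers(0, 2, 200)

# Standardizing first makes the beta-coefficients roughly comparable across
# metrics; otherwise their magnitude depends on each metric's unit and range.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# exp(beta) is the odds ratio: the multiplicative change in the odds of a file
# being defective for a one-standard-deviation increase in that metric.
for name, beta in zip(["la", "ld", "nf"], model.coef_[0]):
    print(f"{name}: beta={beta:+.3f}, odds ratio={np.exp(beta):.3f}")
```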

Page 31: Content: " have relied on issues marked as a 'bug' or 'enhancement' to count bugs or enhancements" Comment: Which metrics leverage information about enhancements?

Page 32: Content: " took precaution to remove any pull merge requests from the commits to remove any extra contributions added to the hero programmer." Comment: Any details about this?

Page 32: Content: " process metrics generate better predictors than process metrics" Comment: Something seems wrong in this sentence.

Suvodeep90 commented 3 years ago
  1. Add data statistics