In p.1, l.27, what did not hold should be stated in addition to what held.
In p.11, l.43, the number of projects before selection should be reported. This information is useful for judging how large the set of 700 projects (i.e., those that could be analyzed with confidence) is relative to the overall GitHub space.
In Section 3.2, l.34, the statement "The performance of the default parameters was so promising…" is not confirmed by the following sections. In particular, SVM was not tuned and did not look promising. The usefulness of ensemble learning (i.e., RandomForest in this paper) is a key finding of the paper (p.3, l.15), so SVM must at least be tuned to defend that finding.
In p.16, l.26, the sentence "… in the order proposed by the learner," raises some questions. First, does it mean that the learners return a probability of being fault-prone?
If so, how was SVM configured? scikit-learn's SVC does not return probabilities by default; the option "probability" must be set to True. This configuration must be described in Section 3.2.1. If not, Section 3.4 must explain how the ordering used for Popt20 was defined. Second, if the learners returned a probability, how was it converted into a binary value?
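For concreteness, a minimal sketch of the configuration in question (assuming scikit-learn's SVC is the implementation behind the paper's SVM; variable names are illustrative only):

    from sklearn.svm import SVC

    # Default configuration: only predict() and decision_function() are
    # available; calling predict_proba() raises an AttributeError.
    clf_default = SVC()

    # Probability estimates (via Platt scaling) require the explicit flag:
    clf_prob = SVC(probability=True)
    # After clf_prob.fit(X_train, y_train),
    # clf_prob.predict_proba(X_test)[:, 1] yields fault-proneness scores
    # that could be used to rank files for Popt20, while predict() still
    # returns the binary labels used for recall/pf.

If probability=True was indeed used, stating it in Section 3.2.1 would resolve both questions above.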
In p.18, RQ1, it is unclear what data was used for plotting Figs. 3-5. The cross-validation study repeated 5-fold cross-validation 5 times and yielded 5 x 700 results for each metrics setting. The release study had 3 releases to be tested and yielded 3 x 700 results for each metrics setting. Each boxplot in the figures appears to be based on 700 projects. What summarization or aggregation was therefore applied to the results?
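To make the question concrete, one plausible aggregation (purely hypothetical; the data layout is illustrative) would reduce the repeated results to one value per project before drawing each boxplot:

    import pandas as pd

    # results: one row per (project, repetition) with the evaluation score,
    # e.g. 5 x 700 rows for the cross-validation study (toy data below).
    results = pd.DataFrame({
        "project": ["p1", "p1", "p2", "p2"],
        "score":   [0.71, 0.75, 0.62, 0.66],
    })

    # One value per project (here the median over repetitions), giving the
    # 700 points a single boxplot could be based on.
    per_project = results.groupby("project")["score"].median()

Whether the paper used the median, the mean, or pooled all results should be stated explicitly.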
In p.23, RQ5, it is unclear how the correlation between two consecutive releases was calculated. A file can be updated multiple times between releases, so each of the two consecutive release data sets can contain multiple instances attributed to the same file. There are several possible ways to combine and pair these instances when computing a correlation. How the correlation was calculated must therefore be detailed.
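For illustration, one possible (but by no means the only) procedure is sketched below; the column names, the per-file mean aggregation, and the Spearman coefficient are assumptions, not the paper's method:

    import pandas as pd

    # rel_a, rel_b: instance-level data of two consecutive releases; a file
    # may occur several times in each. Column names are hypothetical.
    def release_correlation(rel_a, rel_b, metric="churn"):
        # One choice among several: collapse multiple instances per file
        # by averaging before pairing the releases.
        a = rel_a.groupby("file")[metric].mean()
        b = rel_b.groupby("file")[metric].mean()
        # Another choice: keep only files present in both releases.
        paired = pd.concat([a, b], axis=1, keys=["a", "b"], join="inner")
        return paired["a"].corr(paired["b"], method="spearman")

Each of these choices (per-file aggregation, handling of unmatched files, correlation coefficient) can change the result, which is why the procedure needs to be spelled out.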
In p.26, RQ6-7, it is unclear how training instances and test instances were linked. As described above, there are several possible combinations if multiple instances are attributed to the same file. Also, some files appear only in either the training set or the test set. How these cases were treated must be detailed.
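Again for illustration only, a hypothetical file-level linkage that makes these cases visible (column names are assumptions):

    import pandas as pd

    # train, test: instance-level data; "file" identifies the source file
    # and "defective" is the label. Column names are hypothetical.
    def link_by_file(train, test):
        # One possible reduction: a file counts as defective if any of its
        # instances is defective.
        t = train.groupby("file")["defective"].max()
        s = test.groupby("file")["defective"].max()
        # An outer join makes files that exist on only one side show up as
        # NaN; how such files were treated in RQ6-7 should be stated.
        return t.to_frame("train").join(s.to_frame("test"), how="outer")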
In p.29, l.23, it looks inappropriate to use logistic regression. The focus here is the difference between the large-scale study and the small-scale study, and using a different learner adds noise to the comparison. The analysis must be carried out with Random Forests, not logistic regression.
The following typos and mistakes should also be fixed (the list is not exhaustive):
P.3, l.16 "had" -> "hand"
P.4, l.45 "in-term" -> "in terms"
P.5, l.26 "then" -> "than"
P.9, l.25 "that" -> "than"
P.17, l.48 "AUC" -> "Popt20"
P.18, l.17 "AUC" -> "Popt20"
P.18, l.48 "… where process but…" -> "… where using and…"
P.20, l.40 "AUC" -> "Popt20"
P.23, l.18 "build" -> "built"
P.23, l.28 "build" -> "built"
P.26, l.47 "defective in training and not in test set" -> "defective in test set and not in training set"
P.27, Fig. 10: Add labels "recall" and "pf" to Y-axis
P.29, l.9 "left to right" -> "bottom to top"