ai-se / process-product

Process Vs Product Metrics

Reviewer 2 #3

Open Suvodeep90 opened 3 years ago

Suvodeep90 commented 3 years ago
  1. In the introduction, the authors say that calculating the product metrics for 700k commits required 500 CPU-days, while the process metrics required 10 times less. However, it is not clear how many metrics are considered in each case. Moreover, it is quite strange that the product metrics require 10 times more than the process metrics: usually, many process metrics can be built on top of the product metrics by tracking the evolution of the latter over time. Are you also counting the time to extract the product metrics? Did you investigate why, in your case, the relation was surprisingly the opposite? Have you adopted a cached approach to avoid recalculating product metrics every time (one possible cached approach is sketched after this list)? Perhaps this is just an engineering matter and not crucial, but clarifying it would help readers understand the adopted approach.
  2. My second concern regards the prediction granularity. At what granularity is the prediction performed? From the methodology it is not clear at what granularity level the authors evaluate defect prediction. What do you consider defective: an entire package, a class, a file, a method, or a line? The granularity of the prediction drastically impacts the quality of the results. Of course, fine-grained granularity is what developers wish for, but on the other hand it is challenging to achieve. For example, see *calikli2009effect.
  3. My third concern regards the outcome and the lessons learned from this study. I have the impression that the interest of this study is limited by its nature. The authors mainly focus on classical machine learning classifiers such as SVM, Naive Bayes, Logistic Regression, and Random Forest, excluding deep learning approaches that seem more promising for a large-scale study; see yang2015deep. In addition, that work is merely cited in the "3.4 Evaluation Criteria" section with regard to recall, rather than for its use of deep learning. To know more about the topic, see also chen2019deepcpdp, hoang2019deepjit, qiao2020deep. Since the majority of the effort for conducting this study has already been spent (the metrics are already extracted), it would be fantastic to extend it by comparing the current in-the-large results with a deep-learning model that leverages the availability of a huge amount of training data (a baseline sketch follows this list).
  4. Another limitation may affect the sets of metrics. The authors arbitrarily decide to investigate only the debate between product and process metrics, but it may also be interesting to consider other kinds of metrics that may influence performance in-the-large. For example, radjenovic2013software, pascarella2020performance, and *li2018progress suggest and compare other sets of metrics to understand whether different sets can achieve different results. Have you considered extending your pool of metrics with additional, non-conventional metrics?
  5. An additional point that may require clarification regards the overlap between product and process metrics. What is the role of these two sets of metrics when they are meant to capture different aspects of the software? To what extent do process-based results overlap product-based results, and vice versa? In other words, how many defects are caught by only a single set of metrics?
  6. Some clarifications are needed from the validation perspective. The authors use cross- and release-based validation strategies. That is good for giving an overview of the model behavior; however, while the former is useful for getting a rapid summary of the results, it may be misleading during the training and testing phases. See *pascarella2020performance. It would be outstanding to read more about the countermeasures used to address such a limitation (a time-ordered split is sketched after this list).
  7. The authors use several filtering criteria, such as a minimum number of commits, pull requests, issues, etc. However, some of the criteria seem not properly justified and appear arbitrarily chosen. Since the goal of the study is to understand the behavior of metrics in-the-large, I would like to read more about the reasons behind these choices, which contribute to the definition of what is considered "large". For instance, the authors select as representative projects for a large-scale study all those projects that have at least 20 commits, 10 defective commits, and 50 weeks of development. A project sitting exactly at these thresholds would have half of its commits defective and roughly one commit every two to three weeks. Is such a project representative of a project in a good state? See also *bird2009promises for git-related considerations.
  8. What about forked projects? On GitHub, public projects can be forked by any user, which inflates the number of mined projects if forks are not removed from the query results. How do you deal with forked projects? Do those 700 query results include forks? How do you distinguish the original project from its forks (see the fork-filtering sketch below)?
  9. Release selection. How do you use the GitHub API to select releases? Do you also consider tags and branches? What kind of heuristic do you use to identify releases (also sketched below)?
  10. Finally, while I generally appreciate this work's results, it leaves me with some doubts that I wish to clarify before battling in its favor. In conclusion, the authors claim that this work can shed light on which machine learning method is more promising when used in a realistic in-the-large scenario. However, given the ease of use, the ever-increasing computational power, and the huge availability of data (which is also the strongest point of this paper), neural networks seem to be the future in defect prediction as well; see *wang2018fast. Evaluating a neural network model may allow the authors to extend their advice on which prediction model to choose when designing defect prediction tools.
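
On point 1, purely as an illustration of the kind of cached approach the question asks about (not the paper's actual pipeline): since a commit usually changes only a few files, keying a cache on the git blob hash means identical file contents are analyzed once across all 700k commits. The `compute_file_metrics` placeholder and the `.java` filter below are assumptions; the real static-metric extractor would plug in there.

```python
# Sketch: cache product metrics per git blob so unchanged files are never re-analyzed.
# Assumes a local clone at REPO; compute_file_metrics() is a placeholder for the real
# static-metric extractor, and the .java filter is an assumption about the subject projects.
import subprocess
from functools import lru_cache

REPO = "path/to/clone"   # assumption: the repository is already cloned locally

def git(*args: str) -> str:
    return subprocess.run(["git", "-C", REPO, *args],
                          capture_output=True, text=True, check=True).stdout

def compute_file_metrics(source: str) -> dict:
    # Placeholder product metrics; plug the real extractor in here.
    return {"loc": len(source.splitlines())}

@lru_cache(maxsize=None)
def blob_metrics(blob_sha: str) -> tuple:
    # Keyed on the blob hash: identical file contents are analyzed exactly once,
    # no matter how many commits they appear in. A persistent cache (e.g. a shelve
    # or key-value store) would replace lru_cache for a 700k-commit run.
    source = git("cat-file", "-p", blob_sha)
    return tuple(sorted(compute_file_metrics(source).items()))

def commit_metrics(commit_sha: str) -> dict:
    # `git ls-tree -r` lists "<mode> <type> <sha>\t<path>" for every file in the commit.
    metrics = {}
    for line in git("ls-tree", "-r", commit_sha).splitlines():
        info, path = line.split("\t")
        blob_sha = info.split()[2]
        if path.endswith(".java"):
            metrics[path] = dict(blob_metrics(blob_sha))
    return metrics
```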
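On points 3 and 10, since the metric tables are already extracted, one low-cost first comparison would be a plain feed-forward network trained on the same feature matrix as the classical learners. This is only a sanity-check baseline on tabular metrics, not a replication of yang2015deep or hoang2019deepjit (which learn from code and commit content); the `X`/`y` arrays below are placeholders for the real data.

```python
# Sketch: feed-forward baseline on the already-extracted metric table.
# X and y below are placeholders for the same feature matrix and 0/1 defect labels
# used for the SVM / Naive Bayes / Logistic Regression / Random Forest runs.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 14)         # placeholder metric matrix
y = np.random.randint(0, 2, 1000)    # placeholder defect labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Recall()])
model.fit(scaler.transform(X_tr), y_tr, epochs=20, batch_size=64, verbose=0)
print(model.evaluate(scaler.transform(X_te), y_te, verbose=0))
```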
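On point 6, a common countermeasure is to treat cross-validation as a summary view only and make release-based, time-ordered splits the primary evaluation, so the model never trains on commits that post-date its test release. A minimal sketch, assuming a hypothetical `release_index` column that orders samples chronologically:

```python
# Sketch: release-based (time-ordered) evaluation, so no model trains on future data.
# Assumes a dataframe with metric columns, a 0/1 "buggy" label, and a hypothetical
# "release_index" column giving the chronological release each row belongs to.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

def release_based_eval(df: pd.DataFrame, feature_cols, label="buggy"):
    scores = []
    releases = sorted(df["release_index"].unique())
    for i in range(1, len(releases)):
        train = df[df["release_index"] < releases[i]]    # all earlier releases
        test = df[df["release_index"] == releases[i]]    # the next release only
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train[feature_cols], train[label])
        scores.append(recall_score(test[label], clf.predict(test[feature_cols])))
    return scores

# Tiny synthetic example.
df = pd.DataFrame({"la": range(12), "lt": range(12, 24),
                   "buggy": [0, 1] * 6, "release_index": [0]*4 + [1]*4 + [2]*4})
print(release_based_eval(df, ["la", "lt"]))
```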
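On point 8, the GitHub REST API marks forks explicitly: the repository object returned by `GET /repos/{owner}/{repo}` carries a boolean `fork` field and, for forks, a `parent` object pointing at the original. A minimal filtering sketch, assuming a token in `GITHUB_TOKEN`; the candidate list is hypothetical:

```python
# Sketch: drop forked repositories using the GitHub REST API's "fork" flag.
# Assumes a personal access token in the GITHUB_TOKEN environment variable;
# the candidate list is purely hypothetical.
import os
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

def is_fork(full_name: str) -> bool:
    repo = requests.get(f"{API}/repos/{full_name}", headers=HEADERS).json()
    return bool(repo.get("fork"))   # forks also carry a "parent" object pointing at the original

candidates = ["apache/commons-lang", "someuser/commons-lang"]   # hypothetical
originals = [name for name in candidates if not is_fork(name)]
print(originals)
```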
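On point 9, the API separates published releases (`GET /repos/{owner}/{repo}/releases`) from raw tags (`GET /repos/{owner}/{repo}/tags`). A minimal sketch that prefers releases and falls back to version-like tag names; the fallback regex is an assumption, not the heuristic actually used in the study:

```python
# Sketch: select release points from the GitHub API, falling back to version-like tags.
# Assumes GITHUB_TOKEN is set; the tag-name regex is an assumption, not the study's heuristic.
# Pagination is omitted for brevity (the API returns 30 items per page by default).
import os
import re
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

def release_points(full_name: str):
    releases = requests.get(f"{API}/repos/{full_name}/releases", headers=HEADERS).json()
    if releases:                     # published releases exist
        return [r["tag_name"] for r in releases]
    tags = requests.get(f"{API}/repos/{full_name}/tags", headers=HEADERS).json()
    return [t["name"] for t in tags if re.match(r"v?\d+(\.\d+)*$", t["name"])]

print(release_points("apache/commons-lang"))
```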
Suvodeep90 commented 3 years ago

5 -> Maybe add a new RQ showing what percentage of bugs can be captured only by process metrics or only by product metrics (a sketch of that computation follows).
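
A minimal sketch of how that percentage could be computed, assuming `proc_pred` and `prod_pred` are aligned 0/1 predictions of the process-only and product-only models on the same test commits, with ground truth `y_true` (the arrays below are placeholders):

```python
# Sketch: what fraction of true defects is caught only by one family of metrics.
# y_true, proc_pred and prod_pred are placeholders for aligned 0/1 arrays over the
# same test commits (labels, process-model predictions, product-model predictions).
import numpy as np

y_true = np.array([1, 1, 1, 0, 1, 0])
proc_pred = np.array([1, 0, 1, 0, 1, 1])
prod_pred = np.array([0, 1, 1, 0, 0, 0])

defects = set(np.flatnonzero(y_true == 1))
by_proc = set(np.flatnonzero(proc_pred == 1)) & defects
by_prod = set(np.flatnonzero(prod_pred == 1)) & defects

only_proc = len(by_proc - by_prod) / len(defects)
only_prod = len(by_prod - by_proc) / len(defects)
both = len(by_proc & by_prod) / len(defects)
print(f"process-only {only_proc:.0%}, product-only {only_prod:.0%}, "
      f"both {both:.0%}, missed {1 - only_proc - only_prod - both:.0%}")
```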

Suvodeep90 commented 3 years ago

@inproceedings{calikli2009effect, title={The effect of granularity level on software defect prediction}, author={Calikli, Gul and Tosun, Ayse and Bener, Ayse and Celik, Melih}, booktitle={2009 24th International Symposium on Computer and Information Sciences}, pages={531--536}, year={2009}, organization={IEEE} }

@inproceedings{yang2015deep, title={Deep learning for just-in-time defect prediction}, author={Yang, Xinli and Lo, David and Xia, Xin and Zhang, Yun and Sun, Jianling}, booktitle={2015 IEEE International Conference on Software Quality, Reliability and Security}, pages={17--26}, year={2015}, organization={IEEE} }

@article{chen2019deepcpdp, title={Deepcpdp: Deep learning based cross-project defect prediction}, author={Chen, Deyu and Chen, Xiang and Li, Hao and Xie, Junfeng and Mu, Yanzhou}, journal={IEEE Access}, volume={7}, pages={184832--184848}, year={2019}, publisher={IEEE} }

This approach directly uses the code to extract features, so it is not very useful for comparison in the process vs. product debate.

@inproceedings{hoang2019deepjit, title={DeepJIT: an end-to-end deep learning framework for just-in-time defect prediction}, author={Hoang, Thong and Dam, Hoa Khanh and Kamei, Yasutaka and Lo, David and Ubayashi, Naoyasu}, booktitle={2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)}, pages={34--45}, year={2019}, organization={IEEE} }

@article{qiao2020deep, title={Deep learning based software defect prediction}, author={Qiao, Lei and Li, Xuesong and Umer, Qasim and Guo, Ping}, journal={Neurocomputing}, volume={385}, pages={100--110}, year={2020}, publisher={Elsevier} }

@article{radjenovic2013software, title={Software fault prediction metrics: A systematic literature review}, author={Radjenovi{\'c}, Danijel and Heri{\v{c}}ko, Marjan and Torkar, Richard and {\v{Z}}ivkovi{\v{c}}, Ale{\v{s}}}, journal={Information and software technology}, volume={55}, number={8}, pages={1397--1418}, year={2013}, publisher={Elsevier} }

@article{pascarella2020performance, title={On the performance of method-level bug prediction: A negative result}, author={Pascarella, Luca and Palomba, Fabio and Bacchelli, Alberto}, journal={Journal of Systems and Software}, volume={161}, pages={110493}, year={2020}, publisher={Elsevier} }

@article{li2018progress, title={Progress on approaches to software defect prediction}, author={Li, Zhiqiang and Jing, Xiao-Yuan and Zhu, Xiaoke}, journal={IET Software}, volume={12}, number={3}, pages={161--175}, year={2018}, publisher={IET} }

@inproceedings{bird2009promises, title={The promises and perils of mining git}, author={Bird, Christian and Rigby, Peter C and Barr, Earl T and Hamilton, David J and German, Daniel M and Devanbu, Prem}, booktitle={2009 6th IEEE International Working Conference on Mining Software Repositories}, pages={1--10}, year={2009}, organization={IEEE} }

@article{zhang2017machine, title={From machine learning to deep learning: progress in machine intelligence for rational drug discovery}, author={Zhang, Lu and Tan, Jianjun and Han, Dan and Zhu, Hao}, journal={Drug discovery today}, volume={22}, number={11}, pages={1680--1685}, year={2017}, publisher={Elsevier} }

@article{wang2018fast, title={A fast and robust convolutional neural network-based defect detection model in product quality control}, author={Wang, Tian and Chen, Yang and Qiao, Meina and Snoussi, Hichem}, journal={The International Journal of Advanced Manufacturing Technology}, volume={94}, number={9-12}, pages={3465--3471}, year={2018}, publisher={Springer} }