Open Suvodeep90 opened 4 years ago
"The logistic regression learner (since it is relatively fast)". What does "fast" mean? How was "fast" computed? 1) Here, "fast" means that the model takes less time to build a predictor from the same data source than other ML models such as random forest or support vector machine. 2) Need to cite that LR is a learner of choice for many defect prediction and transfer learning approaches.
"The SMOTE class imbalance correction algorithm [49], which we run on the training data2". Why is SMOTE needed? Why is the problem unbalanced? Please, explain. 1) SMOTE is needed because the datasets are imbalanced. SMOTE is a *** technique to handle data imbalance problems. 2) The datasets are imbalanced because the collected projects had either too few or too many defective modules. This can result in a high pf or low recall; if we use SMOTE to balance the dataset, there is a better chance of building a better model. Cite Amrut's "better data than better data miner" paper and other papers in SE/defect prediction to show that SMOTE helps.
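To make the SMOTE answer concrete, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE: synthetic minority rows are placed on the line segment between a minority row and one of its nearest minority neighbours. It is an illustration only, not the implementation used in the paper; the function name `smote_sketch`, the toy rows, and the parameter values are all hypothetical. As in the quoted sentence, it should be applied to the training data only.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority rows")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy training data: 3 defective rows vs. 6 clean rows (two metrics each).
defective = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
clean_count = 6
new_rows = smote_sketch(defective, n_new=clean_count - len(defective), k=2)
balanced_defective = defective + new_rows
```

After this step the defective class has as many rows as the clean class, and every synthetic row lies between two real defective rows.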
"and Hall’s CFS feature selector". Which features have been removed and why? Explain CFS. The features which have been removed are the ones which don't increase the correlation-based subset evaluation function used in CFS. This means the removed attributes are the ones that either correlate weakly with the class variable or are highly correlated with (and so redundant to) attributes already selected.
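The subset evaluation function mentioned above can be sketched as follows. Hall's CFS scores a subset of k features as k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below assumes Pearson correlation as the correlation measure (Hall's work also uses other measures for discrete data), and the metric names and values are hypothetical toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def cfs_merit(columns, target):
    """Merit of a feature subset: high class correlation, low inter-correlation."""
    k = len(columns)
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy data: class label plus three candidate metrics.
defect = [0, 0, 1, 1, 0, 1]
loc = [10, 12, 40, 45, 11, 50]        # correlates with the class
loc_copy = [20, 24, 80, 90, 22, 100]  # redundant: exactly 2 * loc
noise = [1, 2, 1, 2, 3, 3]            # uncorrelated with the class

merit_loc = cfs_merit([loc], defect)
merit_with_copy = cfs_merit([loc, loc_copy], defect)
merit_with_noise = cfs_merit([loc, noise], defect)
```

In this toy case neither the redundant copy nor the uncorrelated metric raises the merit above that of `loc` alone, so a search guided by this function would not add them.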
"As to CFS, we found that without it, our recalls were very low and we could not identify which metrics mattered the most". Where can the reader understand this statement? Where is the replication data?
Not sure what "Where can the reader understand this statement?" means, but the replication package with data has been included and is mentioned in the contribution section.
"extensive studies have found that CFS more useful than many other feature subset selection methods such as PCA or InfoGain or RELIEF". The paper cites just one paper, how should these studies be "extensive"? Need to include more citations.
"Maximize recall and precision and popt(20)": Why should these metrics be maximized? What is the practical value, e.g., is high recall always needed in practice, or should we prefer precision in the problems treated by the bellwether? What about popt(20)?
Need to include the definitions of recall, precision and popt(20) and show we need to maximize these to get better model performance.
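As a starting point for those definitions, recall and precision can be computed from the confusion matrix as sketched below: recall = TP / (TP + FN) and precision = TP / (TP + FP). popt(20) is left out because its exact formulation should follow the paper's own definition. The labels and predictions are hypothetical toy data (1 = defective, 0 = clean).

```python
def confusion(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 means defective."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Fraction of truly defective modules that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of flagged modules that are truly defective."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy labels and predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 0, 1]
tp, fp, tn, fn = confusion(actual, predicted)
toy_recall = recall(tp, fn)
toy_precision = precision(tp, fp)
```

High recall means few defective modules are missed; high precision means little inspection effort is wasted on clean modules.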
"While minimizing false alarms and ifa auc": Again, please explain the practical relevance of the performance metrics employed.
Need to include the definitions of false alarms and ifa auc and show we need to minimize these to get better model performance.
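A sketch of the two "minimize" metrics, under stated assumptions: the false alarm rate is pf = FP / (FP + TN), and IFA is taken here to be the number of clean modules a developer inspects, in order of predicted risk, before reaching the first truly defective one. The paper's exact IFA formulation may differ (for example, it may rank only modules predicted as defective), so treat this as illustrative. The labels and scores are hypothetical toy data.

```python
def false_alarm_rate(fp, tn):
    """pf: fraction of clean modules that are wrongly flagged as defective."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def initial_false_alarms(actual, scores):
    """Count clean modules ranked above the first truly defective module."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    count = 0
    for _, label in ranked:
        if label == 1:
            return count
        count += 1
    return count  # no defective module exists in the ranking


# Hypothetical toy data: true labels and predicted defect probabilities.
actual = [0, 1, 0, 1, 0]
scores = [0.90, 0.80, 0.20, 0.70, 0.85]
toy_ifa = initial_false_alarms(actual, scores)
toy_pf = false_alarm_rate(fp=1, tn=3)
```

Both values should be low: a high pf or a high IFA means developers spend their first inspections on modules that turn out to be clean.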
More in general, how can one replicate the performed study? Is there a replication package? If so, where? More details about the method need to be written; the replication package is already referenced in the contribution section.
The clarity and replicability of the paper are unclear. Here are some examples: