RAISEDAL / RAISEReadingList

This repository contains a reading list of Software Engineering papers and articles!

Paper Review: Code Comment Inconsistency Detection Based on Confidence Learning #82

Open mehilshah opened 1 month ago

mehilshah commented 1 month ago

Publisher

TSE - 2024

Link to The Paper

https://ieeexplore.ieee.org/abstract/document/10416264

Name of The Authors

Zhengkang Xu; Shikai Guo; Yumiao Wang; Rong Chen; Hui Li; Xiaochen Li; He Jiang

Year of Publication

2024

Summary

This paper proposes a novel approach called MCCL (Method-Comment Confidence Learning) to accurately detect inconsistencies between code and comments. MCCL consists of two main components: (1) Method Comment Detection (MCD) and (2) Confidence Learning Denoising (CLD). The MCD component employs sequence, AST, and multi-head attention mechanisms to capture the intricate relationships between code changes and comments. The CLD component identifies and removes labelling errors and characterization noise from the dataset to enhance the quality of the training data. Experiments conducted on a public corpus of 40,688 examples from 1,518 open-source Java projects demonstrate that MCCL outperforms state-of-the-art methods, achieving an average F1-score of 82.6% and an average Accuracy of 83.7%.
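To make the detection component concrete, here is a minimal PyTorch sketch of a detector in the spirit of MCD: a Bi-GRU over the comment, a Bi-GRU over the code-edit token sequence, and multi-head cross-attention between them feeding a binary classifier. This is my own simplification for illustration, not the authors' implementation; the class name and dimensions are hypothetical, and the paper's AST/GGNN branch is omitted.

```python
import torch
import torch.nn as nn

class InconsistencyDetector(nn.Module):
    """Simplified MCD-style sketch: Bi-GRU encoders for the comment and the
    code-edit sequence, fused with multi-head cross-attention.
    (The paper's GGNN branch over the AST is omitted for brevity.)"""

    def __init__(self, vocab_size=10_000, emb_dim=128, hidden=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.comment_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.edit_enc = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.cross_attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # consistent vs. inconsistent

    def forward(self, comment_ids, edit_ids):
        c, _ = self.comment_enc(self.embed(comment_ids))  # (B, Tc, 2H)
        e, _ = self.edit_enc(self.embed(edit_ids))         # (B, Te, 2H)
        # Let the comment representation attend over the code-edit representation.
        fused, _ = self.cross_attn(query=c, key=e, value=e)
        pooled = fused.mean(dim=1)                          # simple mean pooling
        return self.classifier(pooled)                      # logits per class
```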

Contributions of The Paper

Comments

  1. I have severe doubts about the methodology. It seems very similar to the work by Panthaplackel et al. ("Deep just-in-time inconsistency detection between comments and source code"). The representation claimed as a contribution was already used in that earlier work.
  2. The paper's MCD follows the exact same schema as Panthaplackel et al.'s: the use of edit actions to represent the code change, and of a Bi-GRU and a GGNN to encode the source code sequence and the AST, respectively, seem derivative.
  3. CLD seems relatively novel. The MCD component is first trained on the original noisy dataset and outputs, for each example, a prediction probability indicating how likely the example is to be positive (i.e., an inconsistent code-comment pair). These probabilities and the original labels are then fed into the CLD component, which applies probabilistic thresholding: examples whose prediction probabilities deviate significantly from their original labels are treated as noisy. CLD also ranks examples by their prediction probabilities and assigns higher weights to those whose probabilities agree more closely with their labels, as these are more likely to be clean. The noisy examples flagged by thresholding are removed, the remaining examples are re-weighted according to the ranking, and the cleaned, re-weighted dataset is used to retrain the MCD component so that it learns from higher-quality examples (see the sketch after this list).
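
As a rough illustration of the denoising step, here is a generic confidence-learning sketch in NumPy. It is not the paper's exact CLD algorithm: the per-class thresholding rule and the weighting scheme are my own assumptions in the usual confident-learning style, and the function and variable names (`confidence_learning_denoise`, `mcd_probs`, `noisy_labels`) are hypothetical.

```python
import numpy as np

def confidence_learning_denoise(pred_probs, labels):
    """Generic confidence-learning denoising sketch (not the paper's exact
    algorithm): drop examples whose predicted probability for their given
    label falls below that class's average confidence, and weight the
    survivors by how strongly the prediction agrees with the label."""
    pred_probs = np.asarray(pred_probs)  # shape (N, 2): P(consistent), P(inconsistent)
    labels = np.asarray(labels)          # shape (N,): 0 or 1 (original, possibly noisy, labels)

    # Per-class threshold: mean predicted probability among examples
    # that carry that label (the usual confident-learning heuristic).
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in (0, 1)])

    # Keep an example if its predicted probability for its own label is at
    # least that class's threshold; otherwise treat it as a noisy example.
    own_prob = pred_probs[np.arange(len(labels)), labels]
    keep = own_prob >= thresholds[labels]

    # Re-weight the kept examples: closer agreement with the label -> higher weight.
    weights = own_prob[keep] / own_prob[keep].sum()
    return keep, weights

# Usage (hypothetical names): retrain the detector on the kept, re-weighted examples.
# keep, w = confidence_learning_denoise(mcd_probs, noisy_labels)
```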

The replication package was made available (as stated in the paper), but it has since been deleted and the link now returns a 404, which further raises my doubts about this work. I'm not sure why the replication package is no longer available.