Name of The Authors
Wang, Wenhan, Yanzhou Li, Anran Li, Jian Zhang, Wei Ma, and Yang Liu
Year of Publication
2024
Summary
This paper conducts a comprehensive empirical study on how noisy labels impact deep learning models for program understanding tasks, and evaluates the effectiveness of various noisy label learning (NLL) approaches in improving model robustness and detecting mislabeled samples.
The study covers three different program understanding tasks:
Program classification (classifying programs into categories)
Vulnerability detection (classifying code as vulnerable or not)
Code summarization (generating a natural language summary for code)
For the program classification task, the authors inject two types of synthetic label noise (random and flip) into a clean dataset and study the impact on model performance both with and without NLL approaches.
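For concreteness, below is a minimal sketch of how these two synthetic noise types could be injected into a labeled classification dataset. The function names, the default 20% noise ratio, and the choice of flipping each class to its "next" class are illustrative assumptions, not details taken from the paper's replication package.

```python
import random

def inject_random_noise(labels, num_classes, noise_ratio=0.2, seed=0):
    """Random noise: replace a fraction of labels with a uniformly sampled *different* class."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in range(len(noisy)):
        if rng.random() < noise_ratio:
            other_classes = [c for c in range(num_classes) if c != noisy[i]]
            noisy[i] = rng.choice(other_classes)
    return noisy

def inject_flip_noise(labels, num_classes, noise_ratio=0.2, seed=0):
    """Flip noise: move a fraction of labels to one fixed partner class (here: the next class)."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in range(len(noisy)):
        if rng.random() < noise_ratio:
            noisy[i] = (noisy[i] + 1) % num_classes
    return noisy

# Example: inject 10% random noise into a toy 5-class label list.
print(inject_random_noise([0, 1, 2, 3, 4] * 4, num_classes=5, noise_ratio=0.1))
```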
For vulnerability detection and code summarization, they evaluate NLL on datasets that contain real-world label noise. The study includes evaluations on both small trained-from-scratch neural networks as well as large pre-trained transformer models frequently used in software engineering.
Key Findings:
Small trained-from-scratch models are susceptible to label noise in program understanding tasks, while large pre-trained models are more robust
NLL approaches significantly improve program classification accuracy for small models on noisy training data but only provide slight benefits for large pre-trained models
NLL can effectively detect synthetic label noise but struggles more with detecting the real-world noise in the datasets studied (a common detection heuristic is sketched below)
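As a point of reference for the detection finding above, here is a minimal sketch of the small-loss heuristic that many NLL methods build on: the samples the model fits worst are flagged as likely mislabeled. This is a generic illustration under an assumed noise ratio and per-sample training losses, not necessarily one of the approaches evaluated in the paper.

```python
import numpy as np

def flag_suspected_noise(per_sample_losses, assumed_noise_ratio):
    """Return indices of the highest-loss samples, treated as suspected label noise."""
    losses = np.asarray(per_sample_losses, dtype=float)
    num_flagged = int(round(assumed_noise_ratio * len(losses)))
    # Samples with the largest training loss are the most likely to be mislabeled.
    return np.argsort(losses)[::-1][:num_flagged]

# Example: with an assumed 25% noise ratio, the two highest-loss samples are flagged.
print(flag_suspected_noise([0.1, 2.3, 0.4, 1.9, 0.2, 0.3, 0.5, 0.6], 0.25))
```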
Contributions of The Paper
This is the first comprehensive empirical study of noisy label learning covering both classification and generation-style program understanding tasks; previous works focused only on classification.
The study evaluates NLL approaches on improving downstream task performance in addition to just detecting noisy samples, providing a more complete picture of their effectiveness.
By covering both synthetic and real-world label noise, small and large models, and multiple tasks, the study provides insights into the strengths and limitations of existing NLL methods when applied to software engineering.
The findings can help guide researchers on when NLL may be beneficial and shed light on areas for future work in tackling label noise in software engineering datasets.
Comments
Very important for our work
Good replication package with relevant techniques and models for our research.
This, in conjunction with the MSR paper (#84), will be the base for the next work!
Look at this paper for the experimental setup and how to structure the experiments.
Better understand real-world noise and build a technique to detect and address it in the datasets used in this paper!
Publisher
ICSE
Link to The Paper
https://dl.acm.org/doi/abs/10.1145/3597503.3639217