Summary
The study by Croft et al. addresses the crucial issue of label noise in software vulnerability datasets and its impact on the performance of Software Vulnerability Prediction (SVP) models. The authors argue that obtaining perfect ground-truth labels for software vulnerabilities is infeasible in practice, owing to the difficulty of verifying the absence of vulnerabilities and the limited manual effort available. They investigate the effectiveness of Noisy Label Learning (NLL) techniques for handling label noise and improving SVP models.
The authors manually curated a dataset from Mozilla Firefox covering 22 releases over three years of development. They identified reported vulnerabilities and manually verified the affected files in prior releases. They also sought noisy labels within the nominally non-vulnerable code by considering latent vulnerabilities discovered in later releases. The dataset characterization revealed that label noise is substantial: over three times as many latent vulnerabilities were observed as reported ones.
The study evaluates various NLL techniques and proposes a two-stage learning method based on noise-cleaning. The first stage involves identifying and remediating noisy samples, while the second stage focuses on training the SVP model using the cleaned dataset. The proposed method improved the AUC and recall of baseline models by up to 8.9% and 23.4%, respectively.
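The two-stage idea can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' exact method: stage one uses out-of-fold predicted probabilities to flag negatives whose predictions strongly disagree with their given label (candidate latent vulnerabilities), and stage two retrains on the cleaned set. All names and thresholds here are illustrative assumptions.

```python
# Hedged sketch of a two-stage noise-cleaning pipeline on synthetic data
# (not the paper's exact method). Stage 1 flags likely mislabelled
# negatives via out-of-fold probabilities; Stage 2 retrains without them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y_true = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Simulate negatively skewed label noise: a third of the true positives
# (akin to latent vulnerabilities) are observed with a negative label.
y_noisy = y_true.copy()
pos_idx = np.flatnonzero(y_true == 1)
flipped = rng.choice(pos_idx, size=len(pos_idx) // 3, replace=False)
y_noisy[flipped] = 0

# Stage 1: out-of-fold probabilities expose label/prediction disagreement.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, y_noisy, cv=5, method="predict_proba")[:, 1]
suspect = (y_noisy == 0) & (proba > 0.5)  # likely mislabelled negatives

# Stage 2: remediate by dropping suspects (relabelling is another option),
# then train the final model on the cleaned dataset.
keep = ~suspect
clean_model = RandomForestClassifier(n_estimators=100, random_state=0)
clean_model.fit(X[keep], y_noisy[keep])
print(f"removed {suspect.sum()} suspect samples out of {len(y_noisy)}")
```

The 0.5 disagreement threshold is an arbitrary choice for illustration; in practice it would need tuning, especially under the heavy class imbalance the paper describes.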
The authors also discuss the instance-dependent nature of software vulnerability label noise, which is heavily skewed towards the negative class. They highlight the challenges in achieving an upper-bound performance even with semi-omniscient knowledge of label noise.
Contributions of The Paper
Comprehensive investigation and characterization of software vulnerability label noise
Pioneering study on the adoption of NLL methods in the software engineering domain
Proposal of a two-stage noisy-label learning approach based on noise-cleaning
Insights into the nature of software vulnerability label noise: The authors observe that the label noise is heavily skewed towards the negative class, with a higher prevalence of latent vulnerabilities than reported ones.
Comments
Very important paper for our future work!!
The observation that software vulnerability label noise is instance-dependent and skewed towards the negative class should be considered when selecting and adapting NLL techniques for our fault localization problem.
The discussion on the infeasibility of obtaining perfect ground truth labels highlights the importance of using NLL techniques to handle label noise in our fault localization dataset.
The findings on the impact of latent vulnerabilities on classifier decision boundaries underscore the need for using XAI mechanisms to interpret and validate the predictions of our fault localization model.
This paper offers several arguments that support our methodology and motivation: software vulnerability label noise is instance-dependent rather than instance-independent; label impurities cannot be estimated accurately with a noise transition matrix; the heavy skew towards the negative class must be accounted for by the chosen technique; and confident learning weakens when the classes are not well separable.
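To make the noise-transition-matrix point concrete, here is a small illustration with invented numbers (not figures from the paper): a class-conditional matrix T with T[i][j] = P(observed = j | true = i) can encode the negative skew, but by construction it assigns every sample of a class the same flip probability, which is exactly what instance-dependent noise violates.

```python
# Hedged illustration with synthetic numbers: a class-conditional noise
# transition matrix T, where T[i, j] = P(observed = j | true = i).
# The skew concentrates in row 1: true positives (vulnerable) are often
# observed as negatives (latent vulnerabilities), while clean negatives
# are rarely flipped. Instance-dependent noise falls outside this model,
# since the flip probability then varies per sample, not just per class.
import numpy as np

T = np.array([
    [0.99, 0.01],  # true negative -> almost always stays negative
    [0.40, 0.60],  # true positive -> frequently observed as negative
])
assert np.allclose(T.sum(axis=1), 1.0)  # rows are probability distributions

# Expected observed class distribution given an assumed 10% true positive rate
prior = np.array([0.9, 0.1])
observed = prior @ T
print(observed)  # the observed positive rate shrinks below the true 10%
```

The transition probabilities and the 10% prior are illustrative assumptions; the paper's argument is that no single such matrix fits real vulnerability data well, because the noise depends on the individual instance.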
Publisher
MSR
Link to The Paper
https://dl.acm.org/doi/abs/10.1145/3524842.3528446
Name of The Authors
Croft, Roland, M. Ali Babar, and Huaming Chen
Year of Publication
2022