ICL-ml4csec / VulBERTa

Simplified Source Code Pre-Training for Vulnerability Detection
MIT License

Your version of the VulDeePecker dataset might be incorrect #4

Closed niklasrisse closed 1 year ago

niklasrisse commented 1 year ago

Hey,

I reproduced your experiments and noticed that your version of the VulDeePecker dataset might be incorrect. According to my analysis, approximately 7.5k of the 16k functions in the test set have an exact duplicate in the training set. I think this might have been caused by an error during data collection: you may have extracted all methods from the original C/C++ files of the VulDeePecker dataset, which contain many simple supporting methods that should not be part of the dataset. In addition, the function names often reveal whether a function is vulnerable (e.g. 'goodB2G' indicates no vulnerability). Both issues make it much easier for a model to generalize from the training set to the test set than it would be on the original dataset. A minimal sketch of the duplicate check is included below.
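
For anyone who wants to reproduce the overlap check, here is a minimal sketch. The file paths and the `functionSource` column name are assumptions about the pickled-DataFrame layout used in this repo; adjust them to the actual dataset files:

```python
import pandas as pd

# Hypothetical paths and column name; adjust to the actual dataset layout.
train = pd.read_pickle("data/VulDeePecker_train.pkl")
test = pd.read_pickle("data/VulDeePecker_test.pkl")

def normalise(code: str) -> str:
    # Collapse whitespace so formatting differences don't hide exact copies.
    return " ".join(code.split())

train_funcs = {normalise(f) for f in train["functionSource"]}
dupes = sum(normalise(f) in train_funcs for f in test["functionSource"])

print(f"{dupes} of {len(test)} test functions also appear in the training set")
```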

hazimhanif commented 1 year ago

Hi.

Thanks for pointing out a potential problem with our version of the VulDeePecker dataset. The duplicates could stem from the nature of the original dataset itself. As you know, the VulDeePecker dataset is a combination of the Juliet Test Suite and the NVD dataset. The problem with the Juliet Test Suite (and SARD test suites in general) is that it contains simple synthetic test cases with obvious function names, e.g. "_CWE114_Process_Control__01bad", "badSource", "bad", etc.
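
To illustrate how strong that signal is, a trivial baseline that looks only at the function name, with no model at all, already separates the classes on Juliet-style samples. This is a hypothetical sketch; the substring checks are illustrative, not an exhaustive list of Juliet naming conventions:

```python
# Hypothetical name-only baseline: predict "vulnerable" purely from
# Juliet-style markers in the function name, ignoring the code body.
def predict_from_name(func_name: str) -> int:
    name = func_name.lower()
    if "bad" in name:
        return 1  # e.g. "_CWE114_Process_Control__01bad", "badSource"
    if "good" in name:
        return 0  # e.g. "goodB2G", "goodG2B"
    return 0  # fall back to the non-vulnerable majority class

print(predict_from_name("CWE114_Process_Control__01bad"))  # -> 1
print(predict_from_name("goodB2G"))                        # -> 0
```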

This already introduces unrealistic function naming and potentially label-leaking (good/bad) tokens into the model, and I agree that it makes it easier for the model to generalize. However, we decided to proceed as is, because we deliberately did not modify any of the raw code, including the function names: we wanted to see whether VulBERTa could overcome this kind of "obvious token" bias. Based on your analysis, we don't think it does.

With that being said, we also conclude that the reason VulBERTa performed well on the VulDeePecker dataset is that the model leveraged the aforementioned issues. This also raises broader concerns about overall dataset quality and the maturity of DL/ML models for vulnerability detection, especially for C/C++.

Thank you for the feedback.