Since deep learning (DL) can automatically learn features from source code, it has been widely used to detect source code vulnerabilities. To achieve scalable vulnerability scanning, some prior studies process source code directly by treating it as text. To achieve accurate detection, other approaches distill program semantics into graph representations and use them to detect vulnerabilities. In practice, text-based techniques are scalable but inaccurate because they discard program semantics, while graph-based methods are accurate but do not scale because graph analysis is typically time-consuming.
In this paper, we aim to make the scanning of large-scale source code for vulnerabilities both scalable and accurate. Inspired by DL-based image classification, which can analyze millions of images accurately, we adopt similar techniques for our purpose. Specifically, we propose a novel idea that efficiently converts the source code of a function into an image while preserving the program details. We implement VulCNN and evaluate it on a dataset of 13,687 vulnerable functions and 26,970 non-vulnerable functions. Experimental results show that VulCNN achieves better accuracy than eight state-of-the-art vulnerability detectors (i.e., Checkmarx, FlawFinder, RATS, TokenCNN, VulDeePecker, SySeVR, VulDeeLocator, and Devign). As for scalability, VulCNN is about four times faster than VulDeePecker and SySeVR, about 15 times faster than VulDeeLocator, and about six times faster than Devign. Furthermore, we conduct a case study on more than 25 million lines of code, and the results indicate that VulCNN can detect vulnerabilities at scale. From the scan reports, we discover 73 vulnerabilities that are not reported in the NVD.
VulCNN consists of four main phases: Graph Extraction, Sentence Embedding, Image Generation, and Classification.
We first collect a dataset from the Software Assurance Reference Dataset (SARD) (https://samate.nist.gov/SRD/index.php), a project maintained by the National Institute of Standards and Technology (NIST) (https://www.nist.gov/). SARD contains a large number of production, synthetic, and academic security flaws or vulnerabilities (i.e., bad functions) as well as many good functions. Since our paper focuses on detecting vulnerabilities in C/C++, we only select functions written in C/C++ from SARD. The data obtained from SARD consists of 12,303 vulnerable functions and 21,057 non-vulnerable functions.
Moreover, since the synthetic programs in SARD may not be realistic, we collect another dataset from real-world software. For real-world vulnerabilities, we use the National Vulnerability Database (NVD) (https://nvd.nist.gov) as our collection source, obtaining 1,384 vulnerable functions from different open-source software written in C/C++. For real-world non-vulnerable functions, we randomly select a subset of the dataset from "Deep learning-based vulnerable function detection: A benchmark", which contains non-vulnerable functions from several open-source projects. Our final dataset consists of 13,687 vulnerable functions and 26,970 non-vulnerable functions.
Normalize the code with normalization.py (this operation overwrites the data files in place, so make a backup first):
python ./normalization.py -i ./data/sard
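For reference, the normalization step strips comments and maps user-defined identifiers to symbolic names. The following is a minimal sketch of that idea; the exact rules live in normalization.py, and the regexes below are illustrative assumptions:

```python
import re

# Minimal sketch of identifier normalization (illustrative, not the exact
# rules used by normalization.py): strip comments and map each called
# function name to a symbolic name FUN1, FUN2, ...
def normalize(code: str) -> str:
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
    code = re.sub(r"//.*", "", code)                   # line comments
    keywords = {"if", "for", "while", "switch", "return", "sizeof"}
    fun_map = {}
    def rename_fun(match):
        name = match.group(1)
        if name in keywords:
            return match.group(0)
        fun_map.setdefault(name, f"FUN{len(fun_map) + 1}")
        return fun_map[name] + "("
    return re.sub(r"\b([A-Za-z_]\w*)\s*\(", rename_fun, code)
```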
Prepare the environment by referring to joern; versions between 1.1.995 and 1.1.1125 are known to work.
# first generate .bin files
python joern_graph_gen.py -i ./data/sard/Vul -o ./data/sard/bins/Vul -t parse
python joern_graph_gen.py -i ./data/sard/No-Vul -o ./data/sard/bins/No-Vul -t parse
# then generate pdgs (.dot files)
python joern_graph_gen.py -i ./data/sard/bins/Vul -o ./data/sard/pdgs/Vul -t export -r pdg
python joern_graph_gen.py -i ./data/sard/bins/No-Vul -o ./data/sard/pdgs/No-Vul -t export -r pdg
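Under the hood, the two modes presumably wrap Joern's standard CLI front ends; a sketch of the equivalent calls (the helper names here are ours, and joern-parse/joern-export must be on your PATH):

```python
import subprocess
from pathlib import Path

# Sketch of what the parse/export modes likely reduce to (the real
# joern_graph_gen.py adds batching and error handling on top).
def parse_to_bin(src_file: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["joern-parse", str(src_file),
                    "--output", str(out_dir / (src_file.stem + ".bin"))],
                   check=True)

def export_pdg(bin_file: Path, out_dir: Path) -> None:
    # joern-export writes one .dot file per method into a fresh directory
    subprocess.run(["joern-export", str(bin_file),
                    "--repr", "pdg", "--out", str(out_dir / bin_file.stem)],
                   check=True)
```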
Train a sent2vec model by referring to sent2vec:
./fasttext sent2vec -input ./data/data.txt -output ./data/data_model -minCount 8 -dim 128 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000 -maxVocabSize 750000 -numCheckPoints 10
(For convenience, we share a simple sent2vec model here (Baidu) or here (Google) trained on our SARD dataset. To achieve better performance with VulCNN, train a new sent2vec model on a larger dataset such as the Linux kernel.)
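To sanity-check a trained model from Python, the epfml sent2vec bindings expose roughly the following API (the example sentence is made up):

```python
import sent2vec  # https://github.com/epfml/sent2vec

model = sent2vec.Sent2vecModel()
model.load_model("./data/data_model.bin")
# embeds one normalized statement into a 128-dim vector (-dim 128 above)
vec = model.embed_sentence("FUN1 ( VAR1 , VAR2 )")
print(vec.shape)  # (1, 128)
```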
Generate images from the PDGs with ImageGeneration.py; this step outputs a .pkl file for each .dot file:
python ImageGeneration.py -i ./data/sard/pdgs/Vul -o ./data/sard/outputs/Vul -m ./data/data_model.bin
python ImageGeneration.py -i ./data/sard/pdgs/No-Vul -o ./data/sard/outputs/No-Vul -m ./data/data_model.bin
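Image generation follows the idea from the paper: each PDG node (one statement) is embedded with sent2vec, and the vector is scaled by three centrality measures to form three image channels. A minimal sketch of that idea (names and shapes are illustrative, not the script's exact code):

```python
import networkx as nx
import numpy as np

def pdg_to_image(pdg: nx.DiGraph, embed, dim: int = 128) -> np.ndarray:
    # one centrality score per node for each of the three channels
    degree = nx.degree_centrality(pdg)
    katz = nx.katz_centrality(pdg)
    closeness = nx.closeness_centrality(pdg)
    rows = []
    for node, data in pdg.nodes(data=True):
        v = embed(data["code"])  # sent2vec vector of the statement
        rows.append([degree[node] * v, katz[node] * v, closeness[node] * v])
    return np.asarray(rows)  # shape: (num_statements, 3, dim)
```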
Integrate the data and split it into training and test sets with generate_train_test_data.py; this step outputs a train.pkl and a test.pkl file:
# n denotes the number of folds for k-fold cross-validation; e.g., n=10 splits training and test sets 9:1 and performs 10 sets of experiments
python generate_train_test_data.py -i ./data/sard/outputs -o ./data/sard/pkl -n 5
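The fold logic amounts to standard k-fold splitting, illustrated here with scikit-learn (not the script's actual code):

```python
from sklearn.model_selection import KFold

samples = list(range(100))  # stand-in for the per-function .pkl entries
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(samples)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```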
Train and evaluate the model with VulCNN.py:
python VulCNN.py -i ./data/sard/pkl
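For orientation, the classifier is a CNN over the generated images; below is a minimal PyTorch sketch of such a model (layer sizes and kernel heights are assumptions, not VulCNN.py's exact architecture):

```python
import torch
import torch.nn as nn

class MiniVulCNN(nn.Module):
    # TextCNN-style classifier over (3, lines, dim) images; sizes are
    # illustrative only.
    def __init__(self, dim=128, n_filters=32, kernel_heights=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(3, n_filters, (k, dim)) for k in kernel_heights)
        self.fc = nn.Linear(n_filters * len(kernel_heights), 2)

    def forward(self, x):  # x: (batch, 3, lines, dim)
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(x)).squeeze(3)        # (batch, n_filters, lines-k+1)
            feats.append(torch.max(h, dim=2).values)  # global max pooling
        return self.fc(torch.cat(feats, dim=1))       # logits: non-vul / vul

logits = MiniVulCNN()(torch.randn(4, 3, 100, 128))  # batch of 4 images
```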
Yueming Wu, Deqing Zou, Shihan Dou, Wei Yang, Duo Xu, and Hai Jin. VulCNN: An Image-inspired Scalable Vulnerability Detection System. In Proceedings of the 44th International Conference on Software Engineering (ICSE 2022).
If you use our dataset or source code, please kindly cite our paper:
@INPROCEEDINGS{vulcnn2022,
  author={Wu, Yueming and Zou, Deqing and Dou, Shihan and Yang, Wei and Xu, Duo and Jin, Hai},
  booktitle={2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)},
  title={VulCNN: An Image-inspired Scalable Vulnerability Detection System},
  year={2022},
  pages={2365-2376},
  doi={10.1145/3510003.3510229}}