
Vul-LMGGNN

Code for the paper - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graph

Introduction

In this work, we propose Vul-LMGNN, a unified model that combines pre-trained code language models with code property graphs for source code vulnerability detection. Vul-LMGNN first constructs a code property graph and then leverages a pre-trained code language model to extract local semantic features as node embeddings of that graph. On top of this, we introduce a gated code Graph Neural Network (GNN). By jointly training the code language model and the gated code GNN modules, Vul-LMGNN effectively exploits the strengths of both mechanisms. Finally, we use a pre-trained CodeBERT as an auxiliary classifier. The proposed method demonstrates superior performance compared to six state-of-the-art approaches.

Getting Started

Create an environment and install the required packages for LMGGNN.

Install packages
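The exact dependency list is not reproduced here. A typical setup (our assumption, not a documented step of this repository) is to create a fresh Python environment with conda or venv, install the Python dependencies with pip (e.g., pip install -r requirements.txt if a requirements file is provided), and install Joern locally, since it is required for CPG extraction (see Usage below).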

The experiments were executed on a single NVIDIA A100 80GB GPU, with NVIDIA driver version 525.85.12 and CUDA version 11.8.

Dataset

We evaluated the performance of our model on four publicly available datasets. Their composition is as follows; you can click on a dataset name to download it. Please note that you need to modify the code in the CPG_generator function in run.py to adapt to the different dataset formats (a rough illustration follows the table).

| Dataset | #Vulnerable | #Non-Vulnerable | Source |
| --- | --- | --- | --- |
| DiverseVul | 18,945 | 330,492 | Snyk, Bugzilla |
| Devign | 11,888 | 14,149 | GitHub |
| VDSIC | 82,411 | 1,191,955 | GitHub, Debian |
| ReVeal | 1,664 | 16,505 | Chrome, Debian |
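The exact changes to CPG_generator depend on each dataset's file layout. As a rough illustration only (the helper name, the JSON layout, and the func/target field names below are assumptions modeled on the Devign release, not the repository's actual code), normalizing a dataset to a uniform code/label table before CPG generation could look like this:

```python
import json

import pandas as pd


def load_dataset(json_path: str) -> pd.DataFrame:
    """Hypothetical loader: return a DataFrame with a 'func' column
    (function source code) and a 'target' column (0/1 vulnerability label).
    Adjust the field mapping for DiverseVul, VDSIC, or ReVeal, whose files
    use different field names and formats."""
    with open(json_path) as f:
        records = json.load(f)
    df = pd.DataFrame(records)
    # Keep only the columns the CPG generation step needs.
    return df[["func", "target"]]
```

The idea is that the CPG generation step then always consumes the same two-column table, whatever the original dataset format.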

Usage

Some tips:
Preparing the CPG:
python run.py -cpg -embed -mode train -path /your/model/path

-cpg and -embed indicate, respectively, that Joern should be used to extract the code's CPG and that the corresponding node embeddings should be generated. -path specifies the path where the model is saved.

Training and Testing:
python run.py -mode test -path /your/model/saved/path

-mode specifies whether only the training process is executed or both training and testing are performed. -path specifies the path where the model is saved.
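For example, python run.py -mode train -path /your/model/path runs the training process only, while -mode test (as in the command above) runs both training and testing.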

Fine-tuning process:

This command fine-tunes CodeBERT on a specific dataset and then generates the embeddings subsequently used for the graph nodes. Pre-trained CodeBERT weights need to be downloaded from here.

python fine-tune.py
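The internals of fine-tune.py are not documented in this README. As a minimal sketch of the general idea (the model identifier microsoft/codebert-base, the hyperparameters, and the helper below are our assumptions, not the script's actual implementation), fine-tuning CodeBERT for binary vulnerability classification with Hugging Face Transformers typically looks like this:

```python
# Illustrative sketch only -- not the repository's fine-tune.py.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # vulnerable vs. non-vulnerable
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


def training_step(code_batch, labels):
    """Run one optimization step on a batch of function source strings."""
    inputs = tokenizer(code_batch, truncation=True, padding=True,
                       max_length=512, return_tensors="pt")
    outputs = model(**inputs, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

After fine-tuning, the encoder's outputs can be reused to produce the node embeddings mentioned above.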

Main Results

Only the accuracy results are shown here; for other metrics, please refer to the paper.

| Model | DiverseVul | VDSIC | Devign | ReVeal |
| --- | --- | --- | --- | --- |
| BERT | 91.99 | 79.41 | 60.58 | 86.88 |
| CodeBERT | 92.40 | 83.13 | 64.80 | 88.64 |
| GraphCodeBERT | 92.96 | 83.98 | 64.80 | 89.25 |
| TextCNN | 92.16 | 66.54 | 60.38 | 85.43 |
| TextGCN | 91.50 | 67.55 | 60.47 | 87.25 |
| Devign | 70.21 | 59.30 | 57.66 | 65.47 |
| Ours | 93.06 | 84.38 | 65.70 | 90.80 |

Acknowledgement

Parts of the code for data preprocessing and graph construction using Joern are adapted from Devign. We appreciate their excellent work!