This is the replication package of the paper "Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks".
Our datasets and online appendix can be found here.
The repository is organized as follows:

- `common`: Common modules shared by pre-training and the downstream tasks.
- `data`: Datasets and saved models.
  - `datasets`: Datasets for experiments.
  - `models`: Saved models for intrinsic evaluation and extrinsic evaluation (fine-tuned).
- `dist_importing`: For importing modules when `dist_training` is enabled for AllenNLP.
- `downstreams`: Modules, scripts, and configs for the downstream tasks (extrinsic evaluation).
- `pretrain`: Modules, scripts, and configs for the pre-training task (intrinsic evaluation).
- `utils`: Utility functions.
To obtain the results of Table 1 & Table 2 in our paper, change into the `pretrain` folder (this is important for relative path resolution) and run:

```shell
python eval_partial_func_pdg.py
python eval_full_func_pdg.py
```
The test data used by these scripts are packed in `packed_hybrid_vol_221228.pkl`, and the ground truth for control dependency prediction (CDG) and data dependency prediction (DDG) has been constructed from the outputs of Joern and is provided in this file. The saved pre-trained model is available in `models/intrinsic`, but this version is only for intrinsic evaluation.
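If you want to look inside the packed test data, a minimal sketch along these lines should work; the internal layout of the pickled object is not documented here, so the probing below is only illustrative:

```python
import pickle

# Load the packed test data. Run this from the `pretrain` folder so the
# relative path resolves (adjust the path if the file lives elsewhere).
with open("packed_hybrid_vol_221228.pkl", "rb") as f:
    packed = pickle.load(f)

# The internal layout is not documented here, so just probe it:
print(type(packed))
if isinstance(packed, dict):
    print(list(packed.keys())[:10])      # e.g., look for CDG/DDG ground-truth keys
elif isinstance(packed, (list, tuple)):
    print(len(packed), type(packed[0]))  # number of samples and sample type
```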
We use three vulnerability analysis tasks for extrinsic evaluation: vulnerability detection, vulnerability classification, and vulnerability assessment.
To make training and testing run as a unified pipeline, you should open `downstream/global_vars.json` and make some configurations. In detail, each key of the object in `downstream/global_vars.json` should be the name of your machine (run `import platform; print(platform.node())` in Python to check it), and `python_bin` should be the path where your Python binary is located.
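As a minimal sketch, assuming `platform.node()` prints `my-workstation` and that `python_bin` is nested under the machine-name key (the machine name and path below are placeholders; check the exact schema against the file shipped in the repository), `downstream/global_vars.json` would look something like:

```json
{
    "my-workstation": {
        "python_bin": "/home/me/miniconda3/envs/pdbert/bin/python"
    }
}
```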
For vulnerability detection, change into the `downstream` folder (this is important for relative path resolution) and run:

```shell
python train_eval_from_config.py -config configs/vul_detect/pdbert_reveal.jsonnet -task_name vul_detect/reveal -average binary
python train_eval_from_config.py -config configs/vul_detect/pdbert_devign.jsonnet -task_name vul_detect/devign -average binary
python train_eval_from_config.py -config configs/vul_detect/pdbert_bigvul.jsonnet -task_name vul_detect/bigvul -average binary
```

The three commands train and evaluate on the ReVeal, Devign, and Big-Vul datasets, respectively.
For vulnerability classification (CWE classification), change into the `downstream` folder (this is important for relative path resolution) and run:

```shell
python train_eval_from_config.py -config configs/cwe_class/pdbert.jsonnet -task_name cwe_class -average macro -extra_averages weighted
```
For vulnerability assessment, change into the `downstream` folder (this is important for relative path resolution) and run:

```shell
python train_eval_multi_task_from_config.py -config configs/vul_assess/pdbert.jsonnet -task_name vul_assess -extra_eval_configs "{\"task_names\":\"CPL,AVL,CFD,ITG\"}" -eval_script eval_multi_task_classification -average macro -extra_averages weighted
```
The configs of these tasks are located in `downstream/configs` and can be modified accordingly. If you run into GPU memory limits, you can reduce `data_loader/batch_size` in the config. But to keep consistent with our configuration, you should correspondingly increase `trainer/num_gradient_accumulation_steps`, since the real batch size is `batch_size * num_gradient_accumulation_steps`.
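As a hypothetical illustration (the numbers below are placeholders, not the defaults shipped in the configs), halving `batch_size` while doubling `num_gradient_accumulation_steps` keeps the real batch size unchanged:

```jsonnet
// Original (hypothetical values): real batch size = 32 * 1 = 32
data_loader: { batch_size: 32 },
trainer: { num_gradient_accumulation_steps: 1 },

// Memory-constrained variant: real batch size = 16 * 2 = 32
data_loader: { batch_size: 16 },
trainer: { num_gradient_accumulation_steps: 2 },
```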
The CodeBERT base model can be downloaded from Hugging Face via Git LFS:

```shell
git lfs install
git clone https://huggingface.co/microsoft/codebert-base
```