Artifact for TOSEM paper: Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors.
The SARD dataset has been uploaded to Zenodo. For the Fan dataset, the related information is at MSR_20_Code_vulnerability_CSV_Dataset, and the dataset CSV can be downloaded from Google Drive. We extract func_before and func_after from it.
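As an illustration of the extraction step, the sketch below reads func_before and func_after from each row of a CSV with Python's standard csv module. It is a minimal sketch, not the artifact's own extraction code: the function name iter_function_pairs and the in-memory sample are hypothetical, and only the two column names come from the text above.

```python
import csv
import io


def iter_function_pairs(csv_file):
    """Yield (func_before, func_after) for each row of an open CSV file."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Column names as stated in this README; all other columns are ignored.
        yield row["func_before"], row["func_after"]


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real CSV (hypothetical content).
    sample = io.StringIO(
        "func_before,func_after\n"
        '"int get(int i) { return a[i]; }","int get(int i) { return i < n ? a[i] : 0; }"\n'
    )
    for before, after in iter_function_pairs(sample):
        print("before:", before)
        print("after: ", after)
```

For the real file, pass an open file handle instead of the in-memory sample. Very long function bodies may exceed the csv module's default field size limit, in which case csv.field_size_limit can be raised before reading.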
To preprocess code into graphs, please refer to preprocess/ReadMe.md.
Run python pretrain.py detector_name path2train_datas embedding_model_path
detector_name: The name of the detector. The choices are reveal, devign, ivdetect, and deepwukong. We will soon add the remaining 3 sequence-based detectors to this pipeline.
path2train_datas: The directory that stores train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, and test_normal.json. The script reads training data from the train JSON files.
embedding_model_path: The path to the saved embedding model.
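Before running the scripts, it can help to confirm that the data directory holds all six JSON files listed above. The following is a minimal sketch of such a check; the helper name missing_dataset_files is hypothetical and is not part of the artifact.

```python
from pathlib import Path

# File names as listed in this README.
EXPECTED_FILES = [
    "train_vul.json", "train_normal.json",
    "eval_vul.json", "eval_normal.json",
    "test_vul.json", "test_normal.json",
]


def missing_dataset_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    import tempfile

    # Demo on a throwaway directory that contains only the two train files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in EXPECTED_FILES[:2]:
            (Path(tmp) / name).write_text("[]")
        print("missing:", missing_dataset_files(tmp))
```

An empty list means every expected file is present. The check only looks at file names and does not validate the JSON contents.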
Run python detection.py <args> to train or test detectors. <args> includes:
--detector <detector_name>: <detector_name> can be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"].
--w2v_model_path <model_path>: <model_path> can be a relative or absolute path to the pretrained word2vec model.
--dataset_dir <dataset_dir>: <dataset_dir> is the path to the directory storing the JSON data. It should include train_vul.json, train_normal.json, eval_vul.json, eval_normal.json, test_vul.json, and test_normal.json.
--model_dir <model_dir>: <model_dir> is the directory where the model .pth file is stored. The script automatically loads the best model in that directory.
--train: Train the model. If a model already exists in <model_dir>, the script first loads that model and then continues training.
--test: Test the model. A model must already exist in <model_dir>.
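The flags above can be assembled programmatically, for example to train a detector and then test it. The sketch below only builds and prints the command lines and does not run them. The helper build_detection_command and all paths are hypothetical. Passing --train and --test in separate runs is an assumption of this sketch, since the text above does not say whether they can be combined.

```python
import shlex

# Detector names as listed in this README.
DETECTORS = ["deepwukong", "reveal", "ivdetect", "devign",
             "tokenlstm", "vuldeepecker", "sysevr"]


def build_detection_command(detector, w2v_model_path, dataset_dir, model_dir, mode):
    """Return the argv list for one detection.py run; mode is 'train' or 'test'."""
    if detector not in DETECTORS:
        raise ValueError(f"unknown detector: {detector!r}")
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    return [
        "python", "detection.py",
        "--detector", detector,
        "--w2v_model_path", str(w2v_model_path),
        "--dataset_dir", str(dataset_dir),
        "--model_dir", str(model_dir),
        f"--{mode}",
    ]


if __name__ == "__main__":
    # Placeholder paths (hypothetical); substitute your own.
    for mode in ("train", "test"):
        cmd = build_detection_command(
            "reveal", "w2v.model", "data", "models/reveal", mode
        )
        print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) from the repository root to actually launch the run.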
Run python explain.py <args>. <args> includes:
--detector <detector_name>: <detector_name> can be one of ["deepwukong", "reveal", "ivdetect", "devign", "tokenlstm", "vuldeepecker", "sysevr"].
--w2v_model_path <model_path>: <model_path> can be a relative or absolute path to the pretrained word2vec model.
--dataset_dir <dataset_dir>: <dataset_dir> is the path to the directory storing the JSON data. It should include test_vul.json.
--model_dir <model_dir>: <model_dir> is the directory where the model .pth file is stored. The script automatically loads the best model in that directory.
--explainer <explainer_name>: <explainer_name> can currently be one of ["gnnexplainer", "pgexplainer", "gnnlrp", "gradcam", "deeplift"]. We are integrating the code for the sequence-based explainers into this pipeline.
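To compare explainers on one trained detector, the explain.py flags can be combined into one command per explainer. The sketch below is a hypothetical helper that only constructs and prints the commands. The paths are placeholders, and it assumes the chosen detector works with every listed explainer, which the text above does not guarantee.

```python
import shlex

# Explainer names as listed in this README.
EXPLAINERS = ["gnnexplainer", "pgexplainer", "gnnlrp", "gradcam", "deeplift"]


def explain_commands(detector, w2v_model_path, dataset_dir, model_dir,
                     explainers=EXPLAINERS):
    """Return one explain.py argv list per explainer for a fixed detector."""
    return [
        ["python", "explain.py",
         "--detector", detector,
         "--w2v_model_path", str(w2v_model_path),
         "--dataset_dir", str(dataset_dir),
         "--model_dir", str(model_dir),
         "--explainer", explainer]
        for explainer in explainers
    ]


if __name__ == "__main__":
    # Placeholder paths (hypothetical); dataset_dir should contain test_vul.json.
    for cmd in explain_commands("devign", "w2v.model", "data", "models/devign"):
        print(shlex.join(cmd))
```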
@misc{cheng2024fidelity,
title={Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors},
author={Baijun Cheng and Shengming Zhao and Kailong Wang and Meizhen Wang and Guangdong Bai and Ruitao Feng and Yao Guo and Lei Ma and Haoyu Wang},
year={2024},
eprint={2401.02686},
archivePrefix={arXiv},
primaryClass={cs.CR}
}