CGCL-codes / VulLLM

An implementation of the ACL 2024 Findings paper "Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning".

Question about instructions for reproducing results in the paper #2


CStriker commented 4 months ago

Hello, thank you for supplying the code for the paper.

I consider this paper the current state of the art in LLM-based vulnerability detection, since recent ICSE papers rely on smaller models such as CodeBERT. Your paper reports that VulLLM-CL achieves good results even when the test dataset differs from the training dataset, which means your method generalizes well.

By the way, I'm trying to reproduce the results in the paper, but the current README lacks a lot of detail. I had to fill in the missing pieces myself, and I failed to reproduce the performance reported in the paper.

I've written down the steps I took below; could you please let me know what I'm missing?

git clone https://github.com/CGCL-codes/VulLLM
cd VulLLM/CodeLlama

conda create -n llm python=3.8
conda activate llm
conda install -y scikit-learn simplejson
pip install llama-recipes[tests]
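(Before the next steps I also did a quick import check inside the llm environment to confirm the installs resolved; the module names below are just my assumption from the pip/conda commands above.)

# quick sanity check run inside the "llm" conda env; module names assumed from the installs above
import torch
import sklearn
import llama_recipes
print(torch.__version__, "CUDA available:", torch.cuda.is_available())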

Download the model_checkpointing folder from https://github.com/meta-llama/llama-recipes/tree/74bde65a62667a38ee0411676cf058c53f85771c

vi configs/datasets.py

train_data_path: str = "../dataset/MixVul/multi_task/multi_train_512_augmentation.json"
valid_data_path: str = "../dataset/MixVul/llm/valid_512.json"
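For reference, here is roughly what the edited part of configs/datasets.py looks like on my side. The surrounding dataclass and its name are just a sketch of how I set it up; only the two path fields above come from the repo.

from dataclasses import dataclass

@dataclass
class vul_dataset:                     # class name is my own; adapt to the repo's actual dataclass
    dataset: str = "vul_dataset"
    train_data_path: str = "../dataset/MixVul/multi_task/multi_train_512_augmentation.json"
    valid_data_path: str = "../dataset/MixVul/llm/valid_512.json"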

#################################################

python finetuning.py \
  --use_peft \
  --model_name codellama/CodeLlama-13b-hf \
  --peft_method lora \
  --batch_size_training 32 \
  --val_batch_size 32 \
  --context_length 512 \
  --quantization \
  --num_epochs 3 \
  --output_dir codellama-13b-multi-r16

python inference.py \
  --model_type codellama \
  --base_model codellama/CodeLlama-13b-hf \
  --tuned_model ./codellama-13b-multi-r16/epoch-2 \
  --data_file ../dataset/ReVeal/test_512.json

I tried the above steps, and the 13B model shows very poor results. Surprisingly, the 7B model performs better, but still falls short of the performance reported in the paper.

Would you please help solve this problem?

xhdu commented 3 months ago

Thank you for your attention. Is this performance degradation observed across all datasets or only on ReVeal? As reported in our paper, the 7B model does indeed perform better than the 13B model on ReVeal. Additionally, please check whether your experimental settings are consistent with ours; they can be found in Appendix E of the paper.

CStriker commented 3 months ago

Thank you for the reply.

I observed performance degradation when running inference on ReVeal, BigVul, and Devign (compared to the values in the paper). I also tested the 7B model on the PrimeVul test dataset (ICSE 2025), and its F1 score is 64.62. It seems to generalize well, but the results on the other datasets differ from those in the paper. I want to make sure the baseline code is working correctly; could you please help me fix this?
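For context, this is roughly how I compute F1 from the inference outputs, a minimal sketch with scikit-learn; the predictions file name and JSON keys are placeholders, not the repo's actual output format.

import json
from sklearn.metrics import f1_score

# "predictions.json" and the "label"/"prediction" keys are placeholders;
# adapt to however inference.py actually stores its outputs
with open("predictions.json") as f:
    records = json.load(f)

y_true = [r["label"] for r in records]        # ground-truth 0/1 labels
y_pred = [r["prediction"] for r in records]   # predicted 0/1 labels

print(f"F1: {f1_score(y_true, y_pred) * 100:.2f}")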

I didn't change anything in the command lines above, and I confirmed that the default settings match the experimental settings in Appendix E. If anything in the commands above is wrong or missing, please let me know.

xhdu commented 3 months ago

Based on the current information, I can't identify where the error might be either. Did you fine-tune the LLMs on the original dataset (without adding the vulnerability interpretation and localization tasks)? Does that align with the results of the ablation study in our paper?
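As a rough illustration, such an ablation run would only change the training data path in configs/datasets.py to the single-task file; the exact file name below is a placeholder, so use whichever single-task split you have.

# placeholder path for the single-task (detection-only) training file; the actual file name may differ
train_data_path: str = "../dataset/MixVul/llm/train_512.json"
valid_data_path: str = "../dataset/MixVul/llm/valid_512.json"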