CGCL-codes / VulLLM

An implementation of the ACL 2024 Findings paper "Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning".

Could you please release your code for GPT-4? #1

Closed: liususan091219 closed this issue 4 months ago

liususan091219 commented 5 months ago

Dear authors, could you please release the code for prompting GPT-4? If I'm not wrong, it looks like you have only released the code for fine-tuning. Thanks in advance.

xhdu commented 4 months ago


Thank you for your attention. I have added the code for prompting GPT-4 and also included a CSV file containing the vulnerable lines and dependency lines. For the results from GPT-4, please refer to the previously released filtered multi-task dataset, which is located at the path dataset/MixVul/multi_task/.
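
For readers without the repo handy, here is a minimal sketch of how such a prompt could be assembled from the released CSV fields. The function name, prompt wording, and example code are illustrative assumptions, not the repository's actual code or prompt.

```python
def build_interpretation_prompt(code, vul_lines, dep_lines):
    """Assemble a prompt asking GPT-4 to explain a known-vulnerable
    function, given its vulnerable lines and dependency lines.
    The wording is illustrative, not the paper's exact prompt."""
    return (
        "The following C function is known to be vulnerable.\n\n"
        "Code:\n" + code + "\n\n"
        "Vulnerable lines:\n" + "\n".join(vul_lines) + "\n\n"
        "Dependency lines:\n" + "\n".join(dep_lines) + "\n\n"
        "Explain the root cause of the vulnerability."
    )

# Hypothetical example function with a classic unbounded strcpy
prompt = build_interpretation_prompt(
    "int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }",
    ["strcpy(buf, s);"],
    ["char buf[8];"],
)
print(prompt)
```

The resulting string would then be sent to GPT-4 through the OpenAI chat completions API; lowering the sampling temperature reduces, though does not eliminate, the run-to-run variance the author notes below.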

xhdu commented 4 months ago

Note: Due to the randomness introduced by GPT-4’s sampling techniques, its interpretations of vulnerabilities may not be exactly the same each time. Therefore, the results in our multi-task dataset may be difficult to fully reproduce.

liususan091219 commented 4 months ago

Hi Xiaohu, thanks a lot for your quick response and for sharing the code!

I'd like to ask a clarification question about the "Vulnerability Lines" for GPT-4. The paper says:

We extract vulnerability lines from the patches of the vulnerable code. Patches are generated to fix existing vulnerabilities via adding or deleting certain code elements

The input of VulLLM is code (not a patch), right? Some of the datasets in Table 1 contain only the code, not the patch, so how is the patch generated here?

Thanks in advance for your help!

xhdu commented 4 months ago


Yes, the input of VulLLM does not include the patch. The patch is used for generating data for auxiliary tasks. The patch is a field included in the PatchDB dataset, which you can view at this link: https://huggingface.co/datasets/sunlab/patch_db. Please note that we only use PatchDB to generate data for auxiliary tasks. Since the pre-fix code does not include the added lines, we only consider the deleted lines in the patch as the vulnerable lines. Based on these vulnerable lines, we use JOERN to extract dependency lines and then prompt GPT-4.
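The deleted-lines step described above can be sketched in a few lines of Python. This is a simplified illustration of the idea (collect the `-` lines of a unified diff, skipping the `---` file header), not the repository's code.

```python
def deleted_lines(patch):
    """Return the lines removed by a unified-diff patch.
    Since the pre-fix code lacks the added (+) lines, the
    deleted (-) lines are treated as the vulnerable lines."""
    out = []
    for line in patch.splitlines():
        # '-' marks a deletion; '---' is the old-file header, not a change
        if line.startswith("-") and not line.startswith("---"):
            out.append(line[1:].strip())
    return out

# Hypothetical patch replacing an unbounded strcpy
patch = """--- a/f.c
+++ b/f.c
@@ -1,3 +1,3 @@
 int f(char *s) {
-    strcpy(buf, s);
+    strncpy(buf, s, sizeof(buf) - 1);
 }"""
print(deleted_lines(patch))  # → ['strcpy(buf, s);']
```

The extracted lines would then be fed to JOERN to pull out the dependency lines before prompting GPT-4.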

liususan091219 commented 4 months ago


Thanks for your response. So you used PatchDB data to instruction fine-tune an LLM; what is the instruction fine-tuned LLM used for? Since the input code doesn't contain the +/- lines, I guess the instruction fine-tuned LLM is used to generate the +/- lines, so the code becomes a patch and you can start using JOERN?

I think if the PatchDB instruction fine-tuning code can be uploaded, it will help explain this more clearly. Thanks!

xhdu commented 4 months ago


I don't quite understand what you mean. The instruction fine-tuned LLM is VulLLM, which is used to detect vulnerabilities.

Let me outline the entire process again. First, we use the PatchDB dataset to obtain vulnerable lines (based on patches) and dependency lines (based on source code, vulnerable lines, and JOERN). Then, we use GPT-4 to obtain vulnerability interpretations. Finally, we use three tasks to fine-tune open-source CodeLLMs. For the format of the instruction fine-tuning dataset, please refer to the released datasets or Appendix A.
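
As an illustration of the multi-task setup only, a record set might look like the following. The field names, instruction wording, and example outputs are guesses for exposition; the actual schema is in the released dataset and Appendix A.

```python
# Hypothetical examples of the three task types; not the released schema.
source_code = "int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }"

records = [
    {  # Task 1: vulnerability detection (the main task)
        "instruction": "Detect whether the following function is vulnerable.",
        "input": source_code,
        "output": "Yes",
    },
    {  # Task 2: vulnerability localization (auxiliary, from patch deletions)
        "instruction": "Identify the vulnerable lines in the following function.",
        "input": source_code,
        "output": "strcpy(buf, s);",
    },
    {  # Task 3: vulnerability interpretation (auxiliary, from GPT-4)
        "instruction": "Explain the vulnerability in the following function.",
        "input": source_code,
        "output": "strcpy copies into an 8-byte buffer without bounds checking.",
    },
]
```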

liususan091219 commented 4 months ago

Let's say there is a function on GitHub that does not have a patch yet (e.g., an example in the Devign dataset). How do you use the PatchDB dataset to obtain the patch? I know PatchDB has the patch, but in Devign the patch has not happened yet since it's 0-day, right?

xhdu commented 4 months ago


Yes, other datasets do not contain ready-made patch information, and obtaining patches corresponding to vulnerabilities is also challenging. Therefore, we only perform vulnerability interpretation on PatchDB.

liususan091219 commented 4 months ago

Thanks for your quick response.

For the Devign result in Table 1, do you use the framework in Figures 1 and 2 or not? It seems your framework relies on the patch, but if we already know the patch, it's not 0-day detection anymore. Therefore, we can't apply Figures 1 and 2 to Devign.

Could you help outline the process for Devign?

xhdu commented 4 months ago


In our experiment, Devign is only used for the vulnerability detection task shown in Figure 1 (as described in Section 3.4). We did not extract auxiliary task data from Devign, so there is no related process for it.

liususan091219 commented 4 months ago

I think I got it now: your paper is about collecting training data to instruction fine-tune VulLLM. Then you use VulLLM to do a second fine-tuning on the Devign data; it doesn't do the interpretation at inference time.

Thank you so much for your patience and explanations!