VulDetProject / ReVeal

MIT License

Data Reproduction issues #21

Open hrshy0629 opened 1 year ago

hrshy0629 commented 1 year ago

First of all, all the links in the bash scripts seem to be broken. I'm not sure whether the problem is on my side.

Second, as I understand it, the code should be executed in this order:

1) Run get_raw_data.sh in data_collection.
2) Run process_data****.ipynb.
3) Run code-slicer. I have not managed this step because there is too little information on how to do it. It may generate some related data, but the file I cannot account for is devign_cfg_full_text_files.json. I don't know how to generate this file.
4) Run extract_slices.ipynb, which slices the data and generates devign_full_data_with_slices.json. That file can then be parsed by create_ggnn_data.py to generate data in various structures, or processed by full_data_prep_script.ipynb to generate the relevant data.

Did anyone go through the whole process?
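For anyone retracing the steps above, a minimal sketch of a check for which intermediate files are already present might look like the following. The two file names come from the post; the idea that they all sit in a single directory, and the helper names `report_pipeline_state` and `first_missing`, are assumptions made for illustration and are not part of the ReVeal repository.

```python
from pathlib import Path

# File names are taken from the post above. That they live in one directory
# is an assumption; adjust the paths to match your checkout.
EXPECTED_ARTIFACTS = [
    ("input expected before slicing (origin unclear in this thread)",
     "devign_cfg_full_text_files.json"),
    ("output of extract_slices.ipynb",
     "devign_full_data_with_slices.json"),
]


def report_pipeline_state(data_dir):
    """Return (stage, filename, exists) for each expected artifact in data_dir."""
    root = Path(data_dir)
    return [(stage, name, (root / name).is_file())
            for stage, name in EXPECTED_ARTIFACTS]


def first_missing(data_dir):
    """Return the first expected file that is absent, or None if all exist."""
    for _stage, name, exists in report_pipeline_state(data_dir):
        if not exists:
            return name
    return None
```

Calling `first_missing("path/to/data")` (a hypothetical path) shows where the pipeline stops. It does not explain how to produce a missing file, which is the open question in this thread.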

Royz2123 commented 1 year ago

Having similar issues as well - it seems that both the data and the model's Google Drive links are broken (returning a 404 status code).

Bad models link: https://drive.google.com/file/d/1gTgpgXGzSBlixNcUS-OaoXe8HxQXzaf0
Bad data link: https://drive.google.com/file/d/1Mn0jLaZWiPFQ8ejzlz_zXnx_TcSzbwu1
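To check whether a link fails everywhere or only on one machine, a small status probe can help. This is a sketch using only the Python standard library; the function name `link_status` and its injectable `opener` parameter are illustrative choices, not something from the repository.

```python
import urllib.error
import urllib.request


def link_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no HTTP response arrived.

    `opener` defaults to urllib.request.urlopen and can be replaced, which
    makes the function usable without network access (for example in tests).
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar problems.
        return None
```

Calling `link_status(...)` on each of the two URLs above should return 404 if the files are really gone, and `None` if the problem is local connectivity. A 200 response does not by itself prove the file can still be downloaded, so treat the result as a first hint only.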

jpuci commented 1 year ago

I have reached out to the owner of this repository (@saikat107), and unfortunately it looks like they don't have a backup of either the data or the models.

hrshy0629 commented 1 year ago

Oh my god, doesn't that mean the paper can't be reproduced??

Royz2123 commented 1 year ago

Indeed ...

jpuci commented 1 year ago

Not necessarily, or at least not in full. I haven't figured it out completely yet, but check out this paper: https://arxiv.org/abs/2212.08109. They have reproduced several papers, this one included. The data is different, I believe, but it's still real-world data. I have not seen any instructions on how to reproduce their results, but it looks promising.

hrshy0629 commented 1 year ago

In addition to the Devign data, other studies have used data augmentation, added SARD data, or applied class-balancing sampling methods. I don't have much confidence in these models. The paper "SEVulDet: A Semantics-Enhanced Learnable Vulnerability Detector" shows how much F1 drops when their model is applied to the full real-world data (although that data is still imbalanced, with 6% vulnerable samples).
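The effect of class imbalance on F1 can be illustrated with a short calculation. This is a sketch only: the true-positive and false-positive rates in the usage note are hypothetical values chosen for illustration, not results from SEVulDet or any other paper. Only the 6% prevalence figure comes from the comment above.

```python
def f1_at_prevalence(tpr, fpr, prevalence):
    """F1 score of a classifier with fixed TPR and FPR at a given class prevalence.

    tpr: fraction of vulnerable samples flagged (recall)
    fpr: fraction of non-vulnerable samples wrongly flagged
    prevalence: fraction of samples that are vulnerable
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    if true_pos == 0.0:
        return 0.0  # nothing correctly flagged, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = tpr
    return 2.0 * precision * recall / (precision + recall)
```

With a hypothetical detector at `tpr=0.8` and `fpr=0.1`, comparing `f1_at_prevalence(0.8, 0.1, 0.5)` against `f1_at_prevalence(0.8, 0.1, 0.06)` shows F1 falling sharply when the same detector moves from balanced data to data with 6% vulnerable samples. The drop comes from precision alone, because false positives are drawn from a much larger pool of non-vulnerable code.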

Powerdiao commented 1 year ago

I have read this paper. They have reproduced several models, but the relevant code has not been released yet.

1code12 commented 10 months ago

@hrshy0629 Hello. Have you managed to get through the whole process by now? Could you tell me how you did it? Thank you.

biringaChi commented 9 months ago

Hi guys, I finally retrieved the missing dataset. Visit https://github.com/biringaChi/vRAG and follow the README instructions to download it. Cheers.

pogamar commented 8 months ago

Can anyone explain the full replication process? Thank you.

zzhisthebest commented 3 months ago

@hrshy0629 Hi, have you found a way to generate devign_cfg_full_text_files.json?