DanielLin1986 / RepresentationsLearningFromMulti_domain

To learn function representations from multi-domain knowledge bases for software vulnerability detection.

There are some errors in the dataset #1

Open quyutest opened 3 years ago

quyutest commented 3 years ago

I found several errors in the dataset (the Data directory) that make it unusable.

For instance, consider Six_projects\LibPNG\Non_vulnerable_functions\0014_fakepng.c_main.c. This file's source code is:

int
main(void)
{
   fwrite(signature, sizeof signature, 1, stdout);
   put_chunk(IHDR, sizeof IHDR);
   for (;;)
      put_chunk(unknown, sizeof unknown);
}

However, "int" and "main(void)" should be on the same line.

Similarly, consider Six_projects\Asterisk\Non_vulnerable_functions\abstract_jb.c_ast_jb_enable_for_channel.c. This file's source code is:

}
void ast_jb_enable_for_channel(struct ast_channel *chan)
{
    struct ast_jb_conf conf = ast_channel_jb(chan)->conf;
    if (ast_test_flag(&conf, AST_JB_ENABLED)) {
        ast_jb_create_framehook(chan, &conf, 1);
    }
}

There is a redundant "}" on line 1. Please fix these errors.

DanielLin1986 commented 3 years ago

Hi there! Thank you for your comments.

Regarding the first issue ("int" and "main(void)" not being on the same line): this is fine, because the source code is eventually converted to a single sequence, so you will get:

"int main(void) { fwrite(signature, sizeof .....}"

That is, everything ends up on one line. The sequence is then treated as a "sentence" and fed to the neural network.
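To illustrate why the line break does not matter, here is a minimal Python sketch of flattening a function's source into a single-line sequence. The helper name `flatten_source` and the sample string are assumptions made for illustration, and this is not necessarily how the repository's own preprocessing is implemented.

```python
def flatten_source(source: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(source.split())


# Hypothetical sample in which "int" and "main(void)" sit on separate lines.
sample = "int\nmain(void)\n{\n   fwrite(signature, sizeof signature, 1, stdout);\n}\n"

print(flatten_source(sample))
# prints: int main(void) { fwrite(signature, sizeof signature, 1, stdout); }
```

After flattening, a function written with "int" on its own line yields the same sequence as one written with "int main(void)" on a single line.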

The second issue is caused by a bug in the extraction script at the following link:

https://github.com/DanielLin1986/function_representation_learning/blob/master/Code/ExtractCFunctionByName_v2.py

We post-process the source code files by removing the redundant "}" on line 1. So the extracted C function source files need this additional processing step before use.
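A post-processing step of this kind could look roughly like the following Python sketch. The function name `strip_leading_brace` is hypothetical and this is not the authors' actual script. It assumes the only defect is a single stray "}" before the function definition.

```python
def strip_leading_brace(source: str) -> str:
    """Remove one stray '}' that precedes the function definition, if present."""
    stripped = source.lstrip()
    # A C function definition cannot begin with '}', so a leading one
    # must be left over from the previous function in the original file.
    if stripped.startswith("}"):
        return stripped[1:].lstrip()
    return source


# Hypothetical extracted file content with the defect described above.
broken = "}\nvoid f(void)\n{\n}\n"
print(strip_leading_brace(broken))
```

Files that do not start with "}" are returned unchanged, so the function can be applied to every extracted file without checking each one first.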