There are some errors in the data set.

dessertlab / EVIL

EVIL (Exploiting software VIa natural Language) is an approach to automatically generate software exploits in assembly/Python language from descriptions in natural language. The approach leverages Neural Machine Translation (NMT) techniques and a dataset that we developed for this work.

GNU General Public License v3.0


There are some errors in the data set. #1

NTDXYG closed this issue 2 years ago

NTDXYG commented 2 years ago

For example, in the Python dataset, line 3230 of the file "encoder-train.in" is "define the method serialize_headers with an argument self.", while line 3230 of the file "encoder-train.out" is "def streaming_content ( self ) :". The method name is obviously wrong. Errors like this occur in large numbers in the dataset, so that after IP processing a single input ends up with multiple outputs. For example, after IP the pair above becomes "define the method var0 with an argument self." and "def streaming_content ( self ) :". The placeholder var0 does not correspond to anything in the code.
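A minimal sketch of how mismatches of this kind could be flagged automatically. It assumes that the intent uses the phrasing "define the method NAME" (the "function" variant is an assumption) and that the snippet is a tokenized `def` line. The function name `find_name_mismatches` and the second sample pair are made up for illustration and are not part of the EVIL code or data.

```python
import re

# Assumed phrasing of the intent, based on the example quoted above.
INTENT_RE = re.compile(r"define the (?:method|function) (\w+)")
# Tokenized snippet, e.g. "def streaming_content ( self ) :".
CODE_RE = re.compile(r"^\s*def\s+(\w+)\s*\(")


def find_name_mismatches(intents, snippets):
    """Return (line_no, intent_name, code_name) for pairs whose names differ."""
    mismatches = []
    for line_no, (intent, snippet) in enumerate(zip(intents, snippets), start=1):
        intent_match = INTENT_RE.search(intent)
        code_match = CODE_RE.search(snippet)
        # Only compare pairs where both sides name a method or function.
        if intent_match and code_match:
            intent_name = intent_match.group(1)
            code_name = code_match.group(1)
            if intent_name != code_name:
                mismatches.append((line_no, intent_name, code_name))
    return mismatches


# The first pair is the one quoted above; the second is a made-up correct pair.
intents = [
    "define the method serialize_headers with an argument self.",
    "define the method close with an argument self.",
]
snippets = [
    "def streaming_content ( self ) :",
    "def close ( self ) :",
]

print(find_name_mismatches(intents, snippets))
# [(1, 'serialize_headers', 'streaming_content')]
```

A check like this only catches function and method definitions. Other kinds of misaligned pairs would need different rules.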

NTDXYG commented 2 years ago

I recommend pre-processing the dataset thoroughly to improve the quality of the data.

piliguori commented 2 years ago

The encoder dataset includes both the original, exploit-oriented snippets and snippets from an existing general-purpose Python dataset that was not created by our team (the Django dataset, https://github.com/odashi/ase15-django-dataset), so that the NMT model can generate code that mixes general-purpose and exploit-oriented instructions. While the exploit-oriented snippets are high-quality examples (multiple authors collected and described the code), we acknowledge that the examples in the Django dataset are potentially noisy (as a matter of fact, the examples you highlighted are from the Django dataset). Nevertheless, we chose the Django dataset because its large size helps the training phase of the NMT models. To properly show the feasibility of the approach in generating software exploits, we kept the original, noisy Django dataset unchanged, as it is widely used in different code generation tasks.

NTDXYG commented 2 years ago

Thanks for your reply, but I downloaded the Django dataset and double-checked it. The example I mentioned above is at line 10320 of "all.anno" and "all.code". Its input and output are "define the method serialize_headers with an argument self." and "def serialize_headers ( self ) :" respectively.

You can see that the original Django dataset is correct, so I would still recommend that you double-check the data processing part of the encoder.

Thanks again for your reply.

piliguori commented 2 years ago

The example you highlighted refers to line 10465 of the original Django dataset, where the intent is "define the method serialize_headers with an argument self", while the related (wrong) snippet is "def streaming_content ( self ) :". Note that we did not make any changes to the original Django dataset.
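Since two different line numbers are quoted for the same intent, one way to settle this is to list every line on which the intent occurs, together with its paired snippet. Below is a minimal sketch, assuming "all.anno" and "all.code" are line-aligned plain-text files. The helper names `read_lines` and `find_intent_lines` are hypothetical, and the sample data is made up to show the output format; it does not reproduce the actual contents of the Django dataset.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def find_intent_lines(intents, snippets, query):
    """Return (line_no, snippet) for every line whose intent equals query.

    Surrounding whitespace and a trailing period are ignored, because the
    intent is quoted both with and without the final period in this thread.
    """
    wanted = query.strip().rstrip(".")
    hits = []
    for line_no, (intent, snippet) in enumerate(zip(intents, snippets), start=1):
        if intent.strip().rstrip(".") == wanted:
            hits.append((line_no, snippet.strip()))
    return hits


# Made-up sample data; with the real files one would instead use
# intents = read_lines("all.anno") and snippets = read_lines("all.code").
intents = [
    "define the method serialize_headers with an argument self.",
    "define the method close with an argument self.",
    "define the method serialize_headers with an argument self.",
]
snippets = [
    "def serialize_headers ( self ) :",
    "def close ( self ) :",
    "def streaming_content ( self ) :",
]

for line_no, snippet in find_intent_lines(
    intents, snippets, "define the method serialize_headers with an argument self"
):
    print(line_no, snippet)
```

Run against the real files, this would show whether the intent occurs on more than one line and which snippet is paired with each occurrence.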

NTDXYG commented 2 years ago

You're right, I admit that there are real problems with the Django dataset. It's a real headache though.

Thanks again for your reply. I will close this issue.

piliguori commented 2 years ago

You are welcome! Thanks for raising the issue, and for your interest in our dataset.