(1) Extract non-cryptography files from CodeJam

arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code


(1) Extract non-cryptography files from CodeJam #5

Closed by corentinllorca 4 years ago

corentinllorca commented 4 years ago

See #1

arthurherbout commented 4 years ago

I started this task. The dataset I got is quite big (16 GB), so we need to find a way of taking a subsample.

Issues with the dataset:

It is divided into problems, so there will be a lot of duplicates.
It covers all languages. We need to decide which languages to keep. Only C and C++? What about .h files?

I am currently exploring the dataset to get a good feeling about it.

redouane-dziri commented 4 years ago

Let's get .c, .cpp, .h and .hpp files, if there are any! We decided to fetch 20 samples from each problem to get started. Hopefully there won't be too many duplicates and redundancies.
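
For reference, a minimal sketch of that sampling step. It is not the code that was merged. It assumes the dataset is laid out with one directory per problem (that layout, the function name `sample_per_problem`, and the fixed seed are all assumptions for illustration), keeps only the four extensions listed above, and draws at most 20 files per problem:

```python
import random
from pathlib import Path

# Extensions agreed on in this thread.
KEEP_EXTENSIONS = {".c", ".cpp", ".h", ".hpp"}
SAMPLES_PER_PROBLEM = 20


def sample_per_problem(root, n=SAMPLES_PER_PROBLEM, seed=0):
    """Pick up to n C/C++ files from each problem directory under root.

    Assumes (hypothetically) one sub-directory per problem; the real
    dataset layout may differ.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    selected = {}
    for problem_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # Sort candidates so the draw does not depend on filesystem order.
        candidates = sorted(
            f for f in problem_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in KEEP_EXTENSIONS
        )
        # Problems with fewer than n matching files contribute all of them.
        selected[problem_dir.name] = rng.sample(candidates, min(n, len(candidates)))
    return selected
```

Sampling within each problem caps how many near-identical solutions to the same problem can end up in the subsample, but it does not remove exact duplicates. Those would need a separate pass.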

redouane-dziri commented 4 years ago

Merged and done