OK, so this is a pretty big PR. While setting up the benchmark I ran into several issues with the collected data and dealt with them here.
Data-related tasks done here:
- There were a great number of `.S` files in `crypto-library/files/*` (as well as a couple of `.gitignore`, `.gitlab..`, `.s`, and `.yml` files), which I had to find and delete manually because Wind River was fetching and matching on them, polluting its output.
- I modified the `code-jam` data to remove the `code-jam_` prefix from the file names, to align with the other data sources. When querying the data, the unique identifier across all sources is now `(data_source, file_name)`.
- I fixed path issues in `code-library` so that file paths are saved in full and we can always trace, from the JSONs and dataframes we build, where each file came from. Previously, a file at `openssl/a/b/file.c` was saved as `openssl/file.c`; now all intermediary subfolders are kept so the file can be traced back.
- Re-generated `full-data.json`, `train.json`, and `test.json` with those changes; they have never been cleaner ;)
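The path handling above can be sketched roughly like this (the function name, file glob, and return shape here are hypothetical illustrations, not the actual pipeline code):

```python
from pathlib import Path

def collect_files(source_root: str, data_source: str) -> dict:
    """Walk a source directory and key each file by (data_source, relative_path).

    Keeping the full relative path (e.g. openssl/a/b/file.c rather than just
    openssl/file.c) preserves the intermediary subfolders, so every record
    can be traced back to its origin.
    """
    root = Path(source_root)
    records = {}
    for path in root.rglob("*.c"):
        file_name = str(path.relative_to(root))  # keeps intermediary subfolders
        records[(data_source, file_name)] = path.read_text(errors="replace")
    return records
```

The composite key `(data_source, file_name)` is what makes lookups unambiguous once the `code-jam_` prefix is gone from the file names themselves.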
Regarding the benchmark:
- Added the `models/benchmark` folder, with the files output by running Wind River on our file directories (without tweaking any settings), as well as code to extract information from these files: primarily the predictions, plus some context for each prediction, to analyze what kind of pattern drew the attention of the Wind River crypto-detector.
- Computed scores for this baseline, which on the current data gives an F2 score of 0.857, a precision of 0.793, and a recall of 0.874.
These values are completely reproducible; everyone should be able to re-obtain them at the click of a button :))
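As a quick sanity check, the reported F2 score is consistent with the precision and recall above, using the standard F-beta formula (beta = 2 weights recall more heavily than precision):

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Generic F-beta score; beta > 1 weights recall above precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Baseline numbers from the Wind River run on the current data:
print(round(f_beta(0.793, 0.874), 3))  # → 0.857
```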