Improve quality of data

corentinllorca commented 5 years ago

We have doubts about our dataset: either it does not cover enough types of code files, or it is "too separable" between cryptographic and non-cryptographic code. This might be because cryptographic code is inherently easy to separate from other types of code, or because our dataset hasn't been curated well enough. To test that, we will try to add specific categories of non-cryptographic code to our dataset, from diverse types of projects. Eventually, the aim might even be to completely replace the randomly generated "other" dataset.

We need to select types of code that encompass enough variety for our model to generalize and, at the same time, are "close" to crypto code, so that our algorithm has to make a finer separation.

This raises the question of where the limit of cryptography lies. For example: is networked communication (IPv4/IPv6) considered cryptography?

corentinllorca commented 5 years ago

Found: the GitHub Topics page, which groups repositories into "topics". This is very useful, as it effectively allows us to look at each topic and hand-pick the repos that we want to integrate into our dataset. It would even allow us to include more crypto code, because there is a cryptography topic (and also encryption, decryption), with more than 3000 repos.

corentinllorca commented 5 years ago

Some nice topics that use bit-level operations but that don't seem to be cryptography:

microcontrollers, IoT and all kinds of micro-OS
(to be continued)

arnaudstiegler commented 5 years ago

I think it is a very good idea to expand the dataset to make the problem more challenging. And based on what I've seen, I think we should write down some guidelines for manually labeling files so that we don't face the same issues we have with v1.

A few random thoughts on that based on what I've seen from the model results:

there were some crypto files incorrectly labelled in the non-crypto set: should we solve that or think of it as regularization?
header files are deceptive: around 50% of false predictions come from header files, and this is usually due to the fact that they define crypto functions but the only hint of it being crypto is just the function name (which is discarded at least for my model).

redouane-dziri commented 5 years ago

Header files are problematic indeed. I'm unsure we would all agree on which .h files should be considered crypto. Good point @arnaudstiegler

corentinllorca commented 5 years ago

We have decided to keep all .c and .h files for certain repos. We need the .h files, otherwise the parsers to build ASTs won't function. This means that we'll have to be extra careful when labeling those files.

corentinllorca commented 5 years ago

Change of approach: since hand-labeling the files seems like a very painful thing to do (but we might still do it later), we will hand-label the repos instead. There might be a high number of false positives here, but we'll look at that later on. The final JSON will look like this:

[
       {
           file_name: ...,
           is_header: ...,
           source_username: ...,
           source_repo: ...,
           file_path: ..., 
           label: ...,
           content: ...
       },
       {
           file_name: ...,
           is_header: ...,
           source_username: ...,
           source_repo: ...,
           file_path: ..., 
           label: ...,
           content: ...
       },
       ....
 ]

source_repo is the URL of the repo the file originally came from, source_username is the username of that repo's author, and file_path is the path of the file within the repo. We keep all of this data to uniquely identify files (some files might have the same name even within the same repo, and some repos by different authors might have the same name too).
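
As a rough illustration of the record format above, here is a minimal Python sketch. The helper names `make_record` and `record_key`, the example file paths, the rule that `is_header` is derived from the `.h` extension, and the convention that label 1 means crypto are all assumptions made for the example; they are not taken from the extract_data script.

```python
import json
from pathlib import PurePosixPath


def make_record(source_username, source_repo, file_path, label, content):
    """Build one dataset entry with the seven fields of the JSON format."""
    path = PurePosixPath(file_path)
    return {
        "file_name": path.name,
        # Assumption: a header is any file with a .h extension.
        "is_header": path.suffix == ".h",
        "source_username": source_username,
        "source_repo": source_repo,
        "file_path": file_path,
        "label": label,
        "content": content,
    }


def record_key(record):
    """Author, repo URL and in-repo path together identify a file."""
    return (record["source_username"], record["source_repo"], record["file_path"])


# Hypothetical paths: two files with the same name inside the same repo.
repo_url = "https://github.com/jedisct1/libsodium"
records = [
    make_record("jedisct1", repo_url, "dir_a/utils.c", 1, "/* ... */"),
    make_record("jedisct1", repo_url, "dir_b/utils.c", 0, "/* ... */"),
]

serialized = json.dumps(records, indent=4)
```

Both records share the same `file_name`, but `record_key` still tells them apart, which is the reason for keeping the three source fields.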

corentinllorca commented 5 years ago

Pushed the extract_data script on new_dataset branch, and modified the train_test script so that it stratifies by label instead of data source. Please note that repos have one unique label. This will obviously lead to false positives, but we'll address that issue later. Things we now have to do and think about:

Populate our dataset by listing the repos we want in the repos_labels.txt, with their label
Think about train/test split: if we want to do ASTs, we will need to have the complete repos. Maybe only split repos and not individual files in the train/test split
Change of scope: maybe recognize crypto repos instead of crypto files?
If we still want to recognize at file level (or even function level), we will need to label this data and modify the data extraction script to take that into account.
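
The idea of splitting whole repos rather than individual files, while still stratifying by label, could look roughly like the sketch below. The function names, the 20% test fraction, and the use of one label per repo as the stratification key are assumptions for illustration; this is not the code of the train_test script.

```python
import random
from collections import defaultdict


def split_repos(repo_labels, test_fraction=0.2, seed=0):
    """Split repos into train/test sets, stratified by the repo-level label.

    repo_labels maps a repo identifier to its single label (0 or 1).
    """
    by_label = defaultdict(list)
    for repo, label in repo_labels.items():
        by_label[label].append(repo)

    rng = random.Random(seed)
    train, test = set(), set()
    for label in sorted(by_label):
        repos = sorted(by_label[label])
        rng.shuffle(repos)
        # Keep at least one test repo per label when there is more than one.
        n_test = max(1, round(len(repos) * test_fraction)) if len(repos) > 1 else 0
        test.update(repos[:n_test])
        train.update(repos[n_test:])
    return train, test


def split_records(records, train_repos, test_repos):
    """Assign each file record to the side its repo landed on."""
    train = [r for r in records if r["source_repo"] in train_repos]
    test = [r for r in records if r["source_repo"] in test_repos]
    return train, test
```

Because every file follows its repo, a complete repo is always available on one side of the split, which is what building ASTs would need.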

arnaudstiegler commented 5 years ago

@corentinllorca, I think the change of scope would be a problem for many reasons:

how would you label a repo? Go over each file, see whether they are crypto or not and output a label for the entire repo? But then, what's the difference with doing individual files prediction if in the end, you end up predicting on a file level?
based on my observations from crypto libraries, they contain huge chunks of code that are non-crypto. If you were to assign the label crypto to all the files in there, the labels would be extremely noisy, and it is unlikely that we would get a good score from such noisy labels
I think it is more interesting to know where the crypto is in the repo than just to know if there is crypto in the repo

corentinllorca commented 5 years ago

What to do next:

Non-crypto code:

List of topics (in progress):

[x] OS code (BE CAREFUL: most OS repos actually contain crypto! So maybe download those as part of the crypto repos, not the non-crypto ones)
[x] ML code
[x] math code
[ ] Web services systems code
[x] Databases for C
[ ] Streaming
[x] Anything that uses hash functions
[x] digital signal processing
[x] Networks
[x] Bitwise operations
[ ] (list can be continued)

How to add it: For each topic, select a few repos and add them to the noncrypto_repos.txt file (branch: new_dataset). Make sure to download repos that don't implement cryptography at all. To check: use the "find file" function on github.com and search for "crypto" or similar keywords. Generally, it will surface crypto implementations if they are present in the repo. That way, we won't have to hand-label those.
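
For a repo that has already been downloaded, a local stand-in for the "find file" check might look like this sketch. The keyword list and the function name are assumptions, and the check only looks at file and folder names, not at file contents, so an empty result is a hint rather than a guarantee.

```python
import os

# Assumed keyword list; extend it as needed.
KEYWORDS = ("crypto", "cipher", "encrypt", "decrypt")


def find_suspect_paths(repo_root, keywords=KEYWORDS):
    """Return relative paths whose name or folder mentions a keyword."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), repo_root)
            rel = rel.replace(os.sep, "/")
            # Case-insensitive substring match on the whole relative path.
            if any(keyword in rel.lower() for keyword in keywords):
                hits.append(rel)
    return sorted(hits)
```

A repo for which this returns an empty list is a candidate for noncrypto_repos.txt; any hit means the repo probably belongs with the crypto repos instead.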

Crypto code:

Where to find the crypto code:

In the topics that are listed above. If you find a repo that contains crypto in one of those topics, don't throw it away, put it in here instead.
In topics like "cryptography", "cryptocurrency", "encryption" and so on.

How to add it: For each repo:

Download the repo manually.
Go to "data/untreated_crypto" folder. Inside the "all_files" folder, make a folder whose name is the username of the author of the repo (if that folder doesn't already exist). Inside that folder, copy the root folder of the repo. Then, do the same thing inside the "no_crypto_files" folder. Example: if you're downloading the "libsodium" repo by "jedisct1", make a "jedisct1" folder inside the all_files folder, then copy the root folder of the repo inside "jedisct1" (the root folder should also be named "libsodium"). Then repeat the operation inside the "no_crypto_files" folder. You should now have two copies of the repo, at "data/untreated_crypto/all_files/jedisct1/libsodium" and at "data/untreated_crypto/no_crypto_files/jedisct1/libsodium".
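
The two manual copies described above could also be made with a few lines of Python. This is only a sketch: the function name `stage_repo` is hypothetical, and it assumes the repo has already been downloaded and that its root folder carries the repo's name.

```python
import shutil
from pathlib import Path


def stage_repo(downloaded_root, username, data_dir="data/untreated_crypto"):
    """Copy a downloaded repo into both all_files and no_crypto_files."""
    downloaded_root = Path(downloaded_root)
    targets = [
        Path(data_dir) / subfolder / username / downloaded_root.name
        for subfolder in ("all_files", "no_crypto_files")
    ]
    # Refuse to overwrite a repo that is already in the dataset.
    for target in targets:
        if target.exists():
            raise FileExistsError(f"{target} already exists")
    for target in targets:
        # copytree creates the missing parent folders (e.g. the username).
        shutil.copytree(downloaded_root, target)
    return targets
```

Calling `stage_repo("libsodium", "jedisct1")` from the project root would produce the two paths given in the example above, provided a local folder named "libsodium" holds the downloaded repo.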

Next up is the actual hand-labeling part. For a given repository, it would represent too much work to check every single file in the repo, so it is possible for us to simply ignore some files and leave them as "undefined", if we want. Those files won't appear at all in the data. As for the other files we check, they will get the appropriate label (0 or 1). Here is how to proceed (the following two steps are interchangeable and can be executed iteratively):

In the copy of the repo you've placed in the "no_crypto_files" folder: look at files in the repo, and delete the file if it contains crypto. You can ignore the non-C files.
In the copy of the repo you've placed in the "all_files" folder: delete the files/folders that you're not going to check (the "undefined" files). This is important: if there is a file or a folder that you haven't checked and you don't delete it, then the script will assume it's non-crypto. If you're comfortable with assuming that a file or a whole folder doesn't contain any crypto without checking it, then you can leave the file/folder here.

At the end of those steps, you should have the following:

An "all_files" folder with only the files that you've checked. Or, more generally, only files for which you are confident of their label.
A "no_crypto_files" that will contain "undefined" files (the script will ignore them) and non_crypto files. Be careful with this: if you flag a crypto file, it should not be in this folder. If you flag a file as non_crypto, it should be in this folder. If there is a file you haven't checked and you leave as undefined, you should delete it from "all_files", but you can leave it in "no_crypto_files" for convenience (it'll be ignored anyway).

The script will take things from there and will build the data by itself.

After all this, one run of the build_dataset script will be enough to build the jsons and have our clean dataset.
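
To make the convention explicit, here is a sketch of how labels can be derived from the two folders. It is inferred from the description above and is not the build_dataset script itself; the function name, the restriction to .c and .h files, and the choice of 1 for crypto and 0 for non-crypto are assumptions.

```python
from pathlib import Path


def label_files(all_files_root, no_crypto_root, extensions=(".c", ".h")):
    """Map each checked file (relative path) to 0 (non-crypto) or 1 (crypto).

    - In all_files and in no_crypto_files: non-crypto, label 0.
    - In all_files only: it was deleted from no_crypto_files, label 1.
    - In no_crypto_files only: "undefined", ignored.
    """
    all_root = Path(all_files_root)
    no_crypto_root = Path(no_crypto_root)
    labels = {}
    for path in sorted(all_root.rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        rel = path.relative_to(all_root)
        labels[rel.as_posix()] = 0 if (no_crypto_root / rel).is_file() else 1
    return labels
```

Under this reading, an unchecked file left in both folders silently becomes a non-crypto example, which is why deleting unchecked files from "all_files" matters.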

arnaudstiegler commented 5 years ago

I would just advise avoiding downloading libsodium since it's already in there 😛 For the crypto libraries, check what's already in the repo before adding something

arnaudstiegler commented 5 years ago

I'll do some non-crypto datasets, since I did crypto for the first batch of data

arthurherbout commented 5 years ago

I will continue on the non-crypto stuff

arnaudstiegler commented 5 years ago

The crypto datasets we have already extracted are:

wolfssl
openssl
libsodium
NaCL
ARMmbed
Nettle
libgcrypt

From those repos, I extracted the ciphers mostly, usually stored in a crypto folder. If you wanna go over one more time, be my guest, but it would be more interesting to get different repos

arthurherbout commented 5 years ago

I am currently doing the ML topic. Any idea on the number of repos per topic?

arnaudstiegler commented 5 years ago

I'll look into databases and some hashing things. I've extracted:

Note that:

some database repos contain hashing (but no crypto)
all the hashing repos are very small and only implement one or a few hash functions (to make sure the hashing is not being used for crypto purposes)

redouane-dziri commented 5 years ago

Doing non-crypto:

arthurherbout commented 5 years ago

machine learning: https://github.com/dmlc/xgboost https://github.com/microsoft/CNTK https://github.com/mozilla/DeepSpeech https://github.com/microsoft/LightGBM https://github.com/apple/turicreate https://github.com/horovod/horovod https://github.com/catboost/catboost https://github.com/google/mediapipe https://github.com/mlpack/mlpack https://github.com/aksnzhy/xlearn https://github.com/shogun-toolbox/shogun https://github.com/interpretml/interpret
math: https://github.com/recp/cglm https://github.com/felselva/mathc https://github.com/libtom/libtommath https://github.com/shibatch/sleef https://github.com/mrDIMAS/DmitrysEngine
OS: these usually have a crypto folder, and we can assume all crypto is there. https://github.com/reactos/reactos https://github.com/RIOT-OS/RIOT https://github.com/klange/toaruos (no such folder) https://github.com/yodaos-project/yodaos https://github.com/illumos/illumos-gate https://github.com/SilverRainZ/OS67 (no such folder) https://github.com/mohamed-anwar/Aquila (no such folder)

arthurherbout commented 4 years ago

I have labeled wolfssl, openssl and the OS repos given the much more precise definition of crypto code. I still have to do the other crypto libraries.

arthurherbout / crypto_code_detection

Improve quality of data #35
