arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Improve quality of data #35

Open corentinllorca opened 4 years ago

corentinllorca commented 4 years ago

We suspect that our dataset either does not cover enough types of code files or is "too separable" between cryptographic and non-cryptographic code. This might be because cryptographic code is inherently easy to separate from other types of code, or because our dataset hasn't been curated well enough. To test that, we will try to add specific categories of non-cryptographic code to our dataset, from diverse types of projects. Eventually, the aim might even be to replace the randomly generated "other" dataset entirely.

We need to select types of code that are diverse enough for our model to generalize and, at the same time, "close" enough to crypto code that our algorithm has to make a finer separation.

This raises the question of where cryptography ends. Ex: is networked communication (IPv4/IPv6) considered cryptography?

corentinllorca commented 4 years ago

Found: the GitHub Topics page, which groups repositories into "topics". This is very useful, as it lets us look at each topic and hand-pick the repos that we want to integrate into our dataset. It would even allow us to include more crypto code, because there is a cryptography topic (and also encryption, decryption), with more than 3000 repos.

corentinllorca commented 4 years ago

Some nice topics that use bit-level operations but that don't seem to be cryptography:

arnaudstiegler commented 4 years ago

I think it is a very good idea to expand the dataset to make the problem more challenging. And based on what I've seen, I think we should write down some guidelines for manually labeling files so that we don't face the same issues we have with v1.

A few random thoughts on that based on what I've seen from the model results:

redouane-dziri commented 4 years ago

Header files are problematic indeed; I'm unsure we would all agree on which .h files should be considered crypto. Good point, @arnaudstiegler.

corentinllorca commented 4 years ago

We have decided to keep all .c and .h files for certain repos. We need the .h files, otherwise the parsers to build ASTs won't function. This means that we'll have to be extra careful when labeling those files.
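A minimal sketch of what "keep all .c and .h files" could look like for a locally cloned repo. The function name `collect_c_files` and the idea of walking a local checkout are assumptions for illustration, not the project's actual extraction code.

```python
from pathlib import Path


def collect_c_files(repo_root):
    """Return (relative_path, is_header) pairs for every .c and .h file.

    Hypothetical helper: assumes the repo has already been cloned to repo_root.
    """
    root = Path(repo_root)
    found = []
    for path in root.rglob("*"):
        # Skip directories and anything that is not a C source or header file.
        if path.is_file() and path.suffix in {".c", ".h"}:
            relative = path.relative_to(root).as_posix()
            found.append((relative, path.suffix == ".h"))
    return sorted(found)
```

Keeping the header flag next to the path makes it easy to treat .h files separately when labeling.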

corentinllorca commented 4 years ago

Change of approach: since hand-labeling the files seems like a very painful thing to do (but we might still do it later), we will hand-label the repos instead. There might be a high number of false positives here, but we'll look at that later on. The final JSON will look like this:

[
    {
        file_name: ...,
        is_header: ...,
        source_username: ...,
        source_repo: ...,
        file_path: ...,
        label: ...,
        content: ...
    },
    {
        file_name: ...,
        is_header: ...,
        source_username: ...,
        source_repo: ...,
        file_path: ...,
        label: ...,
        content: ...
    },
    ...
]

source_repo is the URL of the repo the file originally came from, source_username is the username of that repo's author, and file_path is the path of the file within the repo. We keep all of this data to uniquely identify files (some files might have the same name even within the same repo, and some repos by different authors might have the same name too).
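To illustrate why the three source fields are kept together, here is a small sketch that builds one record with the fields listed above and derives an identity key from it. The helper names `make_record` and `record_key` are hypothetical, and deriving `is_header` from the file extension is an assumption.

```python
import json


def make_record(source_username, source_repo, file_path, label, content):
    """Build one dataset entry with the fields from the JSON layout above."""
    file_name = file_path.rsplit("/", 1)[-1]
    return {
        "file_name": file_name,
        "is_header": file_name.endswith(".h"),  # assumption: derived from extension
        "source_username": source_username,
        "source_repo": source_repo,
        "file_path": file_path,
        "label": label,
        "content": content,
    }


def record_key(record):
    """Identity of a file: author, repo and path taken together."""
    return (record["source_username"], record["source_repo"], record["file_path"])


def to_json(records):
    """Serialize the list of records; raises if two records share a key."""
    keys = [record_key(r) for r in records]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate (username, repo, path) in records")
    return json.dumps(records, indent=4)
```

Two files named `util.h` in different folders of the same repo get different keys, because the key uses the full path rather than the file name.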

corentinllorca commented 4 years ago

Pushed the extract_data script to the new_dataset branch, and modified the train_test script so that it stratifies by label instead of by data source. Please note that each repo has a single label that applies to all of its files. This will obviously lead to false positives, but we'll address that issue later. Things we now have to do and think about:
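For reference, stratifying by label means each label keeps roughly the same share in the train and test sets. The sketch below shows the idea with the standard library only; it is not the train_test script itself, and `stratified_split`, `test_fraction` and `seed` are names chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split records so each label has about the same share in both sets."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Note that this sketch splits at the file level, so files from the same repo can end up on both sides of the split.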

arnaudstiegler commented 4 years ago

@corentinllorca, I think the change of scope would be a problem for many reasons:

corentinllorca commented 4 years ago

What to do next:

Non-crypto code:

List of topics (in progress):

How to add it: For each topic, select a few repos and add them to the noncrypto_repos.txt file (branch: new_dataset). Make sure to pick repos that don't implement cryptography at all. To check, use the "find file" function on github.com and search for "crypto" or similar keywords: generally, it will surface crypto implementations if they are present in the repo. That way, we won't have to hand-label those.
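The same keyword check can be approximated on a local clone. The sketch below is an assumption-laden stand-in for GitHub's "find file" search: the function name `suspicious_paths` and the keyword list are hypothetical, and short keywords such as "sha" will also match unrelated names like "shader", so hits still need a manual look.

```python
from pathlib import Path

# Hypothetical keyword list; extend or trim it as needed.
KEYWORDS = ("crypto", "cipher", "aes", "sha", "rsa", "hmac")


def suspicious_paths(repo_root, keywords=KEYWORDS):
    """Return relative paths whose name or folder contains a crypto keyword."""
    root = Path(repo_root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        # Case-insensitive substring match on the path inside the repo.
        if any(keyword in relative.lower() for keyword in keywords):
            hits.append(relative)
    return sorted(hits)
```

An empty result is a hint, not a proof, that the repo contains no cryptography.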

Crypto code:

Where to find the crypto code:

How to add it: For each repo:

Next up is the actual hand-labeling part. For a given repository, it would represent too much work to check every single file in the repo, so it is possible for us to simply ignore some files and leave them as "undefined", if we want. Those files won't appear at all in the data. As for the other files we check, they will get the appropriate label (0 or 1). Here is how to proceed (the following two steps are interchangeable and can be executed iteratively):

At the end of those steps, you should have the following:

The script will take things from there and will build the data by itself.

After all this, one run of the build_dataset script will be enough to build the jsons and have our clean dataset.
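As a rough illustration of the labeling rule described above (checked files get 0 or 1, unchecked files stay "undefined" and are left out of the data), here is a minimal sketch. How labels are actually stored is not stated in this thread, so the `labels` dictionary and the name `apply_labels` are assumptions.

```python
def apply_labels(file_paths, labels):
    """Pair each checked file with its label and drop undefined ones.

    labels is assumed to map a file path to 0 (non-crypto) or 1 (crypto);
    a path missing from the mapping is treated as "undefined".
    """
    kept = []
    for path in file_paths:
        label = labels.get(path)
        if label is None:
            continue  # undefined: the file does not appear in the data
        if label not in (0, 1):
            raise ValueError(f"unexpected label {label!r} for {path}")
        kept.append((path, label))
    return kept
```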

arnaudstiegler commented 4 years ago

I would just advise avoiding downloading libsodium since it's already in there 😛 For the crypto libraries, check what's already in the repo before adding something

arnaudstiegler commented 4 years ago

I'll do some non-crypto datasets, since I did crypto for the first batch of data

arthurherbout commented 4 years ago

I will continue on the non-crypto stuff

arnaudstiegler commented 4 years ago

The crypto datasets we have already extracted are:

From those repos, I mostly extracted the ciphers, usually stored in a crypto folder. If you wanna go over them one more time, be my guest, but it would be more interesting to get different repos.

arthurherbout commented 4 years ago

I am currently doing the ML topic. Any idea on the number of repos per topic?

arnaudstiegler commented 4 years ago

I'll look into databases and some hashing things. I've extracted:

Note that:

redouane-dziri commented 4 years ago

Doing non-crypto:

arthurherbout commented 4 years ago
arthurherbout commented 4 years ago

I have labeled wolfssl, openssl and the OS repos given the much more precise definition of crypto code. I still have to do the other crypto libraries.