arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

(1) Extract non-crypto files from outside source #6

Closed corentinllorca closed 4 years ago

corentinllorca commented 4 years ago

See #1

Need to think about where and what to get for additional non-crypto sources.

corentinllorca commented 4 years ago

A useful list of datasets can be found here. Notably, there is a hand-labelled "code similarity" dataset, but it seems we're not going to go down that route. There is also an "identifiers" dataset that counts how many times a given variable name appears across programming languages. For now it isn't very useful, because we want to distinguish crypto code from non-crypto code within a single programming language. Later on, however, it might serve as a base distribution for variable names in a given programming language.

The most "obvious" available dataset for additional non-crypto sources would be the Public Git Archive. This dataset consists of 260k+ top-bookmarked repositories from GitHub, with 136M+ files and ~28 billion lines of code. We'd have to trim it down to repositories containing C (or repos that are fully in C, or only the files written in C), but even then we might have to downsample, given that the full dataset is about 6TB.
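To make the trimming and downsampling idea concrete, here is a minimal sketch. It assumes we already have some iterable of file paths taken from the dataset index; the names `is_c_file` and `sample_c_files` and the extension set are hypothetical, not part of any Public Git Archive tooling. It uses reservoir sampling, so the full list of paths never has to fit in memory.

```python
import random
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumption: we treat .c and .h as "C files". Headers can also belong to
# C++ projects, so this set may need tightening.
C_EXTENSIONS = {".c", ".h"}


def is_c_file(path: str) -> bool:
    """Return True if the path has one of the assumed C extensions."""
    return PurePosixPath(path).suffix in C_EXTENSIONS


def sample_c_files(paths: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Keep a uniform random sample of at most k C files from a stream of paths."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reservoir: List[str] = []
    seen = 0  # number of C files encountered so far
    for path in paths:
        if not is_c_file(path):
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k C files.
            reservoir.append(path)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = path
    return reservoir
```

Something like `sample_c_files(all_paths, k=5000)` would then give a fixed-size subset no matter how large the filtered dataset turns out to be.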

corentinllorca commented 4 years ago

Another useful link here: those are solutions to LeetCode problems in C++. This doesn't necessarily increase the diversity of our dataset, as most of the programs we have collected so far are algorithmic problem-solving programs, but it's a good one to keep.

redouane-dziri commented 4 years ago

For the sake of convenience, let's start by sampling some git repos and fetching all the C files from there. Let's shoot for ~5000 (we'll see how fast the fetching goes and adapt if needed).
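A rough sketch of the fetching step, assuming the sampled repos have already been cloned side by side under one local directory (for example with `git clone --depth 1`). The function name `collect_c_files`, the directory layout, and the output naming scheme are all hypothetical, and the ~5000 target is read here as a number of files.

```python
import shutil
from pathlib import Path


def collect_c_files(repos_root: Path, out_dir: Path, target: int = 5000) -> int:
    """Copy .c files from cloned repos into out_dir, stopping at `target` files.

    Assumes every direct subdirectory of repos_root is one cloned repository.
    Returns the number of files copied.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    # Sort for a deterministic traversal order across runs.
    for repo in sorted(p for p in repos_root.iterdir() if p.is_dir()):
        for src in sorted(repo.rglob("*.c")):
            if copied >= target:
                return copied
            if not src.is_file():
                continue  # skip directories or broken links named *.c
            # Prefix with the repo name and a counter to avoid name clashes.
            dest = out_dir / f"{repo.name}__{copied:05d}__{src.name}"
            shutil.copyfile(src, dest)
            copied += 1
    return copied
```

Flattening everything into one output directory keeps later labelling simple, at the cost of losing the original folder structure. If that structure turns out to matter, the repo-relative path could be recorded in a side file instead.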