Closed corentinllorca closed 4 years ago
A useful list of datasets can be found here. Notably, it includes a hand-labelled "code similarity" dataset, though it seems we're not going to go down that route. There is also an "identifiers" dataset that counts how many times a given variable name appears across programming languages. For now it isn't very useful, because we'd like to distinguish crypto code from non-crypto code within a single programming language. Later on, however, it might serve as a base distribution for variable names in a given programming language.
The most "obvious" available dataset for that would be the Public Git Archive. It consists of 260k+ top-bookmarked repositories from GitHub, with 136M+ files and ~28 billion lines of code. We'd have to trim it down to repositories containing C (or repos that are fully in C, or just the files written in C), but even then we might have to downsample, given that the full dataset is about 6 TB.
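To make the "trim, then downsample" idea concrete, here is a minimal sketch that filters a stream of file paths down to C files and keeps a fixed-size uniform sample without holding the full listing in memory (reservoir sampling). It does not use any Public Git Archive tooling: the input is just an iterable of path strings, and the names `is_c_file`, `reservoir_sample` and `C_EXTENSIONS` are hypothetical. Treating `.h` as C is an assumption, since headers can also belong to C++ projects.

```python
import random
from typing import Iterable, List

# Assumption: these extensions identify C sources. ".h" may also be C++.
C_EXTENSIONS = (".c", ".h")


def is_c_file(path: str) -> bool:
    """Return True if the path looks like a C source or header file."""
    return path.endswith(C_EXTENSIONS)


def reservoir_sample(paths: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Keep a uniform random sample of at most k C files from a stream of paths."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[str] = []
    seen = 0  # number of C files encountered so far
    for path in paths:
        if not is_c_file(path):
            continue
        seen += 1
        if len(sample) < k:
            # Fill the reservoir with the first k C files.
            sample.append(path)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = path
    return sample
```

Because the sample is built in one pass, the same function works whether the paths come from a file index or from walking a directory tree.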
Another useful link here: solutions to LeetCode problems in C++. This doesn't necessarily increase the diversity of our dataset, since most programs we have collected so far are algorithmic problem-solving programs, but it's a good one to keep.
For the sake of convenience, let's start by sampling some git repos and fetching all the C files from them. Let's shoot for ~5000 (we'll see how fast the fetching goes and adapt if needed).
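As a rough sketch of the fetching step, assuming the sampled repos have already been cloned into one local directory, the C files could be gathered like this. The function name `collect_c_files` and the optional `limit` cap are hypothetical, and only `.c` files are matched here; whether headers should be included is left open.

```python
import os
from pathlib import Path
from typing import List, Optional


def collect_c_files(root: str, limit: Optional[int] = None) -> List[Path]:
    """Walk every repo under root and return paths to .c files, up to limit."""
    if limit is not None and limit <= 0:
        return []
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata and visit directories in a stable order.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            if name.endswith(".c"):
                found.append(Path(dirpath) / name)
                if limit is not None and len(found) >= limit:
                    return found
    return found
```

Calling `collect_c_files("repos", limit=5000)` would stop at the cap, where `repos` is a placeholder for wherever the clones live. Note that a cap applied this way favours whichever repos are walked first, so it is not a uniform sample across repos.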
See #1
Need to think about where and what to get for additional non-crypto sources.