redouane-dziri closed this issue 4 years ago
Update on our progress
Working on branch `issue-14`.
We coded up some approximate count features using pattern matching:

- `int` declarations
- `long` declarations
- `while` + `for` loops

To reduce noise (pattern matching...), we strip the files of comments before generating the features (we strip comments using... guess what... pattern matching!). It seems to work well after inspecting a few files.
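A minimal sketch of the comment-stripping step, assuming C-style source files and a naive two-pass regex approach (the function name and patterns here are illustrative, not our actual code; a string literal containing `//` would be mis-stripped by this version):

```python
import re

# Block comments /* ... */ (non-greedy, across newlines) and // line comments.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")

def strip_comments(source: str) -> str:
    """Remove block comments first, then line comments."""
    without_blocks = BLOCK_COMMENT.sub("", source)
    return LINE_COMMENT.sub("", without_blocks)
```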
We also fetched regex patterns to detect function declarations, but the unrolled BNF grammar they generate is really big, and it takes forever to run. So we would rather rely on the (more precise anyway) function detection enabled by syntactic trees. (Would love an update on whether this is feasible @arthurherbout @arnaudstiegler)
Currently working on exploring those features. Plots we have in mind (mind you, the plots can be faceted on `data_source`, `class`, `header`), as boxplots and/or histograms:
- average number of lines (distribution as well?)
- counts of `int` and `long` declarations
- counts of loops
- counts of bitwise ops
- counts of file includes
We are also thinking of fetching the top-k includes: there may be a strong bias from crypto files invoking the same crypto libraries/files, which we should watch out for (maybe strip some includes later).
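The count features and the top-k includes could be sketched along these lines, again assuming C-like sources (the pattern names and function signatures are illustrative; the bitwise pattern is deliberately approximate and will also match logical `&&`/`||` as two operators each):

```python
import re
from collections import Counter

# Approximate, word-boundary-based patterns for each count feature.
PATTERNS = {
    "int_decl": re.compile(r"\bint\b"),
    "long_decl": re.compile(r"\blong\b"),
    "loops": re.compile(r"\b(?:for|while)\b"),
    "bitwise": re.compile(r"(?:<<|>>|[&|^~])"),  # approximate: && counts as 2
    "includes": re.compile(r"#include\s*[<\"]([^>\"]+)[>\"]"),
}

def count_features(source: str) -> dict:
    """One approximate count per feature for a single (comment-stripped) file."""
    return {name: len(pat.findall(source)) for name, pat in PATTERNS.items()}

def top_k_includes(sources: list, k: int = 10) -> list:
    """Most common included headers across a corpus of files."""
    counts = Counter()
    for src in sources:
        counts.update(PATTERNS["includes"].findall(src))
    return counts.most_common(k)
```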
Also some exploration of the raw data:
Counts alone do not suffice; we need to take the different lengths of files into account. We will also code up binary features and the associated plots.
Next steps we'll be taking in the coming days:
Is this still relevant, @corentinllorca? With Hadrien's experiments, I feel like we no longer need to train a model on our features alone. If not, let's close this.
I agree. Let's close it.
See #2