arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

(2) Create a couple of high-level features and explore #14

Closed redouane-dziri closed 4 years ago

redouane-dziri commented 4 years ago

See #2

redouane-dziri commented 4 years ago

Update on our progress

Working on branch issue-14.

We coded up some approximate count features using pattern matching:

To reduce the noise that pattern matching introduces, we strip comments from the files before generating the features (we strip comments using... guess what... pattern matching!). This seems to work well, judging from inspection of a few files.
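For illustration, here is a minimal sketch of regex-based comment stripping followed by an approximate count feature. It is not the code on the branch: the names `strip_c_comments` and `count_pattern` are hypothetical, and the shift-operator count is just a made-up example feature. The regex also matches string and character literals so that a `//` inside a string is not mistaken for a comment; it does not handle line continuations inside `//` comments.

```python
import re

# Matches, in order: line comments, block comments, string literals, char literals.
# Literals are matched only so that comment markers inside them are left alone.
_COMMENT_OR_LITERAL = re.compile(
    r'//[^\n]*'
    r'|/\*.*?\*/'
    r'|"(?:\\.|[^"\\])*"'
    r"|'(?:\\.|[^'\\])*'",
    re.DOTALL,
)


def strip_c_comments(source):
    """Return C source with comments replaced by a single space (hypothetical helper)."""
    def _replace(match):
        text = match.group(0)
        # Comments start with '/', literals start with a quote and are kept as-is.
        return " " if text.startswith("/") else text

    return _COMMENT_OR_LITERAL.sub(_replace, source)


def count_pattern(source, pattern):
    """Approximate count feature: number of non-overlapping regex matches."""
    return len(re.findall(pattern, source))


if __name__ == "__main__":
    code = 'int x = 1 << 2; // shift << here\n/* block >> */ char *s = "// kept";\n'
    stripped = strip_c_comments(code)
    # Example feature (an assumption, not from the issue): number of shift operators.
    print(count_pattern(stripped, r"<<|>>"))
```

Counting on the stripped source rather than the raw file keeps operators that only appear in comments out of the feature.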

We also fetched regex patterns to detect function declarations, but the pattern generated by unrolling the BNF grammar is really big and takes forever to run. So we would rather rely on the (more precise anyway) function detection enabled by syntax trees. (Would love an update on whether this is feasible @arthurherbout @arnaudstiegler)

We are currently exploring those features. Plots we have in mind (mind you, the plots can be faceted on data_source, class, header), as boxplots and/or histograms:

Also some exploration of the raw data:

Raw counts do not suffice: we need to take into account the different lengths of the files. We will also code up binary features and associated plots:
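As a rough sketch of what length normalization and binary features could look like (the function names, the feature names, and the choice of line count as the length measure are all assumptions, not what is on the branch):

```python
from typing import Dict


def normalize_counts(counts: Dict[str, int], n_lines: int) -> Dict[str, float]:
    """Turn raw counts into per-line rates so long and short files are comparable."""
    denominator = max(n_lines, 1)  # guard against empty files
    return {name: value / denominator for name, value in counts.items()}


def binarize_counts(counts: Dict[str, int]) -> Dict[str, int]:
    """Binary features: 1 if the pattern occurs at least once in the file, else 0."""
    return {name: int(value > 0) for name, value in counts.items()}


if __name__ == "__main__":
    # Hypothetical raw counts for one file of 8 lines.
    raw = {"shift": 4, "xor": 0}
    print(normalize_counts(raw, n_lines=8))
    print(binarize_counts(raw))
```

Dividing by the number of lines is only one option; character or token counts would work the same way.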

Next steps we'll be taking in the coming days:

redouane-dziri commented 4 years ago

Is this still relevant? @corentinllorca With Hadrien's experiments I feel like we don't need to train a model on our features alone anymore. If not, let's close this.

corentinllorca commented 4 years ago

I agree. Let's close it.