redouane-dziri closed this issue 4 years ago
Update on our progress
Working on branch `issue-14`.
We coded up some approximate count features using pattern matching:

- `int` declarations
- `long` declarations
- `while` + `for` loops

To reduce noise (pattern matching...), we strip the files of comments before generating the features (we strip comments using... guess what... pattern matching!). It seems to work well after inspecting a few files.
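A minimal sketch of the comment-stripping step, assuming C-style source files and a naive two-pass regex approach (the function name and patterns here are illustrative, not our actual code; a string literal containing `//` would be mis-stripped by this version):

```python
import re

# Block comments /* ... */ (non-greedy, across newlines) and // line comments.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")

def strip_comments(source: str) -> str:
    """Remove block comments first, then line comments."""
    without_blocks = BLOCK_COMMENT.sub("", source)
    return LINE_COMMENT.sub("", without_blocks)
```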
We also fetched regex patterns to detect function declarations, but the unrolled BNF grammar they generate is really big, and it takes forever to run. So we would rather rely on the (more precise anyway) function detection enabled by syntactic trees. (Would love an update on whether this is feasible @arthurherbout @arnaudstiegler)
Currently working on exploring those features. Plots we have in mind (mind you, the plots can be faceted on `data_source`, `class`, `header`), as boxplots and/or histograms:
- average number of lines (distribution as well?)
- counts of `int` and `long` declarations
- counts of loops
- counts of bitwise ops
- counts of file includes
We are also thinking of fetching the top-k includes: there may be a strong bias from crypto files invoking the same crypto libraries/files, which we should watch out for (maybe strip some includes later).
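The count features and the top-k includes could be sketched along these lines, again assuming C-like sources (the pattern names and function signatures are illustrative; the bitwise pattern is deliberately approximate and will also match logical `&&`/`||` as two operators each):

```python
import re
from collections import Counter

# Approximate, word-boundary-based patterns for each count feature.
PATTERNS = {
    "int_decl": re.compile(r"\bint\b"),
    "long_decl": re.compile(r"\blong\b"),
    "loops": re.compile(r"\b(?:for|while)\b"),
    "bitwise": re.compile(r"(?:<<|>>|[&|^~])"),  # approximate: && counts as 2
    "includes": re.compile(r"#include\s*[<\"]([^>\"]+)[>\"]"),
}

def count_features(source: str) -> dict:
    """One approximate count per feature for a single (comment-stripped) file."""
    return {name: len(pat.findall(source)) for name, pat in PATTERNS.items()}

def top_k_includes(sources: list, k: int = 10) -> list:
    """Most common included headers across a corpus of files."""
    counts = Counter()
    for src in sources:
        counts.update(PATTERNS["includes"].findall(src))
    return counts.most_common(k)
```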
Also some exploration of the raw data:
Counts alone do not suffice; we need to take the different lengths of files into account. We will also code up binary features and the associated plots.
Next steps we'll be taking in the coming days:
Is this still relevant, @corentinllorca? With Hadrien's experiments, I feel like we no longer need to train a model on our features alone. If not, let's close this.
I agree. Let's close it.
See #2