corentinllorca opened 5 years ago
Feature engineering brainstorm
Some count features (number of lines, number of function definitions, number of loops, number of includes, number of bitwise operations, count of hex constants like 0xa753a6f5U, count of large arrays (and other structures), count of large numerical constants); see the count sketch after this list.
BoW, done in a smart way (e.g. collapsing all m-digit numbers for each m, separating parentheses, hash signs and brackets from surrounding characters before running the tokenizer, ...); see the tokenization sketch after this list.
C tokenizer
Supervised term weighting (basically a tf-idf variant where the idf term is informed by how discriminative a word is for the classification problem), with a regularization term that prevents undesired target leakage; one possible scheme is sketched after this list.
Highlight the presence of crypto terms from a keyword list we can fetch from Wind River's crypto-detector.
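As a rough illustration of the count features, here is a minimal sketch over raw C source using simple regexes; the patterns and thresholds are assumptions for illustration, not a final implementation, and a real C tokenizer would do this more robustly.

```python
# Rough count-feature sketch over raw C source (regex approximations).
import re

def count_features(source: str) -> dict:
    return {
        "n_lines": len(source.splitlines()),
        "n_includes": len(re.findall(r"^\s*#\s*include", source, flags=re.MULTILINE)),
        "n_loops": len(re.findall(r"\b(?:for|while)\s*\(", source)),
        # note: also matches logical && and ||, acceptable as a rough signal
        "n_bitwise_ops": len(re.findall(r"<<|>>|[&|^~]", source)),
        # hex literals such as 0xa753a6f5U (6+ hex digits, illustrative threshold)
        "n_hex_constants": len(re.findall(r"0[xX][0-9a-fA-F]{6,}", source)),
        # "large" numerical constants: 6+ decimal digits (illustrative threshold)
        "n_large_numbers": len(re.findall(r"\b\d{6,}\b", source)),
    }
```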
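For the BoW features, a sketch of the pre-tokenization normalization mentioned above: separating parentheses, # and brackets from surrounding characters and collapsing every m-digit number into one token per length m (the NUM_m token name is an arbitrary choice).

```python
# Pre-tokenization normalization sketch for the BoW features.
import re

def normalize_for_bow(source: str) -> str:
    # Separate parentheses, brackets, braces and '#' from surrounding characters.
    text = re.sub(r"([()\[\]{}#])", r" \1 ", source)
    # Collapse every m-digit number into a single NUM_m token, for each length m.
    text = re.sub(r"\b\d+\b", lambda m: f"NUM_{len(m.group(0))}", text)
    return text
```

The output can then be fed to any whitespace-based tokenizer or a CountVectorizer.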
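For the supervised term weighting, one concrete scheme (among others) is tf-rf, where the idf factor is replaced by a "relevance frequency" based on how often a term appears in crypto vs. non-crypto files; to avoid target leakage these weights must be fitted on the training split only, and the constants below follow the usual tf-rf formula.

```python
# Supervised term-weighting sketch (tf-rf style): the idf part is replaced by a
# factor that grows when a term appears mostly in crypto (label 1) files.
# Fit on the training split only to avoid target leakage.
import math
from collections import Counter

def relevance_weights(docs_tokens, labels):
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs_tokens, labels):
        bucket = pos if label == 1 else neg
        for term in set(tokens):
            bucket[term] += 1
    # rf(t) = log2(2 + a / max(1, c)); a = crypto docs containing t, c = non-crypto docs
    return {t: math.log2(2 + pos[t] / max(1, neg[t])) for t in set(pos) | set(neg)}
```

The final feature for term t in a file would then be its term frequency times this weight.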
Some necessary preprocessing:
95% accuracy with XGBoost on the simple features that we extracted, obtained on 2,150 files.
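For reference, a minimal sketch of this kind of baseline, assuming the simple features have already been assembled into a matrix X with binary labels y (the file names are placeholders):

```python
# Minimal XGBoost baseline sketch on the hand-crafted features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = np.load("features.npy")  # placeholder: (n_files, n_features) count features
y = np.load("labels.npy")    # placeholder: 1 = crypto, 0 = non-crypto

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```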
Just a thought: all the features that appear as counts should, in my opinion, be divided by the number of lines of that file. That way each count becomes a density. I think that makes more sense than the raw number of occurrences, which will be biased towards long files, and I think it adds complexity to the classification task in a good way.
Let me know your thoughts on that.
Yes, I thought about it and it should be done. Although this rings false to me:

> That way it becomes a density.

There is no guarantee that the result will be < 1, and the features won't sum to 1; it's rather a way to keep the features within fairly low bounds.
You're right; let's say intensive values :)
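Concretely, the normalization being discussed is just dividing each raw count by the file's line count; a tiny sketch, assuming a feature dictionary like the count_features one sketched earlier:

```python
def to_intensive(features: dict) -> dict:
    # Divide every count by the number of lines, keeping n_lines itself as-is.
    n = max(1, features["n_lines"])
    return {k: (v if k == "n_lines" else v / n) for k, v in features.items()}
```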
This discussion will contain the list of features that we'll extract from the data.
Base features (keyword matching):
Keyword search in the variable names and the comments (the Wind River crypto-detector has a pretty complete list of keywords); see the sketch below.
Number of bitwise operations: Hadrien has already started that
More advanced features: look at https://arxiv.org/abs/1706.02769 (SourceForager).
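A minimal sketch of what the keyword matching on identifiers and comments could look like; the keyword list below is a small hand-picked placeholder, the real one would be pulled from Wind River's crypto-detector:

```python
# Keyword-matching sketch: count crypto-related keywords in identifiers and comments.
import re

CRYPTO_KEYWORDS = {"aes", "des", "rsa", "sha", "md5", "hmac", "cipher",
                   "encrypt", "decrypt", "nonce"}  # placeholder list

def keyword_hits(source: str) -> int:
    comments = re.findall(r"/\*.*?\*/|//[^\n]*", source, flags=re.DOTALL)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    haystack = " ".join(comments + identifiers).lower()
    return sum(haystack.count(k) for k in CRYPTO_KEYWORDS)
```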