arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Step 2: Feature Extraction #2

Open corentinllorca opened 4 years ago

corentinllorca commented 4 years ago

This discussion will contain the list of features that we'll extract from the data.

Base features (keyword matching):

More advanced features: look at https://arxiv.org/abs/1706.02769 (Source Forager)

redouane-dziri commented 4 years ago

Feature engineering brainstorm

Some necessary preprocessing:

Hadrien-Cornier commented 4 years ago

95% accuracy with XGBoost on the simple features that we extracted, measured on 2150 files.

arthurherbout commented 4 years ago

Just a thought: all the features that appear as counts should, in my opinion, be divided by the number of lines in the file. That way it becomes a density. I think that makes more sense than the raw number of occurrences, which is biased towards long files. I think it adds complexity to the classification task in a good way.
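A minimal sketch of the normalization I have in mind, assuming the count features live in a plain dict. The function names and the feature names (`xor_ops`, `shift_ops`) are made up for illustration and are not the ones in our extraction code:

```python
from typing import Dict


def count_lines(source: str) -> int:
    """Number of lines in the file, floored at 1 so we never divide by zero."""
    return max(1, len(source.splitlines()))


def normalize_counts(raw_counts: Dict[str, int], source: str) -> Dict[str, float]:
    """Divide every count feature by the line count of the file it came from."""
    n_lines = count_lines(source)
    return {name: count / n_lines for name, count in raw_counts.items()}


# Hypothetical three-line file with hypothetical raw counts.
source = "a ^= b;\nc = a << 3;\nreturn c;\n"
raw = {"xor_ops": 1, "shift_ops": 1}
per_line = normalize_counts(raw, source)
```

With this, a 30-line file and a 3000-line file with the same rate of XOR operations per line end up with the same feature value, instead of differing by a factor of 100.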

Let me know your thoughts on that.

redouane-dziri commented 4 years ago

Yes, thought about it and it should be done. Although this rings false to me:

That way it becomes a density.

There is no guarantee that each value will be < 1, and the features won't sum to 1; it's rather a way to keep features within fairly low bounds.
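To make that concrete, here is a quick sketch on a made-up one-line file (the snippet and the variable names are purely illustrative): a single dense line already pushes a per-line count above 1.

```python
# A hypothetical file consisting of one dense line of C.
line = "x ^= (a << 3) ^ (b >> 5) ^ k;"
n_lines = len(line.splitlines())

# Per-line values for two illustrative count features.
xor_per_line = line.count("^") / n_lines
shift_per_line = (line.count("<<") + line.count(">>")) / n_lines

# Neither value is bounded by 1, and together they do not sum to 1.
total = xor_per_line + shift_per_line
```

So the division rescales the counts by file length, but the result is not a probability-style density.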

arthurherbout commented 4 years ago

You're right, let's call them intensive values :)