corentinllorca opened 5 years ago
Feature engineering brainstorm
Some count features (number of lines, number of function definitions, number of loops, number of includes, number of bitwise operations, count of hex constants like 0xa753a6f5U, count of large arrays (and other structures), count of large numerical constants); see the count sketch after this list.
BoW, done in a smart way (e.g. collapsing all m-digit numbers for each m, separating parentheses, hash signs and brackets from surrounding characters before running the tokenizer, ...); see the tokenization sketch after this list.
C tokenizer
Supervised term weighting (basically a tf-idf variant where the idf term is informed by how discriminative a word is for the classification problem), with a regularization term that prevents undesired target leakage; one possible scheme is sketched after this list.
Highlight the presence of crypto terms from a keyword list we can fetch from Wind River's crypto-detector.
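As a rough illustration of the count features, here is a minimal sketch over raw C source using simple regexes; the patterns and thresholds are assumptions for illustration, not a final implementation, and a real C tokenizer would do this more robustly.

```python
# Rough count-feature sketch over raw C source (regex approximations).
import re

def count_features(source: str) -> dict:
    return {
        "n_lines": len(source.splitlines()),
        "n_includes": len(re.findall(r"^\s*#\s*include", source, flags=re.MULTILINE)),
        "n_loops": len(re.findall(r"\b(?:for|while)\s*\(", source)),
        # note: also matches logical && and ||, acceptable as a rough signal
        "n_bitwise_ops": len(re.findall(r"<<|>>|[&|^~]", source)),
        # hex literals such as 0xa753a6f5U (6+ hex digits, illustrative threshold)
        "n_hex_constants": len(re.findall(r"0[xX][0-9a-fA-F]{6,}", source)),
        # "large" numerical constants: 6+ decimal digits (illustrative threshold)
        "n_large_numbers": len(re.findall(r"\b\d{6,}\b", source)),
    }
```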
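For the BoW features, a sketch of the pre-tokenization normalization mentioned above: separating parentheses, # and brackets from surrounding characters and collapsing every m-digit number into one token per length m (the NUM_m token name is an arbitrary choice).

```python
# Pre-tokenization normalization sketch for the BoW features.
import re

def normalize_for_bow(source: str) -> str:
    # Separate parentheses, brackets, braces and '#' from surrounding characters.
    text = re.sub(r"([()\[\]{}#])", r" \1 ", source)
    # Collapse every m-digit number into a single NUM_m token, for each length m.
    text = re.sub(r"\b\d+\b", lambda m: f"NUM_{len(m.group(0))}", text)
    return text
```

The output can then be fed to any whitespace-based tokenizer or a CountVectorizer.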
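For the supervised term weighting, one concrete scheme (among others) is tf-rf, where the idf factor is replaced by a "relevance frequency" based on how often a term appears in crypto vs. non-crypto files; to avoid target leakage these weights must be fitted on the training split only, and the constants below follow the usual tf-rf formula.

```python
# Supervised term-weighting sketch (tf-rf style): the idf part is replaced by a
# factor that grows when a term appears mostly in crypto (label 1) files.
# Fit on the training split only to avoid target leakage.
import math
from collections import Counter

def relevance_weights(docs_tokens, labels):
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs_tokens, labels):
        bucket = pos if label == 1 else neg
        for term in set(tokens):
            bucket[term] += 1
    # rf(t) = log2(2 + a / max(1, c)); a = crypto docs containing t, c = non-crypto docs
    return {t: math.log2(2 + pos[t] / max(1, neg[t])) for t in set(pos) | set(neg)}
```

The final feature for term t in a file would then be its term frequency times this weight.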
Some necessary preprocessing:
95% accuracy with XGBoost on the simple features that we extracted, obtained on 2,150 files.
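For reference, a minimal sketch of this kind of baseline, assuming the simple features have already been assembled into a matrix X with binary labels y (the file names are placeholders):

```python
# Minimal XGBoost baseline sketch on the hand-crafted features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = np.load("features.npy")  # placeholder: (n_files, n_features) count features
y = np.load("labels.npy")    # placeholder: 1 = crypto, 0 = non-crypto

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```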
Just a thought: all the features that appear as counts should, in my opinion, be divided by the number of lines of that file. That way each count becomes a density. I think that makes more sense than the raw number of occurrences, which will be biased towards long files, and I think it adds complexity to the classification task in a good way.
Let me know your thoughts on that.
Yes, I thought about it and it should be done. Although this rings false to me:

> That way it becomes a density.

There is no guarantee that the result will be < 1, and the features won't sum to 1; it's rather a way to keep the features within fairly low bounds.
You're right; let's say intensive values :)
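Concretely, the normalization being discussed is just dividing each raw count by the file's line count; a tiny sketch, assuming a feature dictionary like the count_features one sketched earlier:

```python
def to_intensive(features: dict) -> dict:
    # Divide every count by the number of lines, keeping n_lines itself as-is.
    n = max(1, features["n_lines"])
    return {k: (v if k == "n_lines" else v / n) for k, v in features.items()}
```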
This discussion will contain the list of features that we'll extract from the data.
Base features (keyword matching):
Keyword search in the variable names and the comments (the Wind River crypto-detector has a pretty complete list of keywords); see the sketch below.
Number of bitwise operations: Hadrien has already started that
More advanced features: look at https://arxiv.org/abs/1706.02769 (SourceForager).
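A minimal sketch of what the keyword matching on identifiers and comments could look like; the keyword list below is a small hand-picked placeholder, the real one would be pulled from Wind River's crypto-detector:

```python
# Keyword-matching sketch: count crypto-related keywords in identifiers and comments.
import re

CRYPTO_KEYWORDS = {"aes", "des", "rsa", "sha", "md5", "hmac", "cipher",
                   "encrypt", "decrypt", "nonce"}  # placeholder list

def keyword_hits(source: str) -> int:
    comments = re.findall(r"/\*.*?\*/|//[^\n]*", source, flags=re.DOTALL)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    haystack = " ".join(comments + identifiers).lower()
    return sum(haystack.count(k) for k in CRYPTO_KEYWORDS)
```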