Discard features that result in "cheating"

A brief look at the SHAP results showed that features unrelated to cryptography are actually the most influential in the detection models, namely anything related to the file size of the sample. While we cannot at the moment fully normalize all features for the already processed dataset (we don't have LoC numbers for all samples, see #9), we should do so in the Avast dataset.
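To make the normalization idea concrete, here is a minimal sketch, assuming each sample is a dict of raw counts plus a lines-of-code value. The key names `loc` and `n_crypto_calls` are hypothetical and not taken from the actual feature set. Samples with missing LoC get `None` rather than a made-up value, which mirrors the problem from #9.

```python
def normalize_by_loc(sample, count_keys, loc_key="loc"):
    """Return a copy of `sample` with the given counts divided by lines of code.

    Counts become None when LoC is missing or zero, so such samples can be
    filtered out later instead of silently getting a bogus value.
    """
    loc = sample.get(loc_key)
    normalized = dict(sample)
    for key in count_keys:
        normalized[key] = sample[key] / loc if loc else None
    return normalized


# Placeholder samples for illustration only; the key names are assumptions.
samples = [
    {"loc": 1000, "n_crypto_calls": 10},
    {"loc": 200, "n_crypto_calls": 4},
    {"loc": None, "n_crypto_calls": 3},  # LoC unknown, cannot be normalized
]
normalized = [normalize_by_loc(s, ["n_crypto_calls"]) for s in samples]
```

The original dicts are left untouched, so raw and normalized features can be compared side by side.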

We should for now drop the n_classes feature. Some of the information will still leak into the other features, but we can't easily fix that now.
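A minimal sketch of the drop, assuming the features are held as a list of names plus rows of values in the same order. Only `n_classes` comes from this issue; `n_aes_calls` and the helper name `drop_features` are hypothetical. With a pandas DataFrame the equivalent would be `df.drop(columns=[...])`.

```python
# Features considered size proxies; extend once others are confirmed.
SIZE_RELATED = frozenset({"n_classes"})


def drop_features(feature_names, rows, to_drop=SIZE_RELATED):
    """Remove the named features from the header and from every row."""
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    kept_names = [feature_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows


# Placeholder data for illustration only.
names = ["n_classes", "n_aes_calls"]
rows = [[120, 3], [15, 7]]
names_kept, rows_kept = drop_features(names, rows)
```

Keeping the drop list in one place makes it easy to add further features if they turn out to leak size information as well.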

By the way, @dmacko232, what exactly does the file_sqrt_count feature do? Should we drop it as well?

Possibly, in the paper we'll use the Androzoo dataset merely to describe the usage of the crypto API by malware authors and reserve the new Avast dataset for the malware classifiers, since we can normalize the data correctly on the new dataset.

Still, it is an interesting observation that while malicious samples are in general much smaller in size, they contain much more crypto. We should explore this in the Avast dataset. My current hypothesis is that although malware samples contain fewer lines of code, they still contain more cryptography.
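One way the hypothesis could be checked once LoC is available is to compare, per label, the median LoC and the median crypto usage per line of code. This is only a sketch: the keys `label`, `loc` and `n_crypto_calls` are assumptions, crypto calls per LoC is just one possible reading of "more cryptography", and the values below are made-up placeholders, not results from either dataset.

```python
from statistics import median


def summarize_by_label(samples):
    """Return {label: (median LoC, median crypto calls per LoC)}."""
    groups = {}
    for s in samples:
        if s.get("loc"):  # skip samples with missing or zero LoC
            groups.setdefault(s["label"], []).append(s)
    return {
        label: (
            median(s["loc"] for s in group),
            median(s["n_crypto_calls"] / s["loc"] for s in group),
        )
        for label, group in groups.items()
    }


# Made-up placeholder values, for illustration only.
toy = [
    {"label": "malicious", "loc": 100, "n_crypto_calls": 5},
    {"label": "malicious", "loc": 300, "n_crypto_calls": 6},
    {"label": "benign", "loc": 1000, "n_crypto_calls": 10},
    {"label": "benign", "loc": 3000, "n_crypto_calls": 15},
]
summary = summarize_by_label(toy)
```

Medians are used here rather than means because sample sizes are likely to be heavily skewed; that choice is also an assumption.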

adamjanovsky / AndroidMalwareCrypto #10