adamjanovsky / AndroidMalwareCrypto

The analysis of cryptography in Android malicious applications
MIT License

Discard features that result in "cheating" #10

Closed adamjanovsky closed 2 years ago

adamjanovsky commented 2 years ago

A brief look at the SHAP results showed that features unrelated to cryptography are actually the most influential in the detection models, namely anything related to the file size of the sample. While we cannot at the moment fully normalize all features for the already processed dataset (we don't have LoC numbers for all samples, see #9), we should do so for the Avast dataset.
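A minimal sketch of what normalizing by LoC could look like, assuming each sample is a plain dict of numeric features. The names `normalize_by_loc`, `loc`, and `n_crypto_calls` are hypothetical and not taken from the repository; samples without a LoC number are skipped rather than guessed.

```python
def normalize_by_loc(sample, count_features, loc_key="loc"):
    """Return a copy of `sample` with the given count features divided by LoC.

    Returns None when the LoC value is missing or zero, because such a
    sample cannot be normalized.
    """
    loc = sample.get(loc_key)
    if not loc:
        return None
    normalized = dict(sample)
    for name in count_features:
        normalized[name] = sample[name] / loc
    return normalized


# Hypothetical feature names and values, for illustration only.
sample = {"loc": 2000, "n_crypto_calls": 10}
print(normalize_by_loc(sample, ["n_crypto_calls"]))
```

With this, a raw count becomes a per-line rate, so a large benign app and a small malicious one are compared on crypto density rather than on size.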

We should for now drop the n_classes feature. Some of the information will still leak into the other features, but we can't easily fix that now.
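A rough way to see how much size information leaks into the remaining features is to check their correlation with the feature being dropped. The sketch below is an assumption-laden illustration: the helper names, the 0.9 threshold, and the sample rows are made up, and Pearson correlation only catches linear leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def drop_size_feature(rows, size_feature="n_classes", threshold=0.9):
    """Drop `size_feature` and list other features strongly correlated with it."""
    size = [row[size_feature] for row in rows]
    others = [name for name in rows[0] if name != size_feature]
    leaky = [
        name for name in others
        if abs(pearson([row[name] for row in rows], size)) >= threshold
    ]
    cleaned = [{name: row[name] for name in others} for row in rows]
    return cleaned, leaky


# Made-up rows, for illustration only.
rows = [
    {"n_classes": 10, "file_sqrt_count": 1.0, "uses_aes": 1},
    {"n_classes": 40, "file_sqrt_count": 2.0, "uses_aes": 0},
    {"n_classes": 90, "file_sqrt_count": 3.0, "uses_aes": 1},
    {"n_classes": 160, "file_sqrt_count": 4.0, "uses_aes": 0},
]
cleaned, leaky = drop_size_feature(rows)
print(leaky)
```

Features flagged this way are candidates for a closer look, not necessarily for removal.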

By the way, @dmacko232, what exactly does the file_sqrt_count feature do? Should we drop it as well?

Possibly, in the paper we'll use the Androzoo dataset merely to describe the usage of crypto API by malware authors, and we'll leave the new Avast dataset for malware classifiers. We can correctly normalize the data on the new dataset.

Still, it is an interesting observation that while malicious samples are in general much smaller in size, they contain much more crypto. We should explore the Avast dataset. My current hypothesis is that although malware samples contain fewer lines of code, they still contain more cryptography.

dmacko232 commented 2 years ago

I already dropped the n_classes feature. file_sqrt_count (the name is not really descriptive) represents the square root of the number of classes (files) that use the crypto API, so it's kinda related to crypto, but file size may still leak through this feature. The square root was used because the distribution of the feature was skewed.
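For reference, a minimal sketch of the transform as described above, assuming the input is the raw number of classes that use the crypto API. The function body and the example counts are illustrative, not copied from the feature extraction code.

```python
import math


def file_sqrt_count(n_crypto_classes):
    """Square root of the number of classes (files) that use the crypto API."""
    return math.sqrt(n_crypto_classes)


# Made-up counts with a long right tail, for illustration only.
counts = [0, 1, 4, 9, 100, 400]
print([file_sqrt_count(c) for c in counts])
```

The square root compresses the long right tail but stays monotonic in the raw count, so an app with more classes overall can still end up with a larger value.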

Overall, I think we should go over the features once again and decide if any of them need to be dropped.