dtrizna / quo.vadis

Hybrid Machine Learning Model for Malware Detection based on Windows Kernel Emulation
GNU General Public License v3.0

EMBER LGBM feature extraction error #1

Closed zangobot closed 2 years ago

zangobot commented 2 years ago

I am trying to evaluate the composite classifier (including both the EMBER and MalConv modules). While the MalConv model has no pretrained weights (as noted in the README), I am evaluating EMBER with the pretrained model provided by Elastic. The feature extractor crashes at this step (it crashes inside Ember, but the same line appears in the code provided inside this project): https://github.com/dtrizna/quo.vadis/blob/edc59e67c8ca8fe4c5809235b20931a24078154a/modules/sota/ember/features.py#L144

This happens because the lief method returns None (even for legitimate software) instead of raising the expected exception, so nothing is caught and execution halts. I am trying to isolate this issue further.

Attaching the sample code I wrote to test this:

import os
from models import CompositeClassifier  # import path assumed from the repo's example usage

if __name__ == '__main__':
    # 'FOLDER' is a placeholder directory containing the PE samples to score
    base_dir = os.path.join(os.path.dirname(__file__), 'FOLDER')
    files = [os.path.join(base_dir, f) for f in os.listdir(base_dir) if not f.startswith('.')]
    print(files)
    clf = CompositeClassifier(modules=['ember', 'emulation'])
    x = clf.preprocess_pelist(files)
    print(clf.predict_proba(x))
dtrizna commented 2 years ago

Hey!

Sorry for the delayed reply; at the moment I have paused work on this project due to resource constraints.

I acknowledge the issue and am able to replicate it. It looks like lief's behavior changed between 0.11.x and 0.12.x -- .section_from_offset() now returns None instead of throwing a lief.not_found exception, which is what features.py handles.
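
For reference, a minimal sketch of that behavioral difference (the sample path is a placeholder, and exact behavior depends on the installed lief version):

import lief

binary = lief.parse("sample.exe")  # placeholder path to any PE file

try:
    # lief 0.11.x: a bad offset raises lief.not_found, which Ember's features.py catches
    # lief 0.12.x: section_from_offset() returns None instead, so a later attribute access crashes
    section = binary.section_from_offset(binary.entrypoint)
    if section is None:
        print("lief >= 0.12 behavior: None returned, features.py would crash on .name")
    else:
        print("entry section:", section.name)
except lief.not_found:
    print("lief 0.11.x behavior: lief.not_found raised and handled")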

Rather than patching Ember's features.py, I pinned lief==0.11.5 (the last version acknowledged to behave as features.py expects) in requirements.txt. Be sure to reinstall lief!


P.S. At the moment, inference based on only two modules (['ember', 'emulation']) might require retraining a custom late-fusion model, which is not included in the modules/late_fusion_model/*.model files. You can do this with the provided arrays; code examples are available in evaluation/composite/*.ipynb. However, I plan to do this myself in the near future (and include this functionality in the CompositeClassifier class API), since I understand this is the most desirable combination (acquiring file paths is hard, and adding MalConv on top of Ember is often not worth the CPU cycles).

zangobot commented 2 years ago

Thank you for the answer! So, are the provided pretrained models just examples, or can they already be used? Can I use them separately?

dtrizna commented 2 years ago

Pre-trained models are functional.

The thing is that there are (1) "early" pass models -- the Ember GBDT, MalConv, emulation, and filepath models -- and (2) a "late" pass model that produces the final decision out of the "early" pass outputs.

Pre-trained "late" pass model is provided for all 4 modules enabled and expects 4-dimensional input. To use only 2 models, you need to train a new "late" pass model that expects only 2 modules.

I did that during the experiments but didn't push those "late" pass models to the repository. Code showing how to do it is in the notebooks I referenced above, and it can be done easily since the "early" pass arrays for the training set are provided in the repo.
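
For illustration, a minimal sketch of training a 2-module "late" pass model, assuming the "early" pass scores and labels are loaded as numpy arrays (the file names below are placeholders; the actual arrays and loading code are in the repo and in evaluation/composite/*.ipynb):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder file names for the "early" pass outputs on the training set
ember_scores = np.load("ember_train_scores.npy")          # P(malicious) from the Ember module
emulation_scores = np.load("emulation_train_scores.npy")  # P(malicious) from the emulation module
y_train = np.load("labels_train.npy")                     # 0 = goodware, 1 = malware

# Stack the per-module scores into the 2-dimensional input the new "late" model expects
X_train = np.column_stack([ember_scores, emulation_scores])

late_model = LogisticRegression().fit(X_train, y_train)
print(late_model.predict_proba(X_train[:5]))  # columns: [P(good), P(bad)]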

Hopefully, I will address that myself before Black Hat and implement it so it is seamless in the API.

The architecture is awkward, but (1) it supports modularity, so more modules can be added in the future w/o retraining the other parts of the composite model, and (2) what'd you expect from a single-person pet-project research artifact? :)

zangobot commented 2 years ago

Mmmmh, not sure I have understood the "early" and "late" distinction, but I'll still have a look around the project! Thank you for your answers, I totally understand the situation 👍

dtrizna commented 2 years ago

Update -- pretrained models for variable combinations of modules are now included.

See usage examples in the README or in example.py. TL;DR: any of these combinations are available:

meta_model = 'LogisticRegression', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['emulation']
meta_model = 'MultiLayerPerceptron', modules = ['filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'XGBClassifier', modules = ['emulation']
meta_model = 'XGBClassifier', modules = ['filepaths']

@zangobot, the code you posted in the original issue message should work as-is, but don't forget to git pull.
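
For instance, a minimal usage sketch based on the combinations listed above (the import path, keyword names, and file list are assumptions taken from the repo's example usage, not verified here):

from models import CompositeClassifier  # import path assumed from the repo's example usage

files = ['sample.exe']  # placeholder list of paths to PE samples

# Any meta_model / modules combination from the list above can be requested
clf = CompositeClassifier(meta_model='MultiLayerPerceptron',
                          modules=['ember', 'emulation'])
print(clf.predict_proba_pelist(pelist=files, return_module_scores=True))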

zangobot commented 2 years ago

Thanks!!!

zangobot commented 2 years ago

So, another related comment: I tried to use the fused emulation + ember model as in the script I sent. For a 100% malware sample and a 100% goodware sample, it returns the same class (the first one, which I assume is goodware). Both are well known (one is a publicly available Petya sample, the other one is Calculator). Is this happening because the fused model is not trained? Also, the numbers I see do not sum to 1; I am confused about the output of the model.

print(clf.predict_proba_pelist(pelist=files, return_module_scores=True))
(array([[8.78850041e-01, 1.21149959e-01],
       [9.99051378e-01, 9.48622057e-04]]),
       ember  emulation
0  0.999992   0.007340
1  0.000141   0.062108)
dtrizna commented 2 years ago

Hey, Luca!

Thanks for the questions!

1. The first element (array) is the output or "composite" decision -- the model's predictions taking into account all the modules you requested. Each row is a sample; the 0th column is the probability it is good, and the 1st column is the probability it is bad. These probabilities sum up to one:

>>> a = [8.78850041e-01, 1.21149959e-01]
>>> sum(a)
1.0
>>> b = [9.99051378e-01, 9.48622057e-04]
>>> sum(b)
1.000000000057

Afaik this is how most binary classifiers provide their output.

The module scores are just a complementary table with each module's probability that the sample is malicious (the probability it is good is omitted). I considered it self-explanatory, but maybe I should think about improving readability here...

I assume that in your example index 0 corresponds to the Petya sample. Ember thinks it is really bad with 0.9999, but the emulation model gives it only a low probability of being bad, 0.007340. Based on the array, the final model's decision is only 1.21149959e-01, i.e., ~12% probability of maliciousness.
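
In code, a rough sketch of reading that output (assuming, as in your printout, that the first element is an (n_samples, 2) numpy array and the second a pandas DataFrame of per-module scores; the file paths are placeholders):

from models import CompositeClassifier  # import path assumed from the repo's example usage

files = ['petya_sample.exe', 'calc.exe']  # placeholder paths, as in your test
clf = CompositeClassifier(modules=['ember', 'emulation'])

# First element: (n_samples, 2) array with columns [P(good), P(bad)];
# second element: per-module P(malicious) as a pandas DataFrame
probs, module_scores = clf.predict_proba_pelist(pelist=files, return_module_scores=True)

for i, path in enumerate(files):
    print(path, "composite P(malicious):", round(float(probs[i, 1]), 4))
    print(module_scores.loc[i])  # e.g. ember / emulation scores for this sample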

2. And here comes the question of why the emulation model doesn't detect Petya. Potentially, it is a limitation of the dataset I used; please refer to the table here:

https://github.com/dtrizna/quo.vadis#dataset

A paper that describes the data and its collection might come in handy here; I hope to be able to release it soon (it is ready but going through peer review). I can send you a draft if you are interested.

TL;DR:

We collected in-the-wild samples during a specific time period (train & validation sets ~Jan 2022, test set ~Apr 2022). Therefore, we might lack representative functionality from families that were active earlier, like Petya.

The dataset is large for behavioral malware analysis (usually folks use only a few thousand samples because detonating malware in a VM is costly; we processed ~100k samples thanks to emulation instead of VMs), but of course it still doesn't generalize across the "true" malware distribution in the way the Ember dataset does (~1M static feature vectors).

Still, the model should've seen some ransomware samples, since this label is in the training set. It would be nice to take a look at Petya's emulation report to see what functionality is revealed.

zangobot commented 2 years ago

Hello, thank you for the reply; I was trusting my eyes more than the math I should have computed :D Anyway, I usually use Petya as a test since it is a very (in)famous malware. I can't remember if I am using a packed / obfuscated version of it. I guess I will try the emulation network with more recent malware then.

Thanks again for the support!

dtrizna commented 2 years ago

No problem with questions, I am pleased! :)

One note I forgot to add -- I cannot release the raw PE dataset because of the privacy policy, but the emulation reports are released publicly (password: infected):

https://github.com/dtrizna/quo.vadis/tree/main/data/emulation.dataset

The archives contain all training, validation, and test set hashes with the naming convention <hash>.json. All the samples I checked manually last week were already on VirusTotal (still, some might be missing), so it is possible to fetch the full PEs too with a little bit of work.
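
For example, a minimal sketch of inspecting one extracted report (the path and hash are placeholders; the JSON structure is only explored here, not assumed):

import json

# Placeholder path -- extract the archive first (password "infected"),
# then pick any of the <hash>.json reports
report_path = "data/emulation.dataset/<hash>.json"

with open(report_path) as f:
    report = json.load(f)

print(report.keys())  # explore the report structure, e.g. emulated API calls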

zangobot commented 2 years ago

Good! Thank you for that :)