This repository is part of the following publication: https://dl.acm.org/doi/10.1145/3560830.3563726
Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations
:warning: The model is a research prototype, provided as-is, without warranty of any kind, in a pre-alpha state.
Dataset structure used for model pre-training is as follows:
Raw PE samles and in-the-wild filepaths are not disclosed due to Privacy Policy. However,
- PE emulation dataset available in [emulation.dataset](data/emulation.dataset/)
- Filepath dataset (open sources only, in-the-wild paths used for pre-training are excluded):
- augmented [samples](data/path.dataset/dataset_malicious_augumented.txt) and [logic](data/path.dataset/augment/augmentation.ipynb)
- [paths](data/path.dataset/dataset_benign_win10.txt) from clean Windows 10 host
## Citation
If you are inspired by the work or use data, please cite us:
```bibtex
@inproceedings{10.1145/3560830.3563726,
author = {Trizna, Dmitrijs},
title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations},
year = {2022},
isbn = {9781450398800},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3560830.3563726},
doi = {10.1145/3560830.3563726},
booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security},
pages = {127–136},
numpages = {10},
keywords = {reverse engineering, neural networks, malware, emulation, convolutions},
location = {Los Angeles, CA, USA},
series = {AISec'22}
}
```
## Architecture
Hybrid, modular structure for **malware classification**. Supported modules:
- 1D convolution neural network analysis of *API call sequence* obtained from [Speakeasy emulator](https://github.com/mandiant/speakeasy/)
- 1D convolution neural network analysis of filepath at the moment of execution (Kyadige and Rudd et al.,
## Environment Setup
Tested on Python `3.8.x` - `3.9.x`. Because of a large number of dependencies with specific versions (due to pre-trained machine learning models), we suggest using a virtual environment or `conda`:
```bash
% python3 -m venv QuoVadisEnv
% source QuoVadisEnv/bin/activate
(QuoVadisEnv)% python -m pip install -r requirements.txt
```
## Usage
API interface is available under `models.py`.
### Definition of classifier
```python
from models import CompositeClassifier
classifier = CompositeClassifier(meta_model = "MultiLayerPerceptron",
modules = ["ember", "emulation"],
root = "/home/user/quo.vadis/",
load_meta_model = True)
```
Available pretrained configurations:
```python
meta_model = 'LogisticRegression', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['emulation']
meta_model = 'MultiLayerPerceptron', modules = ['filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'XGBClassifier', modules = ['emulation']
meta_model = 'XGBClassifier', modules = ['filepaths']
```
### Evaluation on PE list
```python
pefiles = os.listdir("/path/to/PE/samples")
x = classifier.preprocess_pelist(pefiles)
probs = classifier.predict_proba(x)
```
You can use `predict_proba_pelist()` instead of `predict_proba()` to get probabilities out of the PE list right away instead of a preprocessed array:
```python
probs = classifier.predict_proba_pelist(pefiles)
```
Given that `filepaths` is specified in `modules = `, you have to specify the filepaths of the PE sample at the moment of execution using the `pathlist=` argument:
```python
filepaths = pd.read_csv(filepaths.csv, header=None)
probs = classifier.predict_proba_pelist(pefiles, pathlist=filepaths.values.tolist())
```
*Note!* `len(pefiles) == len(filepaths)`
### Re-Training
Using the `fit_pelist()` method and providing ground true labels for PE files -- malware (1) or benign (0):
```python
labels = load_labels()
classifier.fit_pelist(pefiles, labels, pathlist=filepaths.values.tolist())
```
### Example
An example usage can be found under `example.py`:
```text
# python example.py --example --how ember emulation filepaths
[*] Loading model...
WARNING:root:[!] Loading pretrained weights for ember model from: ./modules/sota/ember/parameters/ember_model.txt
WARNING:root:[!] Loading pretrained weights for filepath model from: ./modules/filepath/pretrained/torch.model
WARNING:root:[!] Using speakeasy emulator config from: ./data/emulation.dataset/sample_emulation/speakeasy_config.json
WARNING:root:[!] Loading pretrained weights for emulation model from: ./modules/emulation/pretrained/torch.model
WARNING:root:[!] Loading pretrained weights for late fusion MultiLayerPerceptron model from: ./modules/late_fustion_model/MultiLayerPerceptron15_ember_emulation_filepaths.model
[*] Legitimate 'calc.exe' analysis...
WARNING:root:[!] Taking current filepath for: evaluation/adversarial/samples_goodware/calc.exe
WARNING:root: [+] 0/0 Finished emulation evaluation/adversarial/samples_goodware/calc.exe, took: 0.19s, API calls acquired: 6
[!] Given path evaluation/adversarial/samples_goodware/calc.exe, probability (malware): 0.000005
[!] Individual module scores:
ember filepaths emulation
0 0.000015 0.00319 0.062108
WARNING:root: [+] 0/0 Finished emulation evaluation/adversarial/samples_goodware/calc.exe, took: 0.11s, API calls acquired: 6
[!] Given path C:\users\myuser\AppData\Local\Temp\exploit.exe, probability (malware): 0.549334
[!] Individual module scores:
ember filepaths emulation
0 0.000015 0.999984 0.062108
[*] BoratRAT analysis...
WARNING:root: [+] 0/0 Finished emulation ./b47c77d237243747a51dd02d836444ba067cf6cc4b8b3344e5cf791f5f41d20e, took: 0.25s, API calls acquired: 194
[!] Given path %USERPROFILE%\Downloads\BoratRat.exe, probability (malware): 0.9997
[!] Individual module scores:
ember filepaths emulation
0 0.035511 0.999602 0.96526
WARNING:root: [+] 0/0 Finished emulation ./b47c77d237243747a51dd02d836444ba067cf6cc4b8b3344e5cf791f5f41d20e, took: 0.25s, API calls acquired: 194
[!] Given path C:\windows\system32\calc.exe, probability (malware): 0.0392
[!] Individual module scores:
ember filepaths emulation
0 0.035511 0.086567 0.96526
```
## Evaluation
More detailed information about modules and individual tests:
- `./modules/emulation/`
- `./modules/filepaths/`
- `./modules/sota/`
Note! Parameters for the `sota` models can be downloaded from [here](https://github.com/endgameinc/malware_evasion_competition/tree/master/models).
Performance of this model on the proprietary dataset: ~90k PE samples with filepaths from real-world systems:
DET and ROC curves:
Detection rate with fixed False Positive rate:
## Future work
- Experiments with **retrained** MalConv / Ember weights -- it makes sense to evaluate them on the same distribution
- *Note*: this, however, does not matter since our goal is **not** to compare our modules with MalConv / Ember directly but to improve them. For this reason, it is even better to have original parameters. The main takeaway -- adding multiple modules together allows boosting results drastically. At the same time, each is noticeably weaker (even the API call module, which is trained on the same distribution).
- Run GAMMA against composite solution (not just ember/malconv modules) - it looks like attacks are highly targeted. Interesting if it will be able to generate evasive samples against a complete pipeline .. (however, defining that in `secml_malware` might be painful ...)
- Work on `CompositeClassifier()` API interface:
- make it easy to take a PE sample(s) & additional document options (providing PE directory, predefined emulation report directory, etc.)
- `.update()` to overtrain network with own examples that were previously flagged incorrectly
- work without submitted `filepath` (only PE mode) - provide paths as separate argument to `.fit()`?
- Additional modules:
- (a) Autoruns checks (see Sysinternals book for a full list of registries analyzed)
- (b) network connection information
- etc.