Robust Malware Classification via Deep Graph Networks on Call Graph Topologies

Description

This repository lets you reproduce the experiments of our ESANN 2021 paper:

Errica Federico, Iadarola Giacomo, Martinelli Fabio, Mercaldo Francesco, Micheli Alessio: Robust Malware Classification via Deep Graph Networks on Call Graph Topologies, European Symposium on Artificial Neural Networks (ESANN), 2021.

Requirements

Use this link to fetch the compressed dataset to be processed
PyDGN (we used PyDGN 0.5.0)

Building the dataset

Once you have unzipped the dataset file DATA_NOFEATS.zip, run the following:

1) Original dataset

`python build_dataset.py --config-file DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS.yml`

2) Obfuscated test set

`python build_dataset.py --config-file DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS_OBF_TEST.yml`
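
The two build steps above can also be chained from a small Python script. This is only a sketch: the helper names `build_commands` and `build_all` are hypothetical and not part of this repository, and the script assumes it is run from the repository root, next to `build_dataset.py`. The config paths are copied from the commands above.

```python
import subprocess
import sys

# Config paths copied verbatim from the two build commands above.
DATASET_CONFIGS = [
    "DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS.yml",
    "DATA_CONFIGS/config_CNRMalwareDataset_NOFEATS_OBF_TEST.yml",
]


def build_commands(configs=DATASET_CONFIGS, python=sys.executable):
    """Return one build_dataset.py invocation (as an argv list) per config."""
    return [[python, "build_dataset.py", "--config-file", cfg] for cfg in configs]


def build_all(dry_run=True):
    """Print each command; when dry_run is False, also run it and stop on failure."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    build_all(dry_run=True)  # switch to dry_run=False to actually build
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check the paths before starting a long build.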

Launching the experiments

The PyDGN config files are set to use a GPU (see `device: cuda` and similar entries). To run on CPUs only, pass `--max-gpus 0` and change the config files accordingly. You can also remove `--debug` to enable the CLI and exploit CPU/GPU task parallelism. Adjust the parallelism parameters (`--max-cpus`, `--max-gpus`) as you see fit.
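
As a rough illustration of the CPU-only change, the sketch below rewrites `device: cuda` entries to `device: cpu` in the text of a config file. It is an assumption-laden helper, not part of PyDGN or of this repository: the names `use_cpu` and `convert_file` are hypothetical, and only the literal `device: cuda` form mentioned above is handled. Any other GPU-related entries still have to be edited by hand.

```python
import re
from pathlib import Path

# Matches a line such as "device: cuda" or "  device: cuda:0".
# Group 1 keeps the indentation and the key so they are preserved.
_DEVICE_LINE = re.compile(r"^([ \t]*device[ \t]*:[ \t]*)cuda\S*", flags=re.MULTILINE)


def use_cpu(config_text: str) -> str:
    """Return config_text with every `device: cuda...` value set to `cpu`."""
    return _DEVICE_LINE.sub(r"\g<1>cpu", config_text)


def convert_file(src: str, dst: str) -> None:
    """Write a CPU-only copy of the config at src to dst (src is left untouched)."""
    Path(dst).write_text(use_cpu(Path(src).read_text()))
```

Writing the result to a separate file keeps the original GPU config available; the new file can then be passed via `--config-file` together with `--max-gpus 0`.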

1) Baseline

`python launch_experiment.py --config-file MODEL_CONFIGS/config_baseline.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug`

2) CGMM

**Pre-condition:** modify the config files to set the folder in which to store the intermediate graph embeddings that the classifier will use.

**Note:** the result folder may end up taking a lot of space, because intermediate outputs between layers are written there. These files are deleted after each experiment ends, but they can still cause trouble when running experiments in parallel. Consider using secondary storage for your result folder.
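
Before launching parallel runs, it can help to check how much free space the chosen result folder has. The sketch below uses only the standard library. The helper names and the threshold are hypothetical: the README does not state how much space the experiments need, so `required_gib` must be chosen by you.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str = ".", required_gib: float = 50.0) -> bool:
    """True if at least required_gib GiB are free (50.0 is an arbitrary placeholder)."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # "." stands in for the folder you pass as --result-folder.
    print(f"free space: {free_gib('.'):.1f} GiB")
```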

**Unsupervised Embedding Phase**

`python launch_experiment.py --config-file MODEL_CONFIGS/config_CGMM_Embedding.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug`

**Supervised Classifier Phase**

`python launch_experiment.py --config-file MODEL_CONFIGS/config_CGMM_Classifier.yml --splits-folder DATA_SPLITS/ --data-splits DATA_SPLITS/CG/CG_outer1_inner1.splits --data-root DATA_NOFEATS --dataset-name CG --dataset-class cnr_dataset.CNRMalwareDataset --max-cpus 4 --max-gpus 1 --final-training-runs 3 --result-folder CIML_CNR_RESULTS --debug`
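
The classifier consumes the embeddings produced by the unsupervised phase, so the two commands have to run in that order. Since they differ only in the config file, they can be generated from one shared flag list. The sketch below does that; `launch_command` and `run_cgmm` are hypothetical helper names, not part of this repository, and the flags are copied from the commands above.

```python
import subprocess
import sys

# Flags shared by all launch commands above, copied verbatim.
COMMON_ARGS = [
    "--splits-folder", "DATA_SPLITS/",
    "--data-splits", "DATA_SPLITS/CG/CG_outer1_inner1.splits",
    "--data-root", "DATA_NOFEATS",
    "--dataset-name", "CG",
    "--dataset-class", "cnr_dataset.CNRMalwareDataset",
    "--max-cpus", "4",
    "--max-gpus", "1",
    "--final-training-runs", "3",
    "--result-folder", "CIML_CNR_RESULTS",
    "--debug",
]

# Order matters: embeddings first, then the classifier that reads them.
CGMM_CONFIGS = (
    "MODEL_CONFIGS/config_CGMM_Embedding.yml",
    "MODEL_CONFIGS/config_CGMM_Classifier.yml",
)


def launch_command(config_file, python=sys.executable):
    """Build the argv list for one launch_experiment.py run."""
    return [python, "launch_experiment.py", "--config-file", config_file, *COMMON_ARGS]


def run_cgmm(dry_run=True):
    """Print both phases in order; when dry_run is False, run them and stop on failure."""
    for cfg in CGMM_CONFIGS:
        cmd = launch_command(cfg)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_cgmm(dry_run=True)
```

Because `subprocess.run(..., check=True)` raises on a non-zero exit code, the classifier phase is not started if the embedding phase fails.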

Inference on obfuscated dataset

Once you have completed the experiments, use the notebook CGMM Inference to perform inference (remember to change the experiment paths accordingly) and the notebook Confusion Matrix to output the confusion matrix.
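
For readers unfamiliar with the output of the second notebook, the sketch below shows how a confusion matrix is computed from true and predicted class indices. It is a generic, dependency-free illustration and not the notebook's code. The function names are hypothetical, and the labels in the usage example are toy values, not results from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions: rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy labels for three classes, for illustration only.
    y_true = [0, 0, 1, 2, 2]
    y_pred = [0, 1, 1, 2, 0]
    for row in confusion_matrix(y_true, y_pred, num_classes=3):
        print(row)
```

Off-diagonal entries show which classes are confused with each other, which is the information of interest when comparing the original and the obfuscated test sets.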

Troubleshooting

If you have questions, do not hesitate to contact us! If you find a bug, please open an issue on GitHub.
