PacktPublishing / Machine-Learning-for-Cybersecurity-Cookbook

Machine Learning for Cybersecurity Cookbook, published by Packt
MIT License

Data Source MalGAN #7

Open hwhitt opened 4 years ago

hwhitt commented 4 years ago

Hi,

I am trying to use your MalGAN code. What data source are you using? The notebook needs input files, so I tried unzipping the archives under /Chapter03/Resources. However, the vast majority of those files are oversized (longer than max_len), so they are skipped. Both the malware and the benign samples also return log.pred values close to 1.

Where is the correct data source to use? Also, is there any guidance on how to interpret the loss versus the pred after running inference? Both come back close to 1 for me on all non-NaN samples.

Example runtime output:

FILE: ./Malware1/6bc78d0cece84bc83a1c208b833a1f6ac42c42888873d1031236f0c64c31a580.json file length: 2129280 , Exceed max length ! Ignored ! original score: 0.9998659

FILE: ./Malware1/5bdbecc205fbbf750721439741be0d99b43121c2bc18d703eff1f9a2aeb1b01f.json file length: 27461 pad length: 2746 loss: nan score: nan original score: 0.97386575

and

FILE: ./Benign1/14a14f00714b33b888a3700e78672cc33f1fe6551222e737c668ddeea7f39532.json file length: 84042 , Exceed max length ! Ignored ! original score: 0.99981195

FILE: ./Benign1/14c14215c162cb1f379c7af0eda8dbda25d233a9f3ef93efb4082d716d7c805b.json file length: 31448 pad length: 3144 loss: nan score: nan original score: 0.99534875

FILE: ./Benign1/0d588c2fcb6695d9260ab8a447abd19a32c302a248d2b57c10b4c6eb25d5a95e.json file length: 54 pad length: 5 loss: 0.9297062 score: 0.95742595 original score: 0.24014717
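To see how many files a given length cap would drop before running the notebook, a small triage script can help. This is a minimal sketch, not code from the repository: `split_by_length` and its `max_len` argument are hypothetical names, the cap value has to be taken from the notebook's own configuration, and it assumes the reported "file length" is the file size in bytes and that the cap is inclusive.

```python
from pathlib import Path


def split_by_length(directory, max_len):
    """Partition the files in `directory` by size in bytes.

    Returns (kept, skipped): files whose size is <= max_len, and files over it.
    `max_len` is an assumption here; use the value configured in the notebook.
    """
    kept, skipped = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        size = path.stat().st_size
        (kept if size <= max_len else skipped).append((path.name, size))
    return kept, skipped
```

Calling `split_by_length("./Malware1", max_len)` and comparing `len(kept)` with `len(skipped)` shows at a glance whether most of a directory is being ignored.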

jeanimal commented 4 years ago

This does not explain how to get the recipe working, but I think I understand why performance is so poor. The technique is designed to analyze static executable files, whereas the data in the Resources directory is behavioral data from dynamic analysis.

First, I tracked down the original source of the data in the Resources directory and copied over the descriptions:

https://www.kaggle.com/goorax/dynamic-analysis-of-android-malware-of-2017 "Over 4000 malicious apps dynamically analyzed on LG Nexus 5 device farm (API 23)"

https://www.kaggle.com/goorax/dynamic-analysis-of-android-benign-apps-of-2017 "Over 4300 benign apps dynamically analyzed on LG Nexus 5 device farm (API 23)"

These files contain behavioral data produced by dynamic analysis, not the apps' executable bytes.

But the technique used in the MalGAN recipe is based on https://github.com/j40903272/MalConv-keras. That repository links to this study, which performs static analysis of executable files: https://arxiv.org/abs/1710.09435. The authors explain how "raw bytes presents a sequence problem" and how their technique "allows interpretable sub-regions of the binary to be identified". Behavioral data is not raw bytes, so none of that applies. It is not clear why the provided MalGAN code should be effective on behavioral data, so I'm not surprised at the poor results.
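One quick way to confirm this kind of mismatch is to look at the first bytes of each input file. The sketch below is only an illustration under my own assumptions: `guess_file_kind` is a hypothetical helper, not part of the cookbook or MalConv-keras, and it only checks a few well-known signatures (`MZ` for Windows PE, `\x7fELF` for ELF, `PK\x03\x04` for ZIP-based formats such as APK) before trying to parse the content as JSON.

```python
import json


def guess_file_kind(data: bytes) -> str:
    """Roughly classify file contents by their leading bytes."""
    if data[:2] == b"MZ":
        return "pe-executable"   # DOS/PE header magic
    if data[:4] == b"\x7fELF":
        return "elf-executable"  # ELF header magic
    if data[:4] == b"PK\x03\x04":
        return "zip-archive"     # APK files are ZIP archives
    try:
        # Fall back to checking whether the whole file is valid JSON text.
        json.loads(data.decode("utf-8"))
        return "json"
    except (UnicodeDecodeError, ValueError):
        return "unknown"
```

Reading a file with `open(path, "rb").read()` and passing the bytes to this function should report `"json"` for dynamic-analysis reports, which would indicate they are not the raw binaries a byte-level model expects.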

Sadly j40903272 does not provide the data their study was based on, save for this one example file: https://github.com/j40903272/MalConv-keras/tree/master/saved/adversarial_samples

So I do not yet know of an appropriate data source for this technique, that is, where to get a corpus of labelled executables (benign vs. malware).