Anti-Malware-Alliance / secret-harvest

Python Project to Automate Collection of Snippets with Leaked Secrets in Code to Build a Dataset for ML Training.
MIT License

Your Daily Dose of Malware #3

Open rothoma2 opened 7 months ago

rothoma2 commented 7 months ago

The Problem

Security analysts are in constant need of fresh malware samples. The fight against malware is largely driven by ML models that use static or dynamic analysis. This is a large field of study, and both analysts and researchers require a large number of fresh malware samples.

As malware advances, new bypass techniques are being developed in a typical cat-and-mouse game. Models need to be constantly evaluated against their real-world performance and updated.

For this, a regularly refreshed dataset is needed. Most research on this topic provides a "point-in-time view": the researcher collects samples, trains the model and publishes results as of that point in time. Afterwards, the models are not re-evaluated or retrained on a regularly collected dataset.

The Requirements

Write a Python package (a wheel, built with Poetry) that provides a Linux CLI tool. The tool connects to several data sources and collects malware samples that have been published recently (in the last 24, 48 or 72 hours).
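To make the requirement concrete, here is a minimal sketch of what the CLI entry point could look like. The program name `daily-dose`, the `--hours` and `--output-dir` flags, and the function names are all assumptions for illustration; nothing here is an agreed interface, and the actual collection step is left out.

```python
import argparse
from datetime import datetime, timedelta, timezone

# Look-back windows named in the requirements (hours).
ALLOWED_WINDOWS = (24, 48, 72)


def build_parser() -> argparse.ArgumentParser:
    # "daily-dose" and both flag names are hypothetical placeholders.
    parser = argparse.ArgumentParser(
        prog="daily-dose",
        description="Collect recently published malware samples.",
    )
    parser.add_argument(
        "--hours",
        type=int,
        choices=ALLOWED_WINDOWS,
        default=24,
        help="only collect samples published in the last N hours",
    )
    parser.add_argument(
        "--output-dir",
        default="samples",
        help="directory where collected samples would be stored",
    )
    return parser


def cutoff_time(hours, now=None):
    """Return the oldest publication time (UTC) that is still in the window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(hours=hours)


def main(argv=None):
    args = build_parser().parse_args(argv)
    cutoff = cutoff_time(args.hours)
    # A real tool would query each data source here; this sketch only reports.
    print(f"Collecting samples published after {cutoff.isoformat()} "
          f"into {args.output_dir}")
    return 0
```

Calling `main(["--hours", "48"])` parses the window and computes the cutoff timestamp. If the package is built with Poetry, `main` could be exposed as a console script, but that wiring is not shown here.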

You can draw inspiration from some previous work: https://github.com/woj-ciech/Daily-dose-of-malware (this project is old and unmaintained; I contacted the author and he advised either writing a new one, or forking and expanding the old one).

This is an initial list of where samples can be collected.

Other Requirements.

Keep the tool and script simple. It will be enhanced later.

lucifercr07 commented 7 months ago

A few pointers to clarify:

rothoma2 commented 7 months ago

Answers inline to clarify further.

1) Yes, in essence this tool is a scraper of samples published by other sources. In the future we will have other projects to collect samples from other places (torrent sites, hunting, malware sandboxes), but the easiest way to obtain fresh samples for now is to collect them from other sources.

2) For this it is also useful to think about several use cases:

I think we should support both non-API-key and API-key collection. API-key collection usually yields a higher sample count, but requiring the user to obtain API keys can act as a gatekeeper: sometimes organizations have a long or obscure process for obtaining API keys. So non-API-key collection, although restricted in volume, should be the first functionality. API-key collection can be an enhanced functionality.
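One way this split could be modelled, as a rough sketch only: each source declares whether it needs a key, and keyed sources are simply skipped when the key is absent. The `Source` class, the source names, and the environment variable name below are hypothetical and do not refer to any real feed.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    name: str
    # Name of the environment variable holding the API key.
    # None means the source can be used without any key.
    key_env_var: Optional[str] = None


def usable_sources(sources, env=None):
    """Keep key-less sources always; keep keyed sources only if a key is set."""
    if env is None:
        env = os.environ
    selected = []
    for source in sources:
        if source.key_env_var is None or env.get(source.key_env_var):
            selected.append(source)
    return selected


# Hypothetical example sources, for illustration only.
SOURCES = [
    Source("public-feed"),
    Source("keyed-feed", key_env_var="KEYED_FEED_API_KEY"),
]
```

With no key configured, `usable_sources(SOURCES, env={})` keeps only `public-feed`, so the tool still works out of the box; setting the variable enables the keyed source as an enhancement.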

3) The tool should be a Python CLI tool that triggers collection at execution time and collects samples for the last 24-48 hours, for example. Later the user can write a Bash wrapper script around it to run it from a cron job. A sample Bash script can also be provided in the repo to go along with the tool. The tool itself does not need scheduling capabilities.

4) Techniques for avoiding blocking are up to the user. The tool can implement a very simple "sleep and retry after X minutes". Other measures could be users collecting through VPNs, or other means.
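A minimal sketch of such a "sleep and retry" helper, assuming the download step is wrapped in a callable. The function name, the defaults (3 attempts, 300 seconds) and the injectable `sleep` parameter are illustrative choices, not requirements from this issue.

```python
import time


def fetch_with_retry(fetch, retries=3, wait_seconds=300.0, sleep=time.sleep):
    """Call fetch(); on failure, wait a fixed time and try again.

    `fetch` is any zero-argument callable (hypothetical download step).
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:  # deliberately broad for a simple sketch
            last_error = exc
            # No point in sleeping after the final failed attempt.
            if attempt < retries:
                sleep(wait_seconds)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

The wait is fixed rather than exponential to keep the tool simple, in line with the requirement above; anything more elaborate can be layered on later.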

5) We do plan to use this script to collect samples recurrently and share them. For our tool we are probably not looking for a portal with API keys, but other means like

Malware samples are necessary for research but can sometimes be misclassified under platform policies, so putting them on GitHub or Kaggle could run the risk of a takedown. I think the risk is low, so we should do it and also release them via torrent, which is unlikely to be censored.