Open vietvo89 opened 3 years ago
We used the raw binary dataset for EMBER. Since we partnered with Elastic on this research, that was not too much of an issue. If your research institution can get a VirusTotal license, you can get all the raw EMBER files that way. There are also some pre-trained weights in the repo.
If you need to train from scratch, I'd recommend switching to malware family classification. I would expect the results to be highly comparable. You could use VirusShare + @seymour1's labeling project https://github.com/seymour1/label-virusshare + AVClass to get a bunch of families. The new Sophos dataset https://github.com/sophos-ai/SOREL-20M is also an option. While they do not make the benign files available, the malicious ones are, and they have some functionality/family type information available to use as well.
Thanks Raff, I just found the website for the Sophos dataset today. I think there are plenty of ways to collect malware, but it seems that the malware research community does not use a common public dataset to compare and benchmark. So various researchers have their own datasets, which take time to collect, and each has to report on the reliability of that dataset. It is not like the vision domain, where researchers have several large and reliable public datasets. EMBER is fantastic as a large and reliable public dataset, but many studies require raw binary files, and this could limit progress.
I have access to download malware from VirusTotal, but they do not allow me to query and download by hash since I have academic access only. I found some pages like http://www.portablefreeware.com/ to download benign software manually, but if I need thousands of samples, that could be a big problem.
> seems that in malware research community will not use some public dataset to compare and benchmark.
It's not that people don't want to; it's a legal problem, and they generally can't. A good representative benign corpus has lots of executable programs that people install in different environments, develop internally, and more. But in every one of those cases, the executable is usually either: 1) a product that is sold for a fee, which the owner would not want distributed for free, or 2) an intrinsically internal tool or product, which may or may not be considered proprietary, and which the owner would not want distributed. In either case, copyright laws apply, and the data just can't be shared. It's a huge challenge within this field that has only recently started to see better progress with datasets like EMBER and SOREL-20M, but we've got a long way to go.
> I found some pages like http://www.portablefreeware.com/ to download benign software manually but if I need thousands of samples, it could be a big problem.
Unfortunately, sources like that will not get you anywhere near the number of executables you need, or produce a representative corpus that generalizes to real-world data. This is actually something I investigated in my first paper.
Thanks Raff, I see your points. I just read this paper yesterday and now realize that you are its author too. I was surprised that you could get so many MS Windows files. Anyway, for any research, data is the first crucial step to getting somewhere. Due to some limitations, I cannot obtain thousands of benign samples in an effective way. If you have figured out a way to collect a large number of benign samples from somewhere, or if you can release your dataset for research purposes, it would be great for me and would benefit my research, because this is my PhD topic right now and I want to work on both the attack and defense sides.
I don’t have that; if I did, I would write a new paper on it :)
But that’s what I was saying in my first reply. Instead of attacking benign vs. malicious, use malware family labels to create “family A vs. family B”, or some mix of one family vs. the complement set. That way you can build a corpus for free that has all the fundamental intricacies, and test a “benign” vs. “malicious” style problem with complex data.
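As a rough illustration of the “family A vs. family B” (or family vs. complement) idea, here is a minimal sketch. It assumes you already have a mapping from file hash to family name, for example one derived from AVClass output; the function name `build_family_task`, its parameters, and the hashes and family names are all hypothetical placeholders, not anything from the MalConv2 repo.

```python
from collections import Counter


def build_family_task(labels, positive_family, negative_family=None, min_samples=1):
    """Turn {hash: family} labels into a binary task.

    positive_family is labelled 1. If negative_family is given, only that
    family is labelled 0; otherwise every other family (the complement set)
    is labelled 0. Families with fewer than min_samples files are dropped.
    """
    counts = Counter(labels.values())
    dataset = []
    for sample_hash, family in sorted(labels.items()):
        if counts[family] < min_samples:
            continue  # too few samples to be useful
        if family == positive_family:
            dataset.append((sample_hash, 1))
        elif negative_family is None or family == negative_family:
            dataset.append((sample_hash, 0))
    return dataset


# Toy labels; in practice these would come from your own labeling step.
labels = {
    "hash_a1": "zbot",
    "hash_a2": "zbot",
    "hash_b1": "virut",
    "hash_c1": "sality",
}

a_vs_b = build_family_task(labels, "zbot", "virut")   # family A vs. family B
a_vs_rest = build_family_task(labels, "zbot")         # family A vs. complement
```

The resulting (hash, label) pairs can then stand in for the usual malicious/benign labels when training and attacking a detector on raw bytes.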
:)) I see your point. Let me take a closer look at your recommendation.
Hello
In your paper, it seems you use only EMBER for PE files, while using Common Crawl to collect benign PDF files and VirusShare for malicious ones. For the EMBER dataset, did you use the raw binary files or the extracted features provided in the GitHub repo? EMBER as released does not contain raw binary files, but I think MalConv and your proposed model need raw binary files. I am doing research on the attack side, so I need to train a malware detection model.
Thanks