Malware - Stats - Githubissues

Anti-Malware-Alliance / malware-stats

Tool to Collect and Generate Malware Statistics from Different Sources


Malware - Stats #1

Open rothoma2 opened 3 months ago

rothoma2 commented 3 months ago

It is important to have statistics on some of the commonly observed Malicious Delivery Methods and file extensions.

Requirements.

A web scraper tool that collects publicly disclosed information from several sources (malware sandbox sites) and aggregates it to produce statistics such as file extensions, malware families, etc.

Sources

Things to explore.

Number of pages that can be scraped before a rate limit or CAPTCHA kicks in.
Parse HTML pages and extract valuable data.
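
One way to explore the first point is to count how many consecutive pages load before a block shows up. The sketch below is only an illustration: `pages_before_block`, `fake_fetch`, and the marker strings are hypothetical names, and the real request logic (plain HTTP, Selenium, or anything else) is left to the caller as a `fetch` function, so nothing here talks to an actual site.

```python
from typing import Callable, Tuple

# Hypothetical marker strings; each real site will need its own.
CAPTCHA_MARKERS = ("captcha", "are you a robot")


def pages_before_block(fetch: Callable[[int], Tuple[int, str]],
                       max_pages: int = 100) -> int:
    """Return how many consecutive pages load before a block is detected.

    `fetch(page)` is supplied by the caller and returns (status_code, html).
    A page counts as blocked on HTTP 403/429 or when a CAPTCHA marker
    appears in the body.
    """
    for page in range(1, max_pages + 1):
        status, html = fetch(page)
        body = html.lower()
        blocked = status in (403, 429) or any(m in body for m in CAPTCHA_MARKERS)
        if blocked:
            return page - 1
    return max_pages


def fake_fetch(page: int) -> Tuple[int, str]:
    """Stand-in for a real request: pages 1-5 load, page 6 is rate limited."""
    if page <= 5:
        return 200, "<html>results</html>"
    return 429, "Too Many Requests"


print(pages_before_block(fake_fetch))
```

Keeping the fetch function separate means the same probe can be pointed at each sandbox site in turn, with polite delays added inside `fetch`.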

Example.

Collect the last 10K malicious files (for Windows) reported on each site, and aggregate them per file extension.
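
As a rough sketch of the aggregation step, assuming the scraper has already produced a list of reported file names: the helper names `extension_of` and `count_extensions` and the sample names are made up for illustration. Only the last suffix is counted, so a name like `invoice.pdf.exe` is grouped under `.exe`.

```python
from collections import Counter
from pathlib import PureWindowsPath
from typing import Iterable


def extension_of(file_name: str) -> str:
    """Return the lower-cased final extension, or "(none)" if there is none."""
    suffix = PureWindowsPath(file_name).suffix.lower()
    return suffix or "(none)"


def count_extensions(file_names: Iterable[str]) -> Counter:
    """Count how many file names fall under each extension."""
    return Counter(extension_of(name) for name in file_names)


# Made-up sample names standing in for scraped data.
sample = ["invoice.pdf.exe", "Report.DOCX", "setup.exe", "payload.js", "README"]

for ext, count in count_extensions(sample).most_common():
    print(ext, count)
```

Counters from different sites can be added together with `+` to get combined totals.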

poneoneo commented 3 months ago

To avoid barriers like CAPTCHAs, we should really think about a dedicated tool like the Bright Data website. Some YouTubers offer 10 dollars of free credit to try it; you should take a look at this.

rothoma2 commented 3 months ago

@poneoneo maybe we should first check whether these pages have a CAPTCHA or rate limit. I think the project would be severely limited if we needed to depend on a pay-per-use service such as the Bright Data website.

poneoneo commented 3 months ago

OK @rothoma2, I will check this out. Maybe tools like Playwright or Selenium will be enough to behave like a real browser and overcome the user-agent and CAPTCHA barriers.

Ohnoimded commented 3 months ago

SeleniumBase is working fine for it with the UC driver.

rothoma2 commented 3 months ago

Cool, maybe someone can upload some base code and we start extending from there.

Ohnoimded commented 3 months ago

The code will bypass all checks on app.any.run, but it can only get up to page 5, as going further is restricted by them. The actual scraping part for the extracted rows still needs to be implemented.

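from seleniumbase import SB
import time
import random

# uc=True starts SeleniumBase in undetected-chromedriver (UC) mode.
with SB(uc=True) as sb:
    print("Entering Website")
    # open() is the method for URLs; open_html_file() is meant for local files.
    sb.open("https://app.any.run/submissions/")
    # Open the filter panel and pick the first option in the first dropdown.
    sb.click("#history-filterBtn")
    sb.click("div.btn-group:nth-child(1) > button:nth-child(1)")
    # uniform() gives a real random pause; randrange(2, 3) always returned 2.
    time.sleep(random.uniform(0, 2))
    sb.click("div.btn-group:nth-child(1) > div:nth-child(2) > ul:nth-child(1) > li:nth-child(1) > a")
    sb.click("#historySearchBtn")
    time.sleep(random.uniform(2, 3))
    # Only the first 5 pages are reachable; the site restricts going further.
    for i in range(5):
        time.sleep(random.uniform(1, 2))
        soup = sb.get_beautiful_soup()
        extracted_rows = soup.select("div.history-table--content__row")
        # I haven't done the bs part yet. Something like this.
        sb.click(".history-pagination__next")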
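
The parsing step left open in the snippet above could look roughly like the following. This is a sketch under assumptions, not the project's implementation: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies, the names `RowTextParser` and `extract_rows` are hypothetical, and the markup inside each row (the `name` and `verdict` cells in the sample) is invented, since the real page structure is not shown in this thread. Only the row class name comes from the selector used above.

```python
from html.parser import HTMLParser

ROW_CLASS = "history-table--content__row"  # class name from the selector above


class RowTextParser(HTMLParser):
    """Collect the visible text fragments of every <div> carrying ROW_CLASS."""

    def __init__(self):
        super().__init__()
        self.rows = []    # one list of text fragments per matched row
        self._depth = 0   # <div> nesting depth inside the current row (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested <div> inside a row
        elif ROW_CLASS in classes:
            self._depth = 1           # a new row starts here
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.rows[-1].append(text)


def extract_rows(html):
    """Return a list of rows, each a list of the text fragments found in it."""
    parser = RowTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample markup; the real inner structure must be checked on the site.
sample_html = """
<div class="history-table--content__row">
  <div class="name">invoice.pdf.exe</div><div class="verdict">Malicious</div>
</div>
<div class="history-table--content__row">
  <div class="name">setup.msi</div><div class="verdict">Suspicious</div>
</div>
"""

print(extract_rows(sample_html))
```

In the loop above, the page source could be passed to `extract_rows` once per page (for example via `sb.get_page_source()`), or the same selection could be done directly on the `soup` object.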