Open rothoma2 opened 3 months ago
To avoid barriers like CAPTCHAs, we should really consider a dedicated tool such as Bright Data. Some YouTubers offer 10 dollars of free credit to try it, so it is worth a look.
@poneoneo maybe we should first check whether these pages have a CAPTCHA or rate limit. I think the project would be severely limited if we had to depend on a pay-per-use service such as Bright Data.
OK @rothoma2, I will check this out. Maybe tools like Playwright or Selenium will be enough to behave like a real browser and overcome the user-agent and CAPTCHA barriers.
SeleniumBase works fine for this with its UC (undetected-chromedriver) mode.
Cool, maybe someone can upload some base code and we start extending from there.
The code below bypasses all checks on app.any.run, but it can only go as far as page 5 because they restrict access beyond that. The actual scraping of the extracted rows still needs to be implemented.
from seleniumbase import SB
import time
import random

# uc=True starts Chrome in undetected-chromedriver (UC) mode
with SB(uc=True) as sb:
    print("Entering Website")
    # sb.open() loads a URL; open_html_file() only accepts local .html files
    sb.open("https://app.any.run/submissions/")
    sb.click("#history-filterBtn")
    sb.click("div.btn-group:nth-child(1) > button:nth-child(1)")
    # random.uniform gives a real random delay; randrange(2,3) always returns 2
    time.sleep(random.uniform(0, 2))
    sb.click("div.btn-group:nth-child(1) > div:nth-child(2) > ul:nth-child(1) > li:nth-child(1) > a")
    sb.click("#historySearchBtn")
    time.sleep(random.uniform(2, 3))
    # Pages beyond 5 are restricted by the site
    for i in range(5):
        time.sleep(random.uniform(1, 2))
        soup = sb.get_beautiful_soup()
        extracted_rows = soup.select("div.history-table--content__row")
        # I haven't done the bs part yet. Something like this.
        sb.click(".history-pagination__next")
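For the BeautifulSoup part that is still missing, here is a minimal sketch of one possible starting point. Only the row selector comes from the snippet above. The function name `rows_to_text` and the sample markup are made up for illustration, and the real layout inside each row on app.any.run may well differ, so this only pulls the visible text out of every row.

```python
from bs4 import BeautifulSoup

# Selector taken from the snippet above
ROW_SELECTOR = "div.history-table--content__row"


def rows_to_text(html):
    """Return the visible text of every submission row, one string per row."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text(" ", strip=True) joins the text fragments of a row with spaces
    return [row.get_text(" ", strip=True) for row in soup.select(ROW_SELECTOR)]


# Made-up markup for illustration only; the real page layout is an assumption.
SAMPLE_HTML = """
<div class="history-table--content__row"><span>invoice.docm</span><span>Malicious activity</span></div>
<div class="history-table--content__row"><span>setup.exe</span><span>No threats detected</span></div>
"""

if __name__ == "__main__":
    for line in rows_to_text(SAMPLE_HTML):
        print(line)
```

Inside the loop this could probably be called with the page source on each iteration, and the per-row strings split into fields once the actual column layout is known.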
It is important to have statistics on some of the commonly observed Malicious Delivery Methods and file extensions.
Requirements.
A web scraper tool that scrapes publicly disclosed information from several sources (malware sandbox sites) and aggregates it to produce statistics (such as file extension, malware family, etc.).
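As a rough sketch of how this requirement could be structured, each scraper might emit records in one shared shape so that statistics can be computed the same way regardless of the source. The `SampleRecord` fields, the `aggregate` helper, and the source names below are all hypothetical and not part of any existing code. The sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """One scraped submission, normalized across sources (hypothetical shape)."""
    source: str     # which sandbox site the record came from
    file_name: str  # file name as reported by the site
    family: str     # malware family label, or "unknown"


def aggregate(records, field):
    """Count how often each value of `field` appears across all records."""
    return Counter(getattr(record, field) for record in records)


# Made-up records standing in for the output of two scrapers
records = [
    SampleRecord("sandbox_a", "invoice.docm", "Emotet"),
    SampleRecord("sandbox_a", "setup.exe", "AgentTesla"),
    SampleRecord("sandbox_b", "order.exe", "AgentTesla"),
]

if __name__ == "__main__":
    print(aggregate(records, "family").most_common())
    print(aggregate(records, "source").most_common())
```

Keeping the record shape independent of any one site means a new source only needs a scraper that produces `SampleRecord` objects.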
Sources
Things to explore.
Example.
Collect the last 10K malicious files (for Windows) reported on each site, and aggregate them per file extension.
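The aggregation step in this example could look roughly like the sketch below. It assumes the scraper has already produced a plain list of file names. The helper name `count_extensions` and the sample names are made up, and files without an extension are grouped under a "(none)" label as an arbitrary choice.

```python
from collections import Counter
from pathlib import PureWindowsPath


def count_extensions(file_names):
    """Count file names per lowercase extension, e.g. '.exe'."""
    counts = Counter()
    for name in file_names:
        # PureWindowsPath parses Windows-style names on any platform
        ext = PureWindowsPath(name).suffix.lower() or "(none)"
        counts[ext] += 1
    return counts


# Made-up file names for illustration only
samples = ["invoice.docm", "Setup.EXE", "payload.exe", "report.pdf", "dropper"]

if __name__ == "__main__":
    for ext, n in count_extensions(samples).most_common():
        print(ext, n)
```

Lowercasing the suffix keeps `Setup.EXE` and `payload.exe` in the same bucket. Note that only the last suffix is used, so a name like `archive.tar.gz` would be counted under `.gz`.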