pedramamini closed this 1 year ago
I think the way forward here is with globbing. Examples:
https://www.cisecurity.org/insights/blog/*
https://www.elastic.co/blog/*
https://flashpoint.io/blog/*
https://www.paloaltonetworks.com/blog/network-security/*
https://www.zscaler.com/blogs/security-research/*
Hit the top-level page and pull all URLs that match the glob; if an individual URL hasn't been retrieved yet, retrieve it.
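The glob step could be sketched roughly as follows. This is only an illustration, not code from this project: `LinkCollector`, `new_matching_urls`, and the in-memory `seen` set are hypothetical names, and fetching the page is left out so the matching logic stays visible. It uses the standard library's `fnmatchcase`, where `*` also matches `/`, so nested paths under the blog prefix match too.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute href targets from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the top-level page URL.
                self.links.append(urljoin(self.base_url, href))


def new_matching_urls(html, base_url, pattern, seen):
    """Return URLs that match the glob and are not yet in `seen`.

    `seen` is a plain set here (an assumption); a real source would
    persist it between runs. Matching URLs are added to `seen`.
    """
    parser = LinkCollector(base_url)
    parser.feed(html)
    fresh = []
    for url in parser.links:
        if fnmatchcase(url, pattern) and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Calling `new_matching_urls` a second time with the same page and the same `seen` set returns an empty list, which is the "hasn't been retrieved" check in its simplest form.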
This can be achieved with the new sitemap source, which parses the entire sitemap in search of all available blogs.
sources:
  - name: inquest-blog
    module: sitemap
    url: https://inquest.net/sitemap.xml
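For anyone curious what parsing a sitemap involves, here is a minimal sketch. It is not the actual sitemap module from this project; `sitemap_urls` is a hypothetical name, and it assumes the XML has already been downloaded and uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document.

    For a sitemap index file, the <loc> values point at child
    sitemaps instead of pages; this sketch does not follow them.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]
```

The returned list can then be filtered (for example, down to blog paths) and checked against previously retrieved URLs before fetching anything.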
An increasing trend we're seeing is for folks to forgo RSS feeds on their blogs. To capture these sources, a general web scraping approach must be used. I propose we allow for the definition of a URL regex that will be leveraged for scraping, with state detection for previously scraped blogs. A list of blogs to test this against includes:
cisecurity.org, elastic, flashpoint.io, palo, proofpoint, recordedfuture, redcanary, secureworks, securityintelligence.com, splunk, zscaler