Allow for scraping based ingestion of blogs without RSS feeds.

InQuest / ThreatIngestor

Extract and aggregate threat intelligence.

https://inquest.readthedocs.io/projects/threatingestor/

GNU General Public License v2.0

Allow for scraping based ingestion of blogs without RSS feeds. #112

Closed: pedramamini closed this issue 1 year ago

pedramamini commented 2 years ago

An increasing trend we're seeing is for folks to forgo RSS feeds on their blogs. To capture these sources, a general web scraping approach must be used. I propose we allow a URL regex to be defined per source. The regex would drive the scraping, with state detection so that previously scraped blog posts are not fetched again. Blogs to test this against include:

cisecurity.org, elastic, flashpoint.io, palo, proofpoint, recordedfuture, redcanary, secureworks, securityintelligence.com, splunk, zscaler

pedramamini commented 1 year ago

I think the way forward here is globbing. Examples:

https://www.cisecurity.org/insights/blog/*
https://www.elastic.co/blog/*
https://flashpoint.io/blog/*
https://www.paloaltonetworks.com/blog/network-security/*
...
https://www.zscaler.com/blogs/security-research/*

Hit the top-level page and pull all URLs that match the glob. If an individual URL hasn't been retrieved yet, retrieve it.
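
The glob idea can be sketched roughly as follows. This is only an illustration using the Python standard library, not code from ThreatIngestor: the names `LinkCollector` and `new_matching_urls` are hypothetical, the page HTML is passed in as a string so fetching is left out, and the "state" is an in-memory set where a real source would need to persist it between runs.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the top-level page URL.
                self.links.append(urljoin(self.base_url, value))


def new_matching_urls(html, base_url, pattern, seen):
    """Return URLs on the page that match the glob and are not yet in `seen`.

    `seen` is updated in place so a later call skips URLs already returned.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    fresh = []
    for url in collector.links:
        if fnmatchcase(url, pattern) and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Called with the top-level page's HTML, its URL, a glob such as `https://flashpoint.io/blog/*`, and a set of already retrieved URLs, this returns only the posts that still need to be fetched. Note that `*` in `fnmatchcase` also matches `/`, so the glob covers nested paths under the prefix.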

battleoverflow commented 1 year ago

This can be achieved with the new sitemap source, which parses the entire sitemap to find all available blog posts.

sources:
  - name: inquest-blog
    module: sitemap
    url: https://inquest.net/sitemap.xml
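
For readers unfamiliar with sitemaps, the following is a rough sketch of what "parsing the sitemap" amounts to. It is not the ThreatIngestor sitemap module; `sitemap_urls` is a hypothetical helper, and it assumes a standard sitemap document that uses the sitemaps.org 0.9 namespace and has already been downloaded as text.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]
```

The resulting list could then be filtered (for example, by a path prefix such as `/blog/`) and compared against previously seen URLs, which covers the same ground as the glob proposal without scraping HTML.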