InQuest / ThreatIngestor

Extract and aggregate threat intelligence.
https://inquest.readthedocs.io/projects/threatingestor/
GNU General Public License v2.0

Use HTTP 304 as saved_state to cut down on duplicates (Last-Modified/If-Modified-Since/ETag/If-None-Match) #101

Closed by danieleperera 1 year ago

danieleperera commented 4 years ago

Hi guys,

Have you got any other ideas on how to cut down on duplicates without using Last-Modified/If-Modified-Since/ETag/If-None-Match? I'm trying to use the web.py source, but some HTTP responses don't include the above headers.

I was thinking of computing a shasum of the page content, saving it as the saved_state, and comparing it on the next run to see whether there are any new items. However, this would only work if you are scraping a single page.
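
A minimal sketch of the hashing idea, for illustration only. The function names `content_fingerprint` and `check_for_change` are hypothetical and not part of ThreatIngestor, and the sketch assumes the saved_state is a plain string (or None on the first run) and that the page body is already available as bytes:

```python
import hashlib
from typing import Optional, Tuple


def content_fingerprint(body: bytes) -> str:
    """Return the SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def check_for_change(body: bytes, saved_state: Optional[str]) -> Tuple[bool, str]:
    """Compare the page's digest with the one saved on the previous run.

    Returns (changed, new_saved_state). A saved_state of None is
    treated as a first run, so the page counts as changed.
    """
    digest = content_fingerprint(body)
    return digest != saved_state, digest


# First run: nothing saved yet, so the page is processed.
changed, state = check_for_change(b"<html>v1</html>", None)

# Later run with identical content: changed is False and the page can be skipped.
changed_again, state = check_for_change(b"<html>v1</html>", state)
```

Note that any byte-level difference (a timestamp, a rotating ad, a CSRF token) changes the digest, so this probably only cuts down on duplicates for pages whose body is stable between updates.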

cmmorrow commented 4 years ago

Hey @danieleperera, I like that idea. If you want to try to get it working, I'll review the PR.