Hi guys,
Have you got any other ideas on how to cut down on duplicates without using Last-Modified/If-Modified-Since or ETag/If-None-Match? I'm trying to use the web.py source, but some HTTP responses don't include the above headers.
I was thinking of creating a shasum of the content of a page, saving it as the saved_state, and checking it later to see whether there are any new items. However, this would only work if you are scraping one page.
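One way the shasum idea might stretch to more than one page is to keep one digest per URL instead of a single digest. Below is a minimal sketch of that, assuming saved_state can hold a plain dict; the names page_digest and changed_pages are hypothetical and not part of web.py, and fetching the pages is left out.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(content).hexdigest()


def changed_pages(pages, saved_state):
    """Compare each page against the digest stored for its URL.

    pages: dict mapping URL -> page body as bytes (already fetched).
    saved_state: dict mapping URL -> hex digest from the previous run
                 (assumed shape, not something web.py defines).
    Returns (list of URLs whose content changed or is new, updated state).
    """
    new_state = dict(saved_state)  # copy so the caller's dict is not mutated
    changed = []
    for url, content in pages.items():
        digest = page_digest(content)
        if saved_state.get(url) != digest:
            changed.append(url)
            new_state[url] = digest
    return changed, new_state


if __name__ == "__main__":
    pages = {"http://example.com/a": b"first", "http://example.com/b": b"second"}
    changed, state = changed_pages(pages, {})
    print(changed)  # both URLs are new on the first run
    changed, state = changed_pages(pages, state)
    print(changed)  # nothing changed on the second run
```

A whole-page digest changes whenever anything on the page changes (a timestamp, an ad), so hashing each extracted item rather than the full body would probably give fewer false positives.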