ScottMansfield / widow

Distributed, asynchronous web crawler
GNU Lesser General Public License v2.1

Add support for If-Modified-Since and ETag headers #3

Open ScottMansfield opened 9 years ago

ScottMansfield commented 9 years ago

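A server can return an ETag header, an opaque token identifying the current version of a page, which the crawler can send back on later requests (in If-None-Match) for comparison. If-Modified-Since asks the server to return the page only if it has been modified since the last access. When nothing has changed, the server can answer 304 Not Modified without a body, so both reduce load on the servers during crawling.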

If-Modified-Since: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25
ETag: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.19
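
A rough sketch of how this could look on the fetch side, using only the JDK's `HttpURLConnection`. Nothing here is taken from widow's code: the `ConditionalFetch` and `Validators` names and all method names are hypothetical, and it assumes the crawler stores the `ETag` and `Last-Modified` values from the previous response for each URL.

```java
import java.net.HttpURLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of widow or crawler-commons.
final class ConditionalFetch {

    /** Validators remembered from the previous fetch of a URL; either may be null. */
    static final class Validators {
        final String etag;          // value of the ETag response header, quotes included
        final String lastModified;  // value of the Last-Modified response header

        Validators(String etag, String lastModified) {
            this.etag = etag;
            this.lastModified = lastModified;
        }
    }

    /** Builds the conditional request headers for a URL that was fetched before. */
    static Map<String, String> conditionalHeaders(Validators v) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (v == null) {
            return headers; // first visit: plain unconditional GET
        }
        if (v.etag != null) {
            headers.put("If-None-Match", v.etag);
        }
        if (v.lastModified != null) {
            headers.put("If-Modified-Since", v.lastModified);
        }
        return headers;
    }

    /** Applies the conditional headers to a connection that has not been opened yet. */
    static void apply(HttpURLConnection conn, Validators v) {
        for (Map.Entry<String, String> e : conditionalHeaders(v).entrySet()) {
            conn.setRequestProperty(e.getKey(), e.getValue());
        }
    }

    /** Reads the validators to store for the next crawl of the same URL. */
    static Validators fromResponse(HttpURLConnection conn) {
        return new Validators(conn.getHeaderField("ETag"),
                              conn.getHeaderField("Last-Modified"));
    }

    /** True when the server reported that the stored copy is still current. */
    static boolean isNotModified(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }
}
```

On a 304 the crawler would skip parsing and keep its stored copy and validators; on a 200 it would process the body and store the result of `fromResponse` for next time.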

ScottMansfield commented 9 years ago

crawler-commons may be usable here: https://github.com/crawler-commons/crawler-commons