Does the webcrawler take the `lastmod` node into account?

DeuxHuitHuit / algolia-webcrawler

Simple node worker that crawls sitemaps in order to keep an algolia index up-to-date

https://www.npmjs.org/package/algolia-webcrawler

Does the webcrawler take the `lastmod` node into account? #15

Open mrtnvh opened 7 years ago

mrtnvh commented 7 years ago

Does the webcrawler take the `<lastmod>` node, or URLs already present in the index, into account when running?

Current problem: We have a website with 2000+ URLs that needs to be checked daily for new URLs and changed content. The idea is to run the webcrawler daily as a scheduled task.

If every URL is updated in the index on each daily run, we could hit our monthly operations limit rather quickly.
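To make the `<lastmod>` idea concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not something the crawler does today: `SitemapEntry`, `entriesToCrawl` and the `lastRun` timestamp are hypothetical names, and it assumes the sitemap has already been parsed into plain objects.

```typescript
// Hypothetical shape of a parsed sitemap <url> entry (not the crawler's own type).
interface SitemapEntry {
  loc: string;
  lastmod?: string; // W3C datetime, e.g. "2018-03-01" or "2018-03-01T12:00:00Z"
}

// Returns the entries that should be re-crawled: those modified after
// `lastRun`, plus any entry whose lastmod is missing or unparseable
// (we cannot prove those are unchanged, so we keep them).
function entriesToCrawl(entries: SitemapEntry[], lastRun: Date): SitemapEntry[] {
  return entries.filter((entry) => {
    if (!entry.lastmod) return true;
    const modified = Date.parse(entry.lastmod);
    if (isNaN(modified)) return true;
    return modified > lastRun.getTime();
  });
}
```

This only helps if the site keeps `<lastmod>` accurate, and the time of the previous run would have to be stored somewhere between scheduled runs.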

nitriques commented 7 years ago

It's not an implemented feature right now, but I've been interested in the idea. What I had in mind was to set the HTTP request headers properly and get a 304 back, but that would require storing the data somewhere locally, to avoid an extra request to Algolia.
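Roughly, the 304 approach could look like the sketch below. Nothing here exists in the crawler: `Validators`, `conditionalHeaders` and `handleResponse` are made-up names, and the code assumes response header names arrive lower-cased, as Node's `http` module delivers them. `If-None-Match` and `If-Modified-Since` are the standard HTTP conditional request headers.

```typescript
// Validators remembered from the previous crawl of a URL (hypothetical local store).
interface Validators {
  etag?: string | undefined;
  lastModified?: string | undefined;
}

// Builds the conditional request headers from what was stored last time.
function conditionalHeaders(v: Validators | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (v && v.etag) headers["If-None-Match"] = v.etag;
  if (v && v.lastModified) headers["If-Modified-Since"] = v.lastModified;
  return headers;
}

// Decides whether to re-index and which validators to remember for next time.
function handleResponse(
  status: number,
  responseHeaders: Record<string, string | undefined>,
  previous: Validators | undefined
): { reindex: boolean; validators: Validators | undefined } {
  if (status === 304) {
    // Not Modified: keep the old validators, skip parsing and the Algolia update.
    return { reindex: false, validators: previous };
  }
  // Any other status goes through the normal crawl path.
  return {
    reindex: true,
    validators: {
      etag: responseHeaders["etag"],
      lastModified: responseHeaders["last-modified"],
    },
  };
}
```

The validators would have to be persisted per URL between runs (a JSON file would probably do), which is the local storage mentioned above.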

But we could skip just the update part even when we do not get a 304 from the server (which I had never thought about). I would be happy to review any PR you might want to send and to help with questions, but I could not tell you when I would have the time to code it myself ;)
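For the "skip only the update" variant, one possible sketch is to fingerprint the extracted content and compare it with the fingerprint from the previous run. This is just an illustration under assumptions: `fingerprint` and `ChangeTracker` are hypothetical names, and the hash is a small 32-bit FNV-1a style hash over UTF-16 code units, picked only to keep the example dependency-free. A real implementation would more likely use a proper digest.

```typescript
// 32-bit FNV-1a style hash over the string's UTF-16 code units, as hex.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, to stay within 32 bits.
    hash += (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24);
    hash = hash >>> 0;
  }
  return hash.toString(16);
}

// Remembers the last indexed fingerprint per URL (in memory only here).
class ChangeTracker {
  private seen: { [url: string]: string } = {};

  // True when the content differs from what was last remembered for this URL.
  hasChanged(url: string, content: string): boolean {
    return this.seen[url] !== fingerprint(content);
  }

  // Call after the index update succeeded, so a failed update is retried next run.
  remember(url: string, content: string): void {
    this.seen[url] = fingerprint(content);
  }
}
```

The page is still fetched every time, so this saves Algolia operations but not HTTP requests. As with the 304 idea, the fingerprints would need to be written to disk between scheduled runs, and a 32-bit hash can collide, so a rare missed update is possible.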