davidfstr / Crystal-Web-Archiver

Downloads websites for long-term archival.
http://dafoster.net/projects/crystal-web-archiver

Support incremental redownload of sites with new page versions #80

Open davidfstr opened 2 years ago

davidfstr commented 2 years ago

Since Crystal's inception it's been a long-term design goal not just to allow websites to be downloaded once, but to support downloading them multiple times at different points in time.

For example: It should be possible for me to download the entire xkcd site (with all comics) today, and then next week efficiently download the new set of comics that have appeared since then.

Designing an easy-to-use workflow to support this scenario is not trivial. Many challenges are involved.

Related Tasks:


† "Manual workflow v1" is problematic:

Therefore "Manual workflow v2" has been introduced as an alternative.

davidfstr commented 2 years ago

Site-provided staleness/freshness signals

ETag & Last-Modified Challenges

In theory sites may advertise for each page - using the ETag and Last-Modified HTTP headers - whether its content has changed since it was last fetched by a browser. But there are many issues:

For example the xkcd site - which is good enough to generate an ETag at all - makes an ETag that looks like "62e1f036-1edc":
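
An ETag of this shape can be picked apart with a short sketch. It assumes the value follows the nginx-style convention of `<hex mtime>-<hex content length>`; that is an assumption about the server, not something confirmed in this thread, and the helper name `decode_nginx_style_etag` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple


def decode_nginx_style_etag(etag: str) -> Tuple[datetime, int]:
    """Split an ETag of the assumed form "<hex mtime>-<hex size>".

    Returns (file modification time in UTC, content length in bytes).
    Raises ValueError if the value does not have that shape.
    """
    value = etag.strip()
    if value.startswith("W/"):  # drop the weak-validator prefix if present
        value = value[2:]
    value = value.strip('"')  # ETags are quoted strings on the wire
    mtime_hex, size_hex = value.split("-")  # ValueError unless exactly two parts
    mtime = datetime.fromtimestamp(int(mtime_hex, 16), tz=timezone.utc)
    return mtime, int(size_hex, 16)


if __name__ == "__main__":
    modified, size = decode_nginx_style_etag('"62e1f036-1edc"')
    print(modified.isoformat(), size)
```

Under that assumption the example value decodes to a modification time of 2022-07-28 02:11:02 UTC and a length of 7900 bytes. An ETag built this way changes whenever the file is rewritten on disk, even if the bytes are identical.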

Sitemap Challenges

Sites may have a sitemap.xml that they generate for search engines. These sitemaps do provide a per-page "last modified" date, but it is common for a sitemap to set ALL pages to the same date: the most recent time the entire site was regenerated.
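
A rough way to detect this situation is sketched below. It is a heuristic, not anything Crystal is known to implement: the function names and the 90% threshold are made up for illustration, and it only handles a plain `<urlset>` sitemap (not a sitemap index).

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, Optional

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_by_url(sitemap_xml: str) -> Dict[str, Optional[str]]:
    """Map each <loc> in a sitemap to its <lastmod> text (None if absent)."""
    root = ET.fromstring(sitemap_xml)
    result: Dict[str, Optional[str]] = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        if not loc:
            continue
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        result[loc.strip()] = lastmod.strip() if lastmod else None
    return result


def lastmod_looks_uninformative(
    lastmods: Dict[str, Optional[str]], threshold: float = 0.9
) -> bool:
    """Guess whether <lastmod> just reflects a whole-site regeneration.

    Compares only the leading YYYY-MM-DD part of each value and reports True
    when one calendar date covers at least `threshold` of the dated pages.
    """
    dates = [value[:10] for value in lastmods.values() if value]
    if len(dates) < 2:
        return False  # too little data to say anything
    _, count = Counter(dates).most_common(1)[0]
    return count / len(dates) >= threshold
```

When the check returns True, the sitemap dates say little about which individual pages changed, so another signal is needed.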

RSS/Atom Challenges

Sites may have an RSS/Atom feed that references post-like pages on the site. The "post date" given for a particular page is usually a stable value that maps to when the post-like page was created. However, that date is frequently not updated when an existing post-like page's content is modified.
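
To make the limitation concrete, here is a minimal sketch of using a feed to find posts published since the last download. It assumes an RSS 2.0 feed with `<item>`, `<link>`, and `<pubDate>` elements (Atom would need different element names), and `posts_published_since` is a hypothetical helper, not part of Crystal.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import List


def posts_published_since(feed_xml: str, last_download: datetime) -> List[str]:
    """Return links of RSS items whose pubDate is later than last_download.

    `last_download` must be timezone-aware. Items without a link or a
    pubDate are skipped.
    """
    root = ET.fromstring(feed_xml)
    links: List[str] = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if not link or not pub_date:
            continue
        published = parsedate_to_datetime(pub_date.strip())
        if published.tzinfo is None:
            # Dates without a usable zone are treated as UTC (an assumption)
            published = published.replace(tzinfo=timezone.utc)
        if published > last_download:
            links.append(link.strip())
    return links
```

This finds newly created posts but, for the reason given above, says nothing about older posts whose content was edited after publication.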