davidfstr opened this issue 2 years ago
In theory, sites may advertise for each page - using the ETag and Last-Modified HTTP headers - whether its content has changed since it was last fetched by a browser. But there are many issues:
For example the xkcd site - which is conscientious enough to generate an ETag at all - produces an ETag that looks like "62e1f036-1edc".
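To make the ETag/Last-Modified mechanism concrete, here is a minimal sketch of building the conditional-request headers a revalidating downloader would send. The `conditional_headers` function and the `cached` dict shape are hypothetical illustrations, not Crystal's actual API:

```python
def conditional_headers(cached):
    # Build revalidation headers from values saved with a previous
    # download. A server that supports revalidation replies
    # "304 Not Modified" when neither value has changed.
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

# ETag value taken from the xkcd example above.
headers = conditional_headers({"etag": '"62e1f036-1edc"'})
```

In practice many sites omit one or both headers, so a downloader cannot rely on this path alone.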
Sites may have a sitemap.xml that they generate for search engines. These sitemaps do provide a per-page "last modified" date, but it is common for a sitemap to set ALL pages to the same date - the last time the entire site was regenerated.
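The sitemap pathology above is easy to detect: if every `<lastmod>` value is identical, the dates almost certainly reflect a whole-site regeneration rather than per-page edits. A stdlib-only sketch (the example URLs and sitemap snippet are made up):

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2023-01-05</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2023-01-05</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def lastmod_dates(xml_text):
    # Map each page URL to its advertised <lastmod> date.
    root = ET.fromstring(xml_text)
    return {
        url.find("sm:loc", NS).text: url.find("sm:lastmod", NS).text
        for url in root.findall("sm:url", NS)
    }

dates = lastmod_dates(SITEMAP)
# One distinct date across all pages suggests the sitemap is useless
# for per-page change detection.
suspicious = len(set(dates.values())) == 1
```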
Sites may have an RSS/Atom feed that references post-like pages on the site. The "post date" given for a particular page is usually a stable value that corresponds to when the page was created. However that date is frequently not updated when an existing post-like page's content is modified.
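For illustration, RSS `<pubDate>` values use the RFC 2822 date format, which Python's stdlib can parse directly; the date value below is made up:

```python
from email.utils import parsedate_to_datetime

# Hypothetical RSS <pubDate> value for a post-like page.
pub_date = parsedate_to_datetime("Mon, 02 Jan 2023 09:00:00 GMT")
# This tells us when the post was created - but a later edit to the
# post's content typically leaves this value unchanged, so it cannot
# be trusted as a "last modified" signal.
```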
Since Crystal's inception it's been a long-term design goal to not just allow websites to be downloaded once, but to support downloading them multiple times at different points in time.
For example: It should be possible for me to download the entire xkcd site (with all comics) today, and then next week efficiently download the new set of comics that have appeared since then.
Designing an easy-to-use workflow to support this scenario is not trivial. Many challenges are involved.
Related Tasks:
- `--stale-before` - a date(time) such that all resources downloaded before that date(time) are considered stale†

"Manual workflow v1" is problematic:
Therefore "Manual workflow v2" has been introduced as an alternative.
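The `--stale-before` cutoff described under Related Tasks could be sketched as a simple comparison against each resource's download time (the `is_stale` helper and the dates below are hypothetical, not Crystal's actual implementation):

```python
from datetime import datetime, timezone

def is_stale(downloaded_at, stale_before):
    # A resource fetched before the cutoff is treated as stale and
    # eligible for re-download; newer resources are reused as-is.
    return downloaded_at < stale_before

cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)
old = datetime(2022, 6, 1, tzinfo=timezone.utc)   # downloaded last year
new = datetime(2023, 3, 1, tzinfo=timezone.utc)   # downloaded recently
```

This supports the xkcd scenario above: pass last week's date as the cutoff, and only resources fetched before then are re-downloaded.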