ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Fails to recurse on some sites when the homepage is a 404 #444

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

Recursively crawling https://blog.cyone.ch/ does not work as expected: wpull --recursive --sitemaps https://blog.cyone.ch/ only retrieves the homepage (which is a 404), robots.txt, and the sitemap. In particular, it doesn't follow the URLs mentioned in the sitemap. (It also doesn't extract links on the 404 page, which may be related to #202.)

The 404 alone does not appear to explain this: wpull --recursive --sitemaps https://du-willst-mehr.ch/ does recurse by following the URLs in the sitemap, even though its homepage is also a 404. Its sitemap lists sub-sitemaps rather than URLs directly, which might play a role.
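To check which kind of sitemap each site serves, something like the following could be used. This is only a diagnostic sketch based on the sitemaps.org format (a `sitemapindex` root holding `sitemap` entries versus a `urlset` root holding `url` entries); it is not how wpull parses sitemaps, and the function name `classify_sitemap` and the example.org URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_sitemap(xml_text):
    """Return (kind, locs) for a sitemap document.

    kind is 'index' for a <sitemapindex> (points at sub-sitemaps),
    'urlset' for a <urlset> (lists page URLs directly), else 'unknown'.
    Hypothetical helper for diagnosis, not part of wpull.
    """
    root = ET.fromstring(xml_text)
    kind = {"sitemapindex": "index", "urlset": "urlset"}.get(
        local_name(root.tag), "unknown"
    )
    locs = []
    # Each entry (<sitemap> or <url>) carries its target in a child <loc>.
    for entry in root:
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


# Placeholder document; a real one would be fetched from the site's sitemap URL.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/post-1/</loc></url>"
    "</urlset>"
)
kind, locs = classify_sitemap(sample)
print(kind, locs)
```

Running this against the sitemaps of both sites would confirm whether the index-versus-urlset difference really lines up with the working and non-working cases.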

Both of these were discovered through ArchiveBot, i.e. wpull 2.0.3.