ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Equivalent but differently encoded URLs break no-parent recursion #469

Open JustAnotherArchivist opened 2 years ago

JustAnotherArchivist commented 2 years ago

When running a recursive crawl with --no-parent for https://example.org/~foo/, links to https://example.org/%7Efoo/bar are not followed (and vice versa) because ~ and %7E are not normalised to a single form. I think this should be considered a bug. I assume the same may be true for other characters, but I have only seen the tilde in the wild.
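To illustrate the mismatch, here is a minimal sketch of a naive parent check that compares raw path strings. It is not wpull's actual code; `is_under_parent_naive` is a hypothetical helper written only to show why the two spellings are treated as different directories.

```python
from urllib.parse import urlsplit


def is_under_parent_naive(base_url: str, candidate_url: str) -> bool:
    """Hypothetical --no-parent style check that compares raw path strings."""
    base = urlsplit(base_url)
    cand = urlsplit(candidate_url)
    # Directory part of the base path, e.g. "/~foo/" for "/~foo/" or "/~foo/index.html".
    base_dir = base.path.rsplit("/", 1)[0] + "/"
    same_origin = (base.scheme, base.netloc) == (cand.scheme, cand.netloc)
    # urlsplit does not decode percent-escapes, so "%7E" and "~" stay distinct here.
    return same_origin and cand.path.startswith(base_dir)


print(is_under_parent_naive("https://example.org/~foo/", "https://example.org/~foo/bar"))
print(is_under_parent_naive("https://example.org/~foo/", "https://example.org/%7Efoo/bar"))
```

The first call accepts the link and the second rejects it, although both URLs point at the same resource.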

I'm not entirely sure about the correct solution here. We could force it to either value (probably the encoded one to be safe, as some ancient servers might not support literal tildes; cf. RFC 1738). This would change the URL and might in some very rare cases cause issues. The alternative is to keep URLs as they are but do an equivalence check. This would, however, require extra handling for deduplication of equivalent URLs, and I'm not sure there is a good way to do that without e.g. a separate DB column for a normalised URL.
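A rough sketch of the second option (leave the requested URL untouched, compare on a normalised key) could look like the following. It assumes the RFC 3986 rule that percent-encoded unreserved characters (letters, digits, `-`, `.`, `_`, `~`) are equivalent to their literal form, so it decodes those and only uppercases the hex digits of every other escape. `normalise_escapes` and `seen` are hypothetical names, not part of wpull.

```python
import re
import string

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_escapes(path: str) -> str:
    """Hypothetical comparison key: decode escapes of unreserved characters only."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # "%7E" becomes "~", "%41" becomes "A"
        # Reserved or non-ASCII bytes stay encoded; only the hex case is unified.
        return "%" + match.group(1).upper()
    return _PCT.sub(repl, path)


# Deduplication on the key while remembering the URL spelling that was seen first.
seen = {}
for path in ["/~foo/", "/%7Efoo/", "/%7efoo/bar", "/~foo/bar"]:
    seen.setdefault(normalise_escapes(path), path)

print(seen)
```

The key would only be used for the parent check and for deduplication; the request would still go out with whichever spelling was encountered first. Persisting such a key is exactly the extra column mentioned above, so this does not avoid that cost, it only shows what the key might contain.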

I haven't checked what wget does in this case.