When running a recursive crawl with `--no-parent` for https://example.org/~foo/, links to https://example.org/%7Efoo/bar are not followed (and vice versa) because `~` and `%7E` are not normalised to a single form. I think this should be considered a bug. I assume the same might be true for other characters, but I have only seen the tilde in the wild.
I'm not entirely sure about the correct solution here. We could force URLs to one of the two forms (probably the encoded one to be safe, as some ancient servers might not support literal tildes, cf. RFC 1738). This would change the URL that is requested and might cause issues in some very rare cases. The alternative is to keep URLs as is but do an equivalence check when comparing them. This would however require extra handling for deduplication of equivalent URLs, and I'm not sure there is a good way to do that (one which doesn't involve e.g. a separate DB column for a normalised URL).
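To make the equivalence-check option more concrete, here is a minimal sketch of what such a comparison could look like. It assumes the rule from RFC 3986 (section 6.2.2.2) that percent-encoded octets which correspond to unreserved characters can be decoded for comparison purposes. The names `comparison_key` and `is_under_parent` are hypothetical and not part of any existing code; the stored and requested URL is left untouched, only the key used for comparison is normalised.

```python
import re

# Unreserved characters as defined in RFC 3986, section 2.3.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalise_escape(match):
    # Decode escapes of unreserved characters (e.g. %7E -> ~);
    # keep all other escapes but upper-case their hex digits.
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def comparison_key(url):
    """Hypothetical helper: a form of url used for equivalence checks only."""
    return _PCT.sub(_normalise_escape, url)


def is_under_parent(parent, candidate):
    """Hypothetical --no-parent style check done on the normalised keys."""
    return comparison_key(candidate).startswith(comparison_key(parent))


if __name__ == "__main__":
    print(is_under_parent("https://example.org/~foo/",
                          "https://example.org/%7Efoo/bar"))
```

The same key could also serve for deduplication, which is where the extra storage (such as the separate DB column mentioned above) would come in. Escapes of reserved characters such as `%2F` are deliberately not decoded, since that would change the meaning of the URL.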
I haven't checked what wget does in this case.