Open carlgieringer opened 1 year ago
When I backfilled URL-normalization, I did so with a version of normalizeUrl that always appended a slash to the path if it was missing. This normalized index.html
to index.html/
which is not what we want. I had missed this caveat from https://en.wikipedia.org/wiki/URI_normalization#Normalization_process:
However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
We should re-run URL normalization without this mistake. We should first probably introduce a URL and normalized URL to help with bugs like this in the future, in case we lose information in the normalization.
Fixed in #567
https://github.com/Howdju/howdju/pull/492 added URL normalization and the requesting of canonical URLs. We should backfill these procedures to existing URLs:
url
tonormalizeUrl(url)
.canonical_url_confirmations
. Request the confirmation if none is present.See also #496.