Howdju / howdju

Monorepo for the Howdju crowdsourced fact checking and summarization platform
https://www.howdju.com
GNU Affero General Public License v3.0

Backfill URL normalization and canonicalization #494

Open carlgieringer opened 1 year ago

carlgieringer commented 1 year ago

https://github.com/Howdju/howdju/pull/492 added URL normalization and the requesting of canonical URLs. We should backfill both procedures for existing URLs.

See also #496.

carlgieringer commented 1 year ago

When I backfilled URL normalization, I used a version of normalizeUrl that always appended a slash to the path when one was missing. This normalized index.html to index.html/, which is not what we want. I had missed this caveat from https://en.wikipedia.org/wiki/URI_normalization#Normalization_process:

However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
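For illustration, here is a minimal sketch of a normalizer that respects the caveat above. It is not the repo's normalizeUrl; the function name normalizeUrlSketch is hypothetical, and it relies only on the standard WHATWG URL parser. That parser adds a slash to an empty path but leaves non-empty paths such as index.html alone.

```typescript
// Hypothetical helper, not Howdju's normalizeUrl.
// Assumption: the WHATWG URL parser's built-in normalization is sufficient here.
function normalizeUrlSketch(raw: string): string {
  // Parsing lowercases the scheme and host, drops the default port,
  // resolves "." and ".." segments, and turns an empty path into "/".
  // It does not append a slash to a non-empty path.
  const parsed = new URL(raw);
  return parsed.href;
}

// A file-like path keeps its shape (no trailing slash is added).
console.log(normalizeUrlSketch("https://example.com/index.html"));
// An empty path becomes "/", which is always safe.
console.log(normalizeUrlSketch("https://example.com"));
```

Whether a path without a trailing slash is equivalent to the same path with one can only be learned from the server (for example, via a redirect), so the sketch never guesses.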

We should re-run URL normalization without this mistake. Before that, we should probably store both the original URL and the normalized URL, so that if a future normalization bug loses information we can recompute from the original.
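A rough sketch of what keeping both values could look like, assuming a record with url and normalizedUrl fields. The UrlRecord shape and the renormalize function are hypothetical and not taken from the Howdju schema; the normalizer is passed in so any corrected implementation can be used.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface UrlRecord {
  url: string; // the URL as originally submitted, never overwritten
  normalizedUrl: string; // derived value, safe to recompute at any time
}

// Recompute normalizedUrl from the preserved original for every record.
function renormalize(
  records: UrlRecord[],
  normalize: (raw: string) => string
): UrlRecord[] {
  return records.map((record) => ({
    ...record,
    normalizedUrl: normalize(record.url),
  }));
}

// Example: a record damaged by the trailing-slash bug is repaired
// because the original url was kept alongside the derived value.
const damaged: UrlRecord[] = [
  {
    url: "https://example.com/index.html",
    normalizedUrl: "https://example.com/index.html/",
  },
];
const repaired = renormalize(damaged, (raw) => new URL(raw).href);
console.log(repaired[0].normalizedUrl);
```

Because the original value is never modified, a bad normalization pass becomes a recoverable mistake rather than data loss.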

Fixed in #567