edgi-govdata-archiving / web-monitoring-processing

Tools for accessing, "diff"-ing, and analyzing archived web pages
https://edgi-govdata-archiving.github.io/web-monitoring-processing
GNU General Public License v3.0

Keep track of canonical URLs from page markup in `source_metadata` #728

Open Mr0grog opened 3 years ago

Mr0grog commented 3 years ago

Some pages have a `<link rel="canonical" href="{url}">` element in their markup, indicating the correct, “canonical” URL for the page (more info: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from the Wayback Machine, it would be great to include the canonical URL in the `source_metadata` field.

We already parse the HTML pages we’re importing to get their titles, and we could get the canonical link (if present) at a similar point in the process. Ideally, we should create a way to parse each page only once to get the title, the canonical link, and anything else we might want to extract from the page content in the future.

Where we already parse markup for titles:

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/21512eb804d3212cb4d4458fbbfb8e3e308628c0/web_monitoring/cli/cli.py#L408-L413

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/21512eb804d3212cb4d4458fbbfb8e3e308628c0/web_monitoring/utils.py#L98-L112
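
To make the “parse once” idea concrete, here is a rough sketch of a single pass that collects both the title and the canonical link. It uses the standard library’s `html.parser` only so that it is self-contained; it is not meant to mirror the parser the linked code uses. The names `PageMetadataParser` and `extract_page_metadata` and the `canonical_url` key are hypothetical, not existing parts of this project.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collect the <title> text and the canonical link in one pass.

    Hypothetical helper for illustration; not part of web_monitoring.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.canonical_url = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self._in_title = True
        elif tag == 'link' and self.canonical_url is None:
            attributes = dict(attrs)
            # `rel` is a space-separated, case-insensitive list of tokens.
            rels = (attributes.get('rel') or '').lower().split()
            if 'canonical' in rels and attributes.get('href'):
                self.canonical_url = attributes['href'].strip()

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace in the collected title text.
            self.title = ' '.join(''.join(self._title_parts).split())

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_page_metadata(html_text):
    """Return the title and canonical URL (or None for each) from one parse."""
    parser = PageMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {'title': parser.title, 'canonical_url': parser.canonical_url}
```

The returned dict could then be split by the caller, with the title going wherever it goes today and `canonical_url` merged into `source_metadata`. Where exactly that merge should happen is left open here.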

Mr0grog commented 3 years ago

It might be useful to do this for `<link rel="shortlink">`, too. I had totally forgotten about that one!
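
If more than one `rel` value is worth recording, a small lookup table keeps the extraction generic. The following is only a sketch using the standard library’s `html.parser`; `LINK_RELS`, `LinkRelParser`, `extract_link_rels`, and the metadata key names are all assumptions made for illustration, not existing names in this project.

```python
from html.parser import HTMLParser

# Hypothetical mapping from `rel` tokens to metadata keys (key names are assumed).
LINK_RELS = {'canonical': 'canonical_url', 'shortlink': 'shortlink_url'}


class LinkRelParser(HTMLParser):
    """Record the href of the first <link> seen for each rel in LINK_RELS."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        href = (attributes.get('href') or '').strip()
        if not href:
            return
        # `rel` may hold several space-separated tokens in any letter case.
        for rel in (attributes.get('rel') or '').lower().split():
            key = LINK_RELS.get(rel)
            if key and key not in self.found:
                self.found[key] = href


def extract_link_rels(html_text):
    """Return a dict of the recognized link relations found in the markup."""
    parser = LinkRelParser()
    parser.feed(html_text)
    parser.close()
    return parser.found
```

Supporting another relation later would then only mean adding an entry to the table.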