apache / incubator-annotator

Apache Annotator provides annotation enabling code for browsers, servers, and humans.
https://annotator.apache.org/
Apache License 2.0

Document Identity Determination? #50

Open BigBlueHat opened 5 years ago

BigBlueHat commented 5 years ago

Curious to get thoughts from everyone on whether having document identification determination code would be useful for this project.

By "document identification determination" I mean the process of sorting out which identifier (or identifiers!) should be stored as the target.

For instance:

GET /?utm_source=twitter&utm_medium=social
Host: example.com
<html>
<head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
  <link rel="working-copy" href="newer.html">
  <link rel="ogp:url" href="https://www.example.com/">
  <link rel="schema:url" href="https://www.example.com/index.html">
</head>

The ?utm_-prefixed query params are typical marketing-bot tracking thingies.
The canonical rel is from https://tools.ietf.org/html/rfc6596
The latest-version and working-copy rels are from https://tools.ietf.org/html/rfc5829
The ogp:url is from http://ogp.me/
The schema:url is from http://schema.org/
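As a rough illustration of the tracking-param point, here is a minimal Python sketch that drops ?utm_-prefixed query params from a URL before it is considered as an identifier. The function name strip_tracking_params and the choice to drop the params at all are assumptions made for illustration; nothing here is existing Apache Annotator code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url, prefix="utm_"):
    """Return url without query params whose names start with prefix.

    Hypothetical helper: whether tracking params should be dropped at all
    is a policy decision this sketch simply assumes.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(prefix)
    ]
    # Rebuild the URL with only the surviving params; all other parts are unchanged.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params("http://example.com/?utm_source=twitter&utm_medium=social"))
```

Other query params are kept, since they may well be part of the document's identity (a page number, say).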

At some level all (or most) of these are the same (presumably 😉). However, determining their "sameness" is outside of the scope of an annotation tool (I'd reckon), but storing the right one (or more) is mandatory for the annotation to make sense.

What I'm wondering is if we should provide a basic retrieval mechanism for determining which of them exist and what value each might have for the annotation. At the very least it would be handy to get back a list of all stated identifiers for the current document.
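To make the "list of all stated identifiers" idea concrete, here is a minimal sketch in Python using only the standard library. The names IDENTIFIER_RELS, IdentifierCollector, and stated_identifiers are hypothetical, the rel list simply mirrors the example markup earlier in this issue, and none of this is an existing Apache Annotator API. It only lists what the document states; it makes no attempt to decide which identifiers are "the same".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the rel values from the example markup; a real tool would
# probably make this list configurable.
IDENTIFIER_RELS = {"canonical", "latest-version", "working-copy", "ogp:url", "schema:url"}


class IdentifierCollector(HTMLParser):
    """Collects the first <base href> and any identifier-like <link rel> hrefs."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.links = []  # (rel, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "link":
            # rel holds a space-separated list of tokens.
            for rel in (attributes.get("rel") or "").lower().split():
                if rel in IDENTIFIER_RELS:
                    self.links.append((rel, href))


def stated_identifiers(html, retrieved_url):
    """Return (source, absolute URL) pairs, starting with the retrieved URL."""
    collector = IdentifierCollector()
    collector.feed(html)
    # Relative hrefs resolve against <base href> when the document has one.
    base = urljoin(retrieved_url, collector.base_href or "")
    result = [("retrieved", retrieved_url)]
    result += [(rel, urljoin(base, href)) for rel, href in collector.links]
    return result


sample = """<html><head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
</head>"""
for source, url in stated_identifiers(sample, "http://example.com/?utm_source=twitter"):
    print(source, url)
```

Returning every stated identifier alongside the retrieved URL leaves the choice of which one (or more) to store up to the caller.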

A real-world scenario (which I just tripped over): W3C Editor's Draft specs hosted on GitHub (or locally) have their future Technical Report (TR) URLs set as the rel="canonical" (which ReSpec injects after the page loads). Consequently, annotating the Verifiable Claims Data Model is hampered if only the canonical URL is stored, because the spec has not yet reached TR.

It's that "other" part of annotation creation that's so fun. 😁

💭's?

tilgovi commented 5 years ago

I have long thought that mozilla/fathom would be an interesting tool to use for such things.