berkmancenter / cache-link

Specifications for marking up cached copies of hyperlink targets in HTML.

comma separator for URIs leads to parsing issues #14

Open karlcow opened 10 years ago

karlcow commented 10 years ago

In the current specification (Editor’s Draft, 13 May 2014), the section "The mset Attribute" states:

If present, its value must consist of one or more reference candidates, each separated from the next by a "," (U+002C) COMMA character.

However, the comma is a permitted character in URIs. See the URL spec:

The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

For example, a well-known website uses commas in its URIs:

http://www.w3.org/,tools
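
To illustrate the problem, here is a minimal Python sketch. It is not taken from the specification: the function name `split_on_commas` and the second URL are hypothetical, and each reference candidate is assumed to be a bare URL with no extra metadata.

```python
def split_on_commas(mset_value: str) -> list[str]:
    """Naively split an mset-style value on U+002C COMMA."""
    return [part.strip() for part in mset_value.split(",") if part.strip()]


# Hypothetical attribute value holding two URLs; the first contains a comma.
value = "http://www.w3.org/,tools, http://example.org/copy"

# The first URL is cut at its own comma, so two URLs come back as three pieces.
print(split_on_commas(value))
```

A parser written this way cannot tell a separator comma from a comma that belongs to a URL.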

ryanttb commented 10 years ago

Thanks for the clarification! I originally thought commas would have to be percent-encoded, but I see now that the URL spec allows them unescaped:

A path segment must be zero or more URL units, excluding "/" and "?". The URL units are URL code points and percent-encoded bytes.

I was taking inspiration from the srcset attribute which, as far as I can tell, separates image candidate strings with commas where the only required part of an image candidate string is a URL.

http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/#image-candidate-string

Please feel free to suggest cleaner ways to embed the info.

ryanttb commented 10 years ago

We could go with a space-separated solution, since the ping attribute on the a element has already set a precedent ( http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#links-created-by-a-and-area-elements ), but that removes the author's ability to provide the relationship and date of each copy link.
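
As a rough sketch of the space-separated idea, assuming the value is a list of bare URLs split on ASCII whitespace (the function name `split_on_whitespace` and the second URL are hypothetical):

```python
import re

# ASCII whitespace: space, tab, line feed, form feed, carriage return.
ASCII_WHITESPACE = re.compile(r"[ \t\n\f\r]+")


def split_on_whitespace(value: str) -> list[str]:
    """Split a space-separated list of URLs, dropping empty tokens."""
    return [token for token in ASCII_WHITESPACE.split(value) if token]


# Hypothetical attribute value; the comma stays inside the first URL.
value = "http://www.w3.org/,tools http://example.org/copy"
print(split_on_whitespace(value))
```

Because an unencoded space cannot appear inside a valid URL, the separator is unambiguous here, at the cost of leaving no room for per-copy metadata.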

The relationship and date can instead be retrieved from the cache server if we require all mset URLs to point to servers implementing the Memento specification: http://www.mementoweb.org/news/node/49

Currently that means only the Internet Archive and any MediaWiki installation with the Memento extension, but in the near future it should also include Perma.cc and Drupal/WordPress (with the Internet Robustness/Amber plugin).
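
For the date part, a client could read the `Memento-Datetime` response header that Memento-compliant servers send with an archived copy. The following is only a sketch under that assumption: the function name `memento_datetime` and the sample headers are hypothetical, no request is actually made, and it does not cover the relationship.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def memento_datetime(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the capture time from a Memento-Datetime header, if present."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("memento-datetime")
    return parsedate_to_datetime(raw) if raw is not None else None


# Hypothetical response headers; no network request is made here.
example_headers = {"Memento-Datetime": "Tue, 13 May 2014 00:00:00 GMT"}
print(memento_datetime(example_headers))
```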

Thoughts?