ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0
14.89k stars 3.89k forks source link

Allow rel="canonical" on A as well as LINK #596

Closed adactio closed 8 years ago

adactio commented 9 years ago

Currently the AMP documentation requires a link rel="canonical" in the head of the document. But there are two linking mechanisms that can accept rel values: link and a.

How about allowing rel="canonical" to sit on a elements as well?

Here's an example of an AMP HTML document with both: https://adactio.com/journal/9646/amp

That page currently fails validation because it has rel="canonical" on an a element that links back to the regular HTML version.

cramforce commented 9 years ago

@powdercloud I think we should allow arbitrary rel values on a. Did we only whitelist some so far or allow none? nofollow and noreferrer are also needed there.

Gregable commented 9 years ago

We whitelisted rel= values. We had a discussion about this offline. The general issue is that some rel= values trigger client behavior that we don't want AMP pages to have. The following values were of particular concern: import, components, prerender, subresource, preconnect, prefetch, preload. Since there isn't a clear standardized list of these and different clients may have different supported values or may add some in the future, we went with a whitelist approach. I manually went through and tried to be as exhaustive as possible with whitelisting things though. Here's what we have for link:

             "accessibility|"
             "alternate|"
             "archives?|"
             "attachment|"
             "appendix|"
             "author|"
             "bibliography|"
             "category|"
             "chapter|"
             "cite|"
             "comment|"
             "contents|"
             "contribution|"
             "copyright|"
             "directory|"
             "discussion|"
             "DCTERMS\\..*|"  # Dublin Core Metadata
             "EditURI|"
             "endorsed|"
             "fan|"
             "feed|"
             "footnote|"
             "glossary|"
             "group|"
             "help|"
             "home|"
             "hub|"
             "in-reply-to|"
             "index|"
             "indieauth|"
             "issues|"
             "its-rules|"
             "jslicense|"
             "license|"
             "made|"
             "map|"
             "me|"
             "meta|"
             "micropub|"
             "microsummary|"
             "next|"
             "nofollow|"
             "noreferrer|"
             "openid2?\\..*|"  # OpenID Authentication
             "p3pv1|"
             "payment|"
             "pingback|"
             "pgpkey|"
             "privacy|"
             "pronounciation|"
             "profile|"
             "publisher|"
             "prev|"
             "previous|"
             "referral|"
             "related|"
             "replies|"
             "reply-to|"
             "search|"
             "section|"
             "service|"
             "shortlink|"
             "source|"
             "sidebar|"
             "sitemap|"
             "sponsor|"
             "start|"
             "status|"
             "syndication|"
             "subsection|"
             "timesheet|"
             "toc|"
             "token_endpoint|"
             "trackback|"
             "transformation|"
             "unendorsed|"
             "user|"
             "vcalendar-parent|"
             "vcalendar-sibling|"
             "webmention|"
             "yandex-tableau-widget"

Along with special rules for stylesheet, amphtml, and canonical.

The a tag is similar with support for:

             "accessibility|"
             "author|"
             "alternate|"
             "appendix|"
             "alternate|"
             "archived|"
             "archives?|"
             "attachment|"
             "appendix|"
             "author|"
             "bookmark|"
             "bibliography|"
             "category|"
             "chapter|"
             "cite|"
             "colleague|"
             "comment|"
             "contact|"
             "contents|"
             "contribution|"
             "copyright|"
             "directory|"
             "discussion|"
             "EditURI|"
             "endorsed|"
             "fan|"
             "feed|"
             "footnote|"
             "glossary|"
             "group|"
             "help|"
             "home|"
             "hub|"
             "in-reply-to|"
             "index|"
             "indieauth|"
             "issues|"
             "its-rules|"
             "jslicense|"
             "license|"
             "made|"
             "map|"
             "me|"
             "meta|"
             "micropub|"
             "microsummary|"
             "next|"
             "nofollow|"
             "noreferrer|"
             "openid2?\\..*|"  # OpenID Authentication
             "p3pv1|"
             "payment|"
             "pingback|"
             "pgpkey|"
             "privacy|"
             "pronounciation|"
             "profile|"
             "publisher|"
             "prev|"
             "previous|"
             "referral|"
             "related|"
             "replies|"
             "reply-to|"
             "search|"
             "section|"
             "service|"
             "shortlink|"
             "source|"
             "sidebar|"
             "sitemap|"
             "sponsor|"
             "start|"
             "status|"
             "syndication|"
             "subsection|"
             "tag|"
             "timesheet|"
             "toc|"
             "token_endpoint|"
             "trackback|"
             "transformation|"
             "unendorsed|"
             "user|"
             "vcalendar-parent|"
             "vcalendar-sibling|"
             "webmention|"
             "yandex-tableau-widget"

These lists were generated by looking at http://microformats.org/wiki/existing-rel-values

Canonical isn't allowed for a since that resource mentions that it isn't allowed for a.

This is kind of an ugly part of the validator because it's a value set that's not well standardized but also has potential behavioral affects in clients. I'd strongly prefer not to go the blacklist route as a result.

Gregable commented 9 years ago

Also, as I mentioned in https://github.com/ampproject/amphtml/issues/498, we don't currently allow whitespace separation, which is something we need to fix, especially for a tags.

adactio commented 9 years ago

Canonical isn't allowed for a since that resource mentions that it isn't allowed for a.

That's definitely an error in the registry documentation. Sorry about that. I'll get that changed.

Gregable commented 9 years ago

Ah, cool. @cramforce , this might need some more thought for the amphtml spec. Currently the spec requires a <link rel=canonical> tag in the <head> somewhere. Would we accept an a tag instead? Do the hrefs need to be the same, etc.

Out of curiosity, what does it mean for an a tag to have rel=canonical? Is it something like: yadda yadda <a href='/foo.html' rel='canonical'>canonical page here</a> yadda I'm unsure if search engines support this.

cramforce commented 9 years ago

Unless we get good signal from Bing, Yandex, Google, etc. that <a href='/foo.html' rel='canonical'> is supported well I think keeping the link tag requirement is important.

adactio commented 9 years ago

@cramforce You can totally keep the link element requirement, but all I'm asking that is that the rel value of "canonical" be allowed on a elements.

cramforce commented 9 years ago

Right, yeah, we will!

On Tue, Oct 13, 2015 at 1:32 PM Jeremy Keith notifications@github.com wrote:

@cramforce https://github.com/cramforce You can totally keep the link element requirement, but all I'm asking that is that the rel value of "canonical" be allowed on a elements.

— Reply to this email directly or view it on GitHub https://github.com/ampproject/amphtml/issues/596#issuecomment-147843795.

adactio commented 9 years ago

The rel registry has been updated:

http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions

Apologies once again for the confusion.

tigt commented 9 years ago

Search engines and browsers tend to only act on <link> elements in the <head> because one needs a reasonable amount of control the website to put anything there. Otherwise user-generated content can get up to a lot of mischief: canonical, prerender, a different RSS feed/OpenSearch, etc.

I know of exactly one crawler that bothers to look at rel values on <a>, and that's Piperka. The rest didn't want to start screwing over websites that didn't implement any rel filtering before it became important.

Gregable commented 9 years ago

rel=canonical is now allowed on a as well as link tags. I don't have rights to close this, so, @cramforce to do the honors.

adactio commented 8 years ago

I'd like to re-open this issue please. I don't know if it's a regression or if this slipped through the cracks, but the AMP validator is currently flagging up rel="canonical" on A elements as errors.

The attribute 'rel' in tag 'a' is set to the invalid value 'canonical'.

Example: https://adactio.com/journal/9646/amp#development=1

So this is the experience on Android: https://twitter.com/m_ott/status/749649802036666368

It's like XHTML2 all over again.

erwinmombay commented 8 years ago

/to @Gregable to check for regression

cramforce commented 8 years ago

Lets fix.

@adewale Could you talk to Medium that they keep their links up-to-date. For their type of linking it might be better if they have the cache emit a 302 when the doc is invalid. I think we should look at building that to avoid the XHTML2 mess.

adewale commented 8 years ago

Done.

Gregable commented 8 years ago

Looks like this regressed in March when we switched rel= values from whitelists to blacklists. Will fix and add tests.

Gregable commented 8 years ago

Fixed, will roll out to cdn.ammproject.org soon.

Gregable commented 8 years ago

Should be live on cdn.ammproject.org.