Cloudflare (and maybe other) CDN breaking AMP pages

weeblr commented 8 years ago

Hi, After this discussion, https://productforums.google.com/forum/#!msg/webmasters/_EFTDi1koJo/EAoRom0fIQAJ, Tomo T suggested we open an issue here.

It's a known issue (#1662) that Cloudflare (and maybe other) are breaking AMP pages by inserting on the fly some javascript that do not comply with AMP specs.

Tomo T mentioned you guys are in touch with cloudflare, so maybe resolution is on its way, but in our wbAMP plugin for Joomla, we did the following: 1 - added no-cache headers 2 - Added a X-amphtml: wbAMP header, so that users can setup rules to bypass those pages in their CDN control panel for instance.

Is there any merit to this?

Rgds

Gregable commented 8 years ago

I'm not an expert on cloudflare, but I went in and played around with the settings on a free account to see what cloudflare is doing. I think the feature which breaks AMP validation is called Rocket Loader. It seems, among other changes, to insert into every HTML document head the following snippet:

<script type="text/javascript">
//<![CDATA[
[Cloudflare javascript loader code here]
//]]>
</script>

This loads javascript from cloudflare's site, which isn't part of the AMP runtime and this is invalid for AMP. From my understanding of Rocket Loader's description, valid AMP pages won't be improved by Rocket Loader as they already load javascript asynchronously. As a workaround you can use today, you can disable Rocket Loader for amp pages.

The Rocket Loader feature appears to be under the "Speed" heading in cloudflare's web console, but it would probably be better to configure it just for AMP pages by setting up a specific "Page Rule" that disables Rocket Loader only for the section of your site that is AMP.

weeblr-dev commented 8 years ago

Thanks for your reply. We had seen the JS being inserted, but it's good to know this particular feature is involved. I will pass the information. Nevertheless, the question stands whether it could be useful to add a specific header to amp page so that his process could be integrated by CDN or proxies?

Rgds

jmadler commented 8 years ago

I've let cloudflare know as well. Generally CDNs could detect the doc as AMP HTML and avoid the JS insert (and, perhaps, do AMP specific things like image creation and srcset population).

weeblr commented 8 years ago

Hi there,

@Gregable Our user disabled Rocker Loader on his site, and that solved the validation issue. However, it appears the Rocket Loader feature is something that can only be enabled/disabled for an entire subdomain, not using Page Rules, ie you can't exclude only some URLs.

So only solutions are a/ drop it entirely, or b/ move AMP pages to a subdomain, which is a bit involevd.

I would assume Cloudflare would want to be able to identify AMP pages and disable their plugin then. As @jmadler said, they can just look at the amp attribute in the html tag?

I still think a header or something that doesn't require looking at the actual page content to recognize an AMP page would be helpful in the real world.

Rgds

buro9 commented 8 years ago

Hi, I work for CloudFlare but am not working in this specific area, it's been raised internally and this issue is being tracked and we'll look to get a fix out as soon as possible. If I have news on this, I'll update this bug. Feel free to ping me for updates on this.

buro9 commented 8 years ago

For now, a CloudFlare customer will want to disable:

Speed > Mirage
Speed > Rocket Loader
Scrape Shield > Email Obfuscation

Those can be disabled using page rules, such that you can leave the features enabled for the rest of the content served.

We'd recommend leaving the following enabled for AMP:

Speed > Auto Minify
Speed > Polish

eperezf commented 8 years ago

@buro9 I did everything via a page rule. The problem is still there. Using Wordpress and CloudFlare. "The attribute 'type' in tag 'script type=application/ld+json' is set to the invalid value 'text/javascript'." is the only error left. Is there any waiting time for the settings to apply?

buro9 commented 8 years ago

@eperezf I've raised that as a bug ticket internally. I am curious, does disabling HTML minification resolve that?

Settings are usually applied globally within seconds.

dknecht commented 8 years ago

@jmadler I work for CloudFlare and would be interested to chat about ways to collaborate and potentially add AMP specific logic to our edge.

eperezf commented 8 years ago

@buro9 I think HTML Minification is not the issue here as the problem is the script type "text/javascript" inserted in the section. Just like Google Analytics, it should be formatted as "ld+/json" maybe.

I just checked again the validator and the error changed to "The tag 'script' is disallowed except in specific forms". Maybe if the script can be loaded form an external file instead of being inserted on the head section it may work.

dvoytenko commented 8 years ago

No, external file will not be allowed to be loaded either.

eperezf commented 8 years ago

It can using <script async src="">. At least it does on my website. The problem is inline javascript from Cloudflare using <script type="text/javascript">.

dvoytenko commented 8 years ago

On the main document, <script async src> is only allowed to the AMP runtime or extensions from cdn.ampproject.org.

dknecht commented 8 years ago

Can CloudFlare assume that an AMP Crawler will always have '(compatible; Google-AMPHTML)' in the user agent or at least AMPHTML?

jmadler commented 8 years ago

No, you'll need to check for the presence of "AMP" or "⚡" in the tag, case insensitive.

jmarantz commented 8 years ago

FWIW, mod_pagespeed has a parallel set of issues. We are working on solving those now, and you can follow our progress here: https://github.com/pagespeed/mod_pagespeed/issues/1263

The fact that we have to sniff content to tell something is an AMP document complicates the solution somewhat, particularly when streaming HTML. But it is all do-able (and in fact has already been done, just not all code-reviewed yet).

buro9 commented 8 years ago

That is the root of the issue we have too. For the most part we do not parse or otherwise look at the content of HTML that passes through CloudFlare. Some features do in fact require a parser, but many do not and up to now could safely be added to a HTML document without parsing or otherwise inspecting the HTML document content.

For performance you may imagine that we only parse that which we absolutely must parse, which is a subset of all text/html documents that pass through CloudFlare.

It would be great if there were a header that declared an AMP document, the content-type would be best for this and currently most AMP pages return text/html which does not identify them as AMP. Something like text/html+amp would be great but probably has it's own issues (many legacy things that only expect text/html and nothing else).

We will probably also move to sniffing the html tag, but until then the advice for AMP users will be to disable features that are not AMP aware.

weeblr commented 8 years ago

@buro9 This is what I suggested in the OP, but indeed I didn't think a separate content type value would be practical. Hence might be a good idea to have a separate header? Right now we add:

X-amphtml: wbAMP

as wbAMP is our extension to produce AMP-compliant page from Joomla pages, but obviously something more generic would be a better fit.

I also believe that having to actually parse (though in a limited way) the actual content to decide whether to leave it alone or process it is a major overhead, and will prevent adoption.

jmarantz commented 8 years ago

I would think that the features that CloudFlare offers that might break AMP are a proper subset of the ones that have to parse HTML anyway, but I have to admit I don't understand fully how rocket-loader works.

In any case I agree it would've been a lot more convenient for all involved if amp documents were required by the spec to self-identify with an http response header.

weeblr commented 8 years ago

I would think that the features that CloudFlare offers that might break AMP are a proper subset of the ones that have to parse HTML anyway, Pretty inefficient to have to do the parsing to discover that you won't be able to the use the parsed content because this is an AMP page. Better look at a header, to know that a specific set of features won't apply, and you don't need to parse (or apply whatever processing you want to apply)

dvoytenko commented 8 years ago

I think "parse" is a bit of a strong word here. It's just a very top header of the page that needs to be inspected. I don't think we can require all publishers to include a header for all amp pages - that would seem to be error-prone and duplicative.

Gregable commented 8 years ago

Would it help if we added validation checks that the tag is within the first maybe 100 bytes of the document? Other than comments, it's already required to be the 2nd tag on the page, but comments and whitespace make it possible to be an arbitrary length from the top.

jridgewell commented 8 years ago

Why not just ban comments until after the opening html tag? Then it's guaranteed to be before the second > character.

buro9 commented 8 years ago

Within the first 100 bytes is definitely going to be of great help. That's enough to allow us to decide what to enable/disable, whilst allowing us to send headers as early as possible.

This at least allows this to be a sniffer and not a parser and the constraint allows it to be tightly defined which is good for security (mitigates resource exhaustion during a DDoS).

Content-type is still preferable, but a 100 byte limit, whereby the AMP string or lightning rune must fully complete within the first 100 bytes would be good.

On Wed, 16 Mar 2016, 21:42 Justin Ridgewell, notifications@github.com wrote:

Why not just ban comments until after the opening html tag? Then it's guaranteed to be before the second > character.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/ampproject/amphtml/issues/2380#issuecomment-197560035

jmarantz commented 8 years ago

The first 100 bytes might be tight. E.g. from https://en.wikipedia.org/wiki/Document_type_declaration

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html lang="ar" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">

On Wed, Mar 16, 2016 at 5:54 PM, David Kitchen notifications@github.com wrote:

Within the first 100 bytes is definitely going to be of great help. That's enough to allow us to decide what to enable/disable, whilst allowing us to send headers as early as possible.

This at least allows this to be a sniffer and not a parser and the constraint allows it to be tightly defined which is good for security (mitigates resource exhaustion during a DDoS).

Content-type is still preferable, but a 100 byte limit, whereby the AMP string or lightning rune must fully complete within the first 100 bytes would be good.

On Wed, 16 Mar 2016, 21:42 Justin Ridgewell, notifications@github.com wrote:

Why not just ban comments until after the opening html tag? Then it's guaranteed to be before the second > character.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub < https://github.com/ampproject/amphtml/issues/2380#issuecomment-197560035>

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/ampproject/amphtml/issues/2380#issuecomment-197567339

buro9 commented 8 years ago

I'm new to AMP, but I was under the impression that the DOCTYPE was dictated as HTML5: https://www.ampproject.org/docs/reference/spec.html

If not, and any valid DOCTYPE is acceptable then something more like 256 bytes would be better, though we've just doubled the proposed sniffer requirement and I'm thinking the content-type is still preferable.

jridgewell commented 8 years ago

The first 100 bytes might be tight.

Not an issue, we mandate <!doctype html> (HTML5) doctype.

Gregable commented 8 years ago

Content-type would be nice, and could be recommended, but plenty of content creators don't have the ability or the know-how to modify headers, therefore I think we will want to support tags.

We do mandate <!doctype html> for a valid doc currently. Nobody has proposed supporting DTDs yet, though if we want them we should resolve that asap.

The value of 100 was a strawman. I like 256 even more. I could try to get an estimate of what we're seeing in the wild, but given that DTDs are not allowed now, it might not be a good metric to look at.

Not allowing comments would be possible, but is tricky for some validation implementations if they run on top of an HTML parser which doesn't have callbacks for comments. We'd also definitely need to look at this in the wild, there may be some CMS systems that need a comment header for reasons. tl;dr: needs more research.

How about 256? And to be specific, I mean the trailing > in the <html amp> tag must be the 256'th byte or lower.

buro9 commented 8 years ago

256 bytes is good, I'll run the proposal internally tomorrow. We'd not stop asking for the content-type (especially as there is incompatibility with a user agents expectation of what text/html means and what it's about to receive), but lacking that... this constraint is a good one.

I mean the trailing > in the tag must be the 256'th byte or lower.

Agreed.

We would expect to implement a sniffer that would exit once it encounters either the closing > on the HTML tag, or 256 bytes. Whichever is sooner.

dvoytenko commented 8 years ago

I think if we scan 256 bytes - might as well do 1000 bytes - that doesn't change anything but would remove most of edge cases.

dvoytenko commented 8 years ago

Also, i think it'd be just as easy to wait until the first > to do the check - it's common for server-side filters to look in the <!doctype> and <html> declarations.

buro9 commented 8 years ago

Please no.

1000 bytes is excessive. We serve a lot of pages and need to think of both performance and security. 256 is sufficient for performance whilst also mitigating some of the risk of resource exhaustion.

We definitely wouldn't wait for the first > , what if someone serves a 900MB video file as text/html and we're sitting there buffering it all during a sniff operation in anticipation of a character that won't arrive. People do this stuff every day, best not to ask why they do and accept that they do.

If we are going to sniff all text/html (something we're not doing today) and cannot just use a content-type (the right solution), we really want well defined rules for this. First 256 bytes to include the closing HTML tag, and contains either AMP or lightning emoji - this is a good rule and not incompatible with anything in the AMP spec. It just clarifies and defines something currently ambiguous.

There is no need to look further into the document, we just want to know very quickly whether this is an AMP doc or not.

The issue that mod_pagespeed also describe on their linked issue above with http-equiv meta refresh is similar... they would want to detect AMP very early so that they could adjust the HTTP headers if needed. Once they have gone beyond the sniff operation they (like us) want to be streaming that content and not holding it in a buffer.

All of us who provide reverse proxies, transparent proxies, transformation proxies, need to be able to be sure that we can identify AMP extremely early, within the scope of a single tiny buffer.

jmadler commented 8 years ago

Let's first agree on an approach and then discuss content-type and scan maximum size.

My concern with the approach of "closing caret of html tag must be within 256 bytes and html tag must contain 'AMP' or '⚡', case insensitive" is that it may not be as performant as typical approaches for identifying content.

What about something like "'<html AMP' or '<html ⚡' must appear in the first N bytes, case insensitive, whitespace compressed". This falls in line with typical rulesets for magic words. However, it does mean that is invalid.

Does that approach make sense?

jmarantz commented 8 years ago

I am aligned with Cloudflare's concerns about the tag being in the first 256 bytes. The in-flight mod_pagespeed implementation doesn't have that limit but probably it should, to reduce the risk of unbounded buffer growth.

I don't think it makes a lot of sense to compress whitespace when computing that window. Either the proxy has to first the sniff N bytes or it has to do something more complicated.

I am not a fan of requiring that exact byte sequence "<html AMP". In all other cases, HTML attributes are not order sensitive so I don't think the standard should dictate that this one is order sensitive either.

I also think that if someone puts:

in the first 256 bytes, that should not be recognized as an AMP document. So just scanning for that byte sequence in the first 256 bytes seems wrong to me.

FWIW here is our open-sourced amp-recognition filter, currently insensitive to byte-count. Outside this context, MPS sniffs all content-type:text/html input, expecting the first non-whitespace character to be a '<'. This implementation ignores all whitespace and comments, but will give up at the first tag other than <doctype ...> or simply

, as well as at the first non-whitespace character. https://github.com/pagespeed/mod_pagespeed/blob/master/pagespeed/kernel/html/amp_document_filter.cc We can easily enforce the 256-character limit outside the context of this filter by only letting this filter run on the first 256 bytes of text we see, Our full HTML lexer runs in front of this, but I don't think that'll be a speed issue.

rudygalfi commented 8 years ago

Just wanted to see if folks had any updates to share here?

uraayush commented 8 years ago

There is https and http issue in AMP page as While validation AMP page it only accept amp js on https not on http How to resolve that issue. The mandatory tag 'amphtml engine v0.js script' is missing or incorrect. (see https://www.ampproject.org/docs/reference/spec.html#required-markup)

dvoytenko commented 8 years ago

/cc @Gregable

Gregable commented 8 years ago

@uraayush, I think that this is separate from the cloudflare issue. My understanding is that our general recommendation for these types of questions is to ask over at the amphtml component at stackoverflow. I'll still try to answer it.

I suspect you are including the mandatory script tag as: <script async src="http://cdn.ampproject.org/v0.js"></script>, ie: as an http URL It must instead always be included as an https URL, which is fine to do from an http document, so: <script async src="https://cdn.ampproject.org/v0.js"></script>

jmarantz commented 8 years ago

Note: the canonical minimal valid AMP html document has an Apache 2.0 copyright comment at the top:

https://github.com/ampproject/amphtml/blob/master/validator/testdata/feature_tests/minimum_valid_amp.html

the html lightning-bolt tag appears at about character 779 in that file (more or less..I'm not sure how Emacs counts utf8 chars :)

rhising commented 8 years ago

I use CloudFlare on a wordpress site and it didn't work until i added a page rule in CloudFlare. I essentially turned off everything under the pattern www.domain.com//amp/

It worked for me so hopefully it will help someone else.

rudygalfi commented 8 years ago

I vote for a 256 byte limit. Who owns the next action? Is that validator enforcement/warning? cc @Gregable

Gregable commented 8 years ago

Thanks for pinging this. I chatted with @dvoytenko and I think we also agreed to a 256 byte limit, but decided that it wouldn't be something the validator would enforce, but instead we should just document it as a suggestion / possibly make it a validator warning.

weeblr commented 8 years ago

Hi all, Glad to see the discussion has progressed, through I'm not sure where Cloudflare actually stands right now. Anyway, this is just to mention that similar problem appears with NewRelic, which PHP agent (I assume the same with other agents, just haven't dealt with anything else yet) also insert script tags into the page for some of their features. There is no current fix available, short of disabling the NewRelic agent from the CMS/application generating the AMP page.

cheers

RReverser commented 8 years ago

Hi, yet another CloudFlare developer here :) As of the moment, AMP pages (any pages with <html amp> or <html ⚡>) should be excluded from our transformations. Please let me know if you still experience issues and will be happy to look into it.

weeblr-dev commented 8 years ago

Hi @RReverser

Thanks for updating us. Can you confirm the presence of the AMP tag anywhere on the page is enough, or is there a 256 or more first-characters-in-page limit to be aware of, as mentioned earlier on this issue?

Thanks

RReverser commented 8 years ago

@weeblr AMP page, unlike generic HTML, has strict rules in regard to HTML structure, so I went ahead and implemented it without any limit (that is, any valid AMP page should be excluded from the transformations). Note that this code was just recently deployed to production endpoints, but as far as I know, should be already available on all of them.

venkatgig commented 8 years ago

@RReverser Still facing the issue. Except those script tags added by Cloudflare, HTML is a valid AMP HTML. Is there any other configuration needs to be changed in cloudflare side?

The following code has been added to the bottom of my page.

<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","licenseKey":"xxxxxx","applicationID":"xxxx","transactionName":"xxxxxxxxxxxx==","queueTime":0,"applicationTime":193,"ttGuid":"","agentToken":"","atts":"xxxxxxxx=","errorBeacon":"bam.nr-data.net","agent":""}</script>

The following code at the top of my page.

weeblr commented 8 years ago

@venkatgig @RReverser As can be seen in the script itself, this is not coming from Cloudflare but instead from NewRelic. Per my previous message, the only fix is to disable the NewRelic agent for the page, which can be done programmatically only, AFAIK

venkatgig commented 8 years ago

@RReverser @weeblr Sorry guys, my bad. Weeblr thanks for pointing out.

RReverser commented 8 years ago

@weeblr Right, thanks!

ampproject / amphtml

Cloudflare (and maybe other) CDN breaking AMP pages #2380