ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0
14.89k stars 3.89k forks source link

AMP Cache throws error when querystring params are passed on certain domains #19892

Closed samparnell closed 5 years ago

samparnell commented 5 years ago

What's the issue?

This is specific to the viewer version of the amp cached page.

On some domains the amp cache returns a "Sorry, this page is not valid AMP HTML" result for a valid page if additional querystring params are passed on the url. The following example on yahoo behaves differently if the querystring is ...?amp_js_v=0.1 vs. ...?amp_js_v=0.1&a=b.

This was the first example I was able to replicate on - seems to be the same problem for all Yahoo Sports pages. I have seen a similar result before for other domains and will try to reproduce elsewhere. This came us as we pass utm params for google analytics to partners. It seems that any additional parameter causes the issue though.

How do we reproduce the issue?

This url works: https://www-yahoo-com.cdn.ampproject.org/v/s/www.yahoo.com/amphtml/sports/bucks-cavaliers-preview-033542929--nba.html?amp_js_v=0.1#origin=https%3A%2F%2Fdistributedmedialab.com&cap=swipe%2CnavigateTo%2Cfragment%2CerrorReporter&horizontalScrolling=0&prerenderSize=1&visibilityState=visible

This one doesn't: https://www-yahoo-com.cdn.ampproject.org/v/s/www.yahoo.com/amphtml/sports/bucks-cavaliers-preview-033542929--nba.html?amp_js_v=0.1&a=b#origin=https%3A%2F%2Fdistributedmedialab.com&cap=swipe%2CnavigateTo%2Cfragment%2CerrorReporter&horizontalScrolling=0&prerenderSize=1&visibilityState=visible

Same params, different domain works fine: https://hoopshype-com.cdn.ampproject.org/v/s/hoopshype.com/2018/12/14/nba-power-rankings-raptors-warriors-thunder-celtics/amp/?amp_js_v=0.1&a=b#origin=https%3A%2F%2Fdistributedmedialab.com&cap=swipe%2CnavigateTo%2Cfragment%2CerrorReporter&horizontalScrolling=0&prerenderSize=1&visibilityState=prerender

torch2424 commented 5 years ago

Triaging to @Gregable 😄

Gregable commented 5 years ago

The query params are simply passed through when requesting the document from the publisher:

https://www.yahoo.com/amphtml/sports/bucks-cavaliers-preview-033542929--nba.html?a=b

That page doesn't have an amp attribute on it's <html> tag when fetched by the google crawler, so it's invalid. This is a publisher issue, not an AMP cache issue.

The URL you linked to (https://www-yahoo-com.cdn.ampproject.org/v/s/www.yahoo.com/amphtml/sports/bucks-cavaliers-preview-033542929--nba.html?amp_js_v=0.1&a=b#origin=https%3A%2F%2Fdistributedmedialab.com&cap=swipe%2CnavigateTo%2Cfragment%2CerrorReporter&horizontalScrolling=0&prerenderSize=1&visibilityState=visible) also gives you the error codes that the validator sees, which is how I debugged this.

samparnell commented 5 years ago

@Gregable I'm not sure that explains why the standard amp cache version (/c/s/) works but the viewer version (/v/s/) does not. They should either both work or both throw an error right?

I should note that the viewer url I originally posted with the "a=b" param now does work in the amp cache. But if the param is changed to a=c it does not. It seems like the a=b version is now cached and working but if the param value is changed it fails to retrieve the page. (https://www-yahoo-com.cdn.ampproject.org/v/s/www.yahoo.com/amphtml/sports/bucks-cavaliers-preview-033542929--nba.html?amp_js_v=0.1&a=b#origin=https%3A%2F%2Fdistributedmedialab.com&cap=swipe%2CnavigateTo%2Cfragment%2CerrorReporter&horizontalScrolling=0&prerenderSize=1&visibilityState=visible)

I agree that the underlying page is missing the amp tag in the html element (not my page so out of my control). If that is not a valid page it shouldn't be in the amp cache in the first place right?

Gregable commented 5 years ago

The cache itself will fetch anything you ask it to, it just won't serve invalid documents.

Google search results shouldn't include invalid pages.

Note that all of these have a timing element to them, moreso since results can be cached. Perhaps at one point in the past, a request succeeded, and that was cached. The cache won't remove a valid document to make room for an invalid one, for example.

I expect all of this has to do with instability on the publisher (yahoo's) origin.