ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0

amp-live-list update problem with google amp cache #13659

Closed spinmar closed 4 years ago

spinmar commented 6 years ago

Hi, we have a problem with our AMP live football page related to the Google AMP cache. We provide live coverage of Italian Serie A and Serie B football games. Users can follow the live coverage without refreshing the page: on the non-AMP page we handle this with JavaScript, while on the AMP page we use the amp-live-list tag with a refresh time of 20 seconds. If the user opens the AMP page directly, there is no problem and the refresh works well. Example:

https://sport.virgilio.it/dirette/live/serie-a/26-2-2018/cagliari-napoli/3835/amp/

However, if the user reaches the page from Google search results, and therefore through the Google cache, the page does not seem to refresh:

https://www.google.it/amp/s/sport.virgilio.it/dirette/live/serie-a/26-2-2018/cagliari-napoli/3835/amp/

Our server always replies with max-age:0, so Google should not cache the page and should always request it from our server. During a live match the AMP version of the page always lags behind the non-AMP page. Can someone explain to me where the problem is? Thanks

QES commented 6 years ago

I'm not sure what issue you are seeing. We have been running this for over 6 months on Sports Mole and it seems to work fine (some past examples can be found at https://amp.sportsmole.co.uk/football/live-commentary/ but they are only live while matches are in progress).

We use the amp-live-list element extensively and checked it was working when we first set it up; I haven't checked recently, though.

spinmar commented 6 years ago

Well, it seems that the Google AMP cache is not being updated with the real content. If you open the AMP page directly in a browser, everything is OK. It is very strange that we are the only ones seeing the problem.

erwinmombay commented 6 years ago

@spinmar that's interesting. The Google AMP cache should be no more than one version behind the origin content, and it gets updated once a user hits the cached page (and is up to date from then on).

seomaz commented 6 years ago

CORS problem https://www.ampproject.org/docs/guides/amp-cors-requests

QES commented 6 years ago

I have now checked and our pages are also not reloading on the google cache in the timely way they used to:

https://www.google.co.uk/search?ei=FaSVWqjLCYKvgAaM4YFI&q=live+commentary+espanyol+vs+real+madrid&oq=live+commentary+espanyol+vs+real+madrid&gs_l=psy-ab.3..35i39k1j0i30k1j0i8i30k1l3.7521.15040.0.15415.10.10.0.0.0.0.97.790.10.10.0....0...1c.1.64.psy-ab..1.7.541...0i7i30k1j0i8i7i30k1.0.vxSMuHjTLaQ

shows:

image

which goes to

https://www.google.co.uk/amp/s/amp.sportsmole.co.uk/football/real-madrid/live-commentary/live-commentary-espanyol-vs-real-madrid_319694.html

image

The page is updating quickly, but you can't refresh the Google Cache page, as you get:

image

I'm sure that when I originally tested this last year the behaviour was different.

This will be live for the next 2 hours.

QES commented 6 years ago

This is definitely a CORS issue having looked at a debug session on my phone:

image

My current headers are delivered by:

header("Access-Control-Allow-Origin: $HOST");
header("AMP-Access-Control-Allow-Source-Origin: $ORIGIN");
header("Access-Control-Allow-Methods: GET, POST, OPTIONS");
header("Access-Control-Allow-Credentials: true");
header("access-control-allow-headers: Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token");
header("access-control-expose-headers: AMP-Access-Control-Allow-Source-Origin, AMP-Redirect-To");

The image and some of the instructions imply that one needs to include the AMP cache domains in these headers. Unfortunately, there is no example of how this works, and the syntax described at

https://www.ampproject.org/docs/guides/amp-cors-requests

implies that it should deal with these domains automatically:

If the Origin header is set:

If the origin does not match one of the following values, stop and return an error response:

*.ampproject.org
*.amp.cloudflare.com
the publisher's origin (aka yours)
where * represents a wildcard match, and not an actual asterisk ( * ).

If the value of the __amp_source_origin query parameter is not the publisher's origin, stop and return an error response.

If the two checks above pass, process the request.
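The two checks quoted above could be sketched roughly as follows. This is only an illustration of the quoted steps for the case where the Origin header is set, not code from the AMP project: `PUBLISHER_ORIGIN`, `is_allowed_origin` and `validate_amp_cors_request` are hypothetical names, and the wildcard list contains only the domains quoted above.

```python
from urllib.parse import urlsplit

# Assumption: the publisher origin from this thread, used purely for illustration.
PUBLISHER_ORIGIN = "https://amp.sportsmole.co.uk"

# Wildcard suffixes taken from the list quoted above.
ALLOWED_CACHE_SUFFIXES = (".ampproject.org", ".amp.cloudflare.com")


def is_allowed_origin(origin: str) -> bool:
    """Return True for the publisher origin or an HTTPS host under a listed cache domain."""
    if origin == PUBLISHER_ORIGIN:
        return True
    parts = urlsplit(origin)
    host = parts.hostname or ""
    return parts.scheme == "https" and host.endswith(ALLOWED_CACHE_SUFFIXES)


def validate_amp_cors_request(origin: str, source_origin: str) -> bool:
    """Apply both checks: the Origin is allowed and __amp_source_origin is the publisher."""
    return is_allowed_origin(origin) and source_origin == PUBLISHER_ORIGIN
```

A server would return an error response whenever `validate_amp_cors_request` is false, and only process the request otherwise.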

The examples in the linked document show:


HTTP/2 200
access-control-allow-headers: Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token
access-control-allow-credentials: true
access-control-allow-origin: https://ampbyexample.com
amp-access-control-allow-source-origin: https://ampbyexample.com
access-control-allow-methods: POST, GET, OPTIONS
access-control-expose-headers: AMP-Access-Control-Allow-Source-Origin

So any ideas?

seomaz commented 6 years ago

@QES my current headers working:

access-control-allow-credentials:true
access-control-allow-origin:https://www-cibercuba-com.cdn.ampproject.org
access-control-expose-headers:AMP-Access-Control-Allow-Source-Origin
amp-access-control-allow-source-origin:https://www.cibercuba.com
amp-same-origin:true

web: https://www.cibercuba.com/actualidad

json: https://www.cibercuba.com/json/27061/nodes.json

erwinmombay commented 6 years ago

I don't believe CORS plays a part in this, as amp-live-list requests are proxied through the cache. Looking into this right now.

spinmar commented 6 years ago

Thanks very much for looking into it.

QES commented 6 years ago

@seomaz thanks that is exactly what I have (I think - but with my domains)

When this is cached by the cache the browser needs to know that it is OK - how do you include the Google Cache domains in the headers so that the browsers know to include them?

I think I have remembered what has changed since I confirmed it was working last year, which may be when this stopped working.

Since then we have added amp-access to all our pages, and amp-list (in addition to amp-live-list).

QES commented 6 years ago

So I have worked out that there are generic issues, which I can now see are caused by the amp-access element in the Google cache. For example, on:

https://www.google.co.uk/amp/s/amp.sportsmole.co.uk/football/arsenal/league-cup/live-commentary/live-commentary-arsenal-vs-man-city_319529.html

I get the error:

Failed to load https://amp.sportsmole.co.uk/amp_ping/82adlfDUPAWVBF7dAF0JXu7hPOhd8yjH9UV4EbGEvImp6oD518g2xlnlRTNs2eIB0.83148165848165421/?rid=82adlfDUPAWVBF7dAF0JXu7hPOhd8yjH9UV4EbGEvImp6oD518g2xlnlRTNs2eIB&uri=https%3A%2F%2Famp.sportsmole.co.uk%2Ffootball%2Farsenal%2Fleague-cup%2Flive-commentary%2Flive-commentary-arsenal-vs-man-city_319529.html&pass=PIK&_=0.43455036688366455&host=amp.sportsmole.co.uk&ht=https:&FB=OK&ref=https%3A%2F%2Famp.sportsmole.co.uk%2Ffootball%2Farsenal%2Fleague-cup%2Flive-commentary%2Flive-commentary-arsenal-vs-man-city_319529.html&dynamic&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk: The 'Access-Control-Allow-Origin' header has a value 'https://amp.sportsmole.co.uk' that is not equal to the supplied origin. Origin 'https://amp-sportsmole-co-uk.cdn.ampproject.org' is therefore not allowed access. Have the server send the header with a valid value, or, if an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

The underlying file being loaded has CORS headers of:

Access-Control-Allow-Credentials:true
access-control-allow-headers:Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token
Access-Control-Allow-Methods:GET, POST, OPTIONS
Access-Control-Allow-Origin:https://amp.sportsmole.co.uk
access-control-expose-headers:AMP-Access-Control-Allow-Source-Origin, AMP-Redirect-To
AMP-Access-Control-Allow-Source-Origin:https://amp.sportsmole.co.uk

What looks odd to me in the above is that the error names:

https://amp-sportsmole-co-uk.cdn.ampproject.org

as the problem origin but the URL being loaded from is

https://www.google.co.uk/amp/s/amp.sportsmole.co.uk/football/arsenal/league-cup/live-commentary/live-commentary-arsenal-vs-man-city_319529.html

That looks like the old domain nomenclature, but I'm not sure how it is being used here.

spinmar commented 6 years ago

Meanwhile, I have updated the CORS response headers. Let's see if that solves the problem.

seomaz commented 6 years ago

@QES you need to add this domain (see my example): https://amp-sportsmole-co-uk.cdn.ampproject.org

QES commented 6 years ago

@seomaz but does that still work on the original origin?

access-control-allow-credentials:true
access-control-allow-origin:https://www-cibercuba-com.cdn.ampproject.org
access-control-expose-headers:AMP-Access-Control-Allow-Source-Origin
amp-access-control-allow-source-origin:https://www.cibercuba.com
amp-same-origin:true

My AMP pages are served on amp.sportsmole.co.uk and users view them at that location. If I change access-control-allow-origin from amp.sportsmole.co.uk, will it still work correctly when viewed from that origin? And what if the page is being viewed from a different cache?

Is it possible to have multiple domains in access-control-allow-origin?

For example, can you have:

access-control-allow-origin: https://amp.sportsmole.co.uk https://amp-sportsmole-co-uk.cdn.ampproject.org

Is that valid, and do you also need to include other potential caches, e.g. Cloudflare?

QES commented 6 years ago

I have just checked and if I set it to

access-control-allow-origin: https://amp-sportsmole-co-uk.cdn.ampproject.org

Then it breaks when loaded from amp.sportsmole.co.uk

spinmar commented 6 years ago

@QES I think that access-control-allow-origin should be set to the value of the Origin request header (when amp-same-origin is not set and the origin is in the allowed domains).

https://www.ampproject.org/docs/guides/amp-cors-requests
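A minimal sketch of that idea, assuming a server that knows its own publisher origin and the cache origins it accepts. The `ALLOWED_ORIGINS` set and the `build_cors_headers` helper are hypothetical names for illustration only. Because browsers compare Access-Control-Allow-Origin against the requesting origin as a single value, the sketch echoes back one allowed origin per response rather than sending a space-separated list.

```python
from typing import Dict, Optional

# Assumptions for illustration: the publisher origin and the cache origin from this thread.
PUBLISHER_ORIGIN = "https://amp.sportsmole.co.uk"
ALLOWED_ORIGINS = {
    PUBLISHER_ORIGIN,
    "https://amp-sportsmole-co-uk.cdn.ampproject.org",
}


def build_cors_headers(request_origin: Optional[str]) -> Optional[Dict[str, str]]:
    """Echo an allowed Origin back; return None when the request should be rejected."""
    if request_origin not in ALLOWED_ORIGINS:
        return None
    return {
        # Exactly one origin: the one that made this particular request.
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        # Always the publisher's own origin, whichever cache served the page.
        "AMP-Access-Control-Allow-Source-Origin": PUBLISHER_ORIGIN,
        "Access-Control-Expose-Headers": "AMP-Access-Control-Allow-Source-Origin",
    }
```

With this shape the same endpoint can answer requests from the origin page and from the cached page, since the header value follows whichever origin made the request.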

seomaz commented 6 years ago

@QES add this:

access-control-allow-credentials:true
access-control-allow-origin:https://amp-sportsmole-co-uk.cdn.ampproject.org
access-control-expose-headers:AMP-Access-Control-Allow-Source-Origin
amp-access-control-allow-source-origin: https://amp.sportsmole.co.uk
amp-same-origin:true

QES commented 6 years ago

Reading more about this the problem is that I have to change the

access-control-allow-origin

depending on if it is loaded from

amp.sportsmole.co.uk OR amp-sportsmole-co-uk.cdn.ampproject.org OR amp.sportsmole.co.uk.amp.cloudflare.com etc
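For reference, the Google cache host names seen in this thread can be derived from the publisher host with a simple substitution. The sketch below is a simplified version that assumes plain ASCII host names of ordinary length; the full AMP Cache URL format has additional rules (for example for internationalized domains) that are ignored here, and both function names are hypothetical.

```python
def simple_cache_subdomain(publisher_host: str) -> str:
    """Simplified mapping: every '-' becomes '--', then every '.' becomes '-'.

    Assumption: an ASCII host name short enough to fit in one DNS label.
    """
    return publisher_host.replace("-", "--").replace(".", "-")


def google_cache_origin(publisher_host: str) -> str:
    """Build the Google AMP cache origin for a publisher host."""
    return f"https://{simple_cache_subdomain(publisher_host)}.cdn.ampproject.org"


print(google_cache_origin("amp.sportsmole.co.uk"))
print(google_cache_origin("www.cibercuba.com"))
```

The two calls reproduce the cache origins that appear in the headers earlier in this thread.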

Now, I think this should be set in the __amp_source_origin parameter that the cache sends on the URL, but in the example I looked at in developer mode that did not seem to be the case.

What is correct is that the URL itself:

https://amp-sportsmole-co-uk.cdn.ampproject.org/v/s/amp.sportsmole.co.uk/football/chelsea/transfer-talk/news/dortmund-preparing-to-miss-out-on-batshuayi_319745.html?amp_js_v=0.1&usqp=mq331AQECAEYAQ%3D%3D#origin=https%3A%2F%2Fwww.google.co.uk&prerenderSize=1&visibilityState=visible&paddingTop=54&p2r=0&horizontalScrolling=0&csi=0&viewerUrl=https%3A%2F%2Fwww.google.co.uk%2Famp%2Fs%2Famp.sportsmole.co.uk%2Ffootball%2Fchelsea%2Ftransfer-talk%2Fnews%2Fdortmund-preparing-to-miss-out-on-batshuayi_319745.html&history=1&storage=1&cid=1&cap=swipe%2CnavigateTo%2Ccid%2Cfragment%2CreplaceUrl

Seems to say #origin=https%3A%2F%2Fwww.google.co.uk which isn't useful either and not correct.

Meanwhile, the problem call (to the amp-access JSON endpoint) is

https://amp.sportsmole.co.uk/amp_ping/82adlfDUPAWVBF7dAF0JXu7hPOhd8yjH9UV4EbGEvImp6oD518g2xlnlRTNs2eIB0.159305393002102671/?rid=82adlfDUPAWVBF7dAF0JXu7hPOhd8yjH9UV4EbGEvImp6oD518g2xlnlRTNs2eIB&uri=https%3A%2F%2Famp.sportsmole.co.uk%2Ffootball%2Fchelsea%2Ftransfer-talk%2Fnews%2Fdortmund-preparing-to-miss-out-on-batshuayi_319745.html&pass=PIK&_=0.2796621366419336&host=amp.sportsmole.co.uk&ht=https:&FB=OK&ref=https%3A%2F%2Famp.sportsmole.co.uk%2Ffootball%2Fchelsea%2Ftransfer-talk%2Fnews%2Fdortmund-preparing-to-miss-out-on-batshuayi_319745.html&dynamic&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

Is using:

__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

And so it breaks because the ORIGIN is

https://amp-sportsmole-co-uk.cdn.ampproject.org
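The mismatch can be reproduced with a few lines of standard-library Python. This is a sketch only: the URL is a shortened, hypothetical version of the amp-access request above, and `amp_source_origin` is an illustrative helper name.

```python
from urllib.parse import parse_qs, urlsplit


def amp_source_origin(request_url: str) -> str:
    """Return the decoded __amp_source_origin query parameter, or '' if absent."""
    query = parse_qs(urlsplit(request_url).query)
    values = query.get("__amp_source_origin", [])
    return values[0] if values else ""


# Shortened, hypothetical version of the amp-access request above.
url = (
    "https://amp.sportsmole.co.uk/amp_ping/?host=amp.sportsmole.co.uk"
    "&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk"
)
source_origin = amp_source_origin(url)

# Value of the Origin request header when the page is served from the cache.
request_origin = "https://amp-sportsmole-co-uk.cdn.ampproject.org"

# The source origin is the publisher; the request origin is the cache.
print(source_origin)
print(request_origin)
print(source_origin == request_origin)
```

So __amp_source_origin identifies the publisher, while the value the browser checks against Access-Control-Allow-Origin is the cache origin sent in the Origin request header.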

spinmar commented 6 years ago

It seems that I have solved my problem by handling the AMP response headers correctly. I followed the instructions on this page https://www.ampproject.org/docs/guides/amp-cors-requests and now the Google AMP cache seems to have the updated version of the page.

QES commented 6 years ago

Hi @spinmar, can you access your AMP pages directly on your origin, and does it work both on the origin and on the Google AMP cache? If so, how did you collect the information needed to set the headers that differ between these two scenarios?

QES commented 6 years ago

Hi @spinmar @seomaz @erwinmombay, after much experimenting I think I have worked this out. I do think the documentation at https://github.com/ampproject/amphtml/blob/master/spec/amp-cors-requests.md, while improved, could be better: not because it is wrong, but because the text doesn't make it clear that you need to check and use the Origin request header for this.

With the benefit of hindsight and knowing what I was looking for I found the message in the documentation that is key to making this work.

I think this needs something that highlights the issue in a more forceful way as there are forever people having CORS issues.

QES commented 6 years ago

Having solved the CORS issues, I am now looking at the amp-live-list issue which this thread was originally about.

This is STILL an issue. An example of a page that updates with amp-live-list is:

https://www.google.co.uk/amp/s/amp.sportsmole.co.uk/live-scores/

When there are games active, the times update every minute or so. These update seamlessly on the origin version:

https://amp.sportsmole.co.uk/live-scores/

Checking the console and the network traffic, the cache version fetches the following file every 15 seconds:

https://amp-sportsmole-co-uk.cdn.ampproject.org/v/s/amp.sportsmole.co.uk/live-scores/?amp_js_v=0.1&usqp=mq331AQECAEYAQ%3D%3D&amp_latest_update_time=1520011904&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

while the ORIGIN version is getting

https://amp.sportsmole.co.uk/live-scores/?amp_latest_update_time=1520012676&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

The issue that I notice is that the Request Headers are different and do not include an Origin.

:authority:amp-sportsmole-co-uk.cdn.ampproject.org
:method:GET
:path:/v/s/amp.sportsmole.co.uk/live-scores/?usqp=mq331AQECAEYAQ%3D%3D&amp_js_v=0.1&amp_latest_update_time=1520012933&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk
:scheme:https
accept:text/html
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9
amp-same-origin:true
cache-control:no-cache
cookie:AMP_CANARY=1; AMP_EXP=amp-date-picker
dnt:1
pragma:no-cache
referer:https://amp-sportsmole-co-uk.cdn.ampproject.org/v/s/amp.sportsmole.co.uk/live-scores/?usqp=mq331AQECAEYAQ%3D%3D&amp_js_v=0.1
user-agent:Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1

QES commented 6 years ago

Looking further at this, the called URL

https://amp-sportsmole-co-uk.cdn.ampproject.org/v/s/amp.sportsmole.co.uk/live-scores/?usqp=mq331AQECAEYAQ%3D%3D&amp_js_v=0.1&amp_latest_update_time=1520015843&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

returns an empty page with no data when loaded from the cache, but

https://amp.sportsmole.co.uk/live-scores/?usqp=mq331AQECAEYAQ%3D%3D&amp_js_v=0.1&amp_latest_update_time=1520015843&__amp_source_origin=https%3A%2F%2Famp.sportsmole.co.uk

loads the page as expected. I'm not sure what the cache is doing at this point, whether it is looking for changes in some way or doing something clever.

spinmar commented 6 years ago

@QES Sorry for the late answer, but I didn't see your request. On my live AMP page I solved the problem by setting the AMP response headers correctly. It works both in the Google AMP cache and directly. I followed the pseudocode here:

https://www.ampproject.org/docs/guides/amp-cors-requests

to set the response header.

QES commented 6 years ago

@spinmar hi, thanks. I eventually managed to work out the correct incantation for this, which seems to be overly obfuscated in the documentation. It boils down to:

amp-access-control-allow-source-origin must be YOUR origin (the publisher domain).
access-control-allow-origin must be the origin the file is loaded from (and that must be on an approved list of domains).

However, the complication is working out how the cache tells the server which origin that is. You can check the request headers, either Origin or Referer depending on which is set. To me that isn't made clear enough, but once it is explained it becomes obvious :)
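One way to sketch the "which origin is asking" step is shown below. It follows the reading given earlier in this thread (use the Origin request header, and fall back to amp-same-origin for same-origin requests that carry no Origin header) rather than a Referer check. `requesting_origin` and `PUBLISHER_ORIGIN` are hypothetical names, and the fallback behaviour is an assumption about how a publisher server might treat such requests.

```python
from typing import Mapping, Optional

PUBLISHER_ORIGIN = "https://amp.sportsmole.co.uk"  # assumption: example publisher


def requesting_origin(headers: Mapping[str, str]) -> Optional[str]:
    """Work out which origin a request claims to come from.

    Header names are matched case-insensitively. Returns None when neither
    an Origin header nor 'AMP-Same-Origin: true' is present.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    origin = lowered.get("origin")
    if origin:
        return origin
    if lowered.get("amp-same-origin", "").lower() == "true":
        # No Origin header: treat the request as coming from the publisher itself.
        return PUBLISHER_ORIGIN
    return None
```

The returned value would still need to be checked against the approved list of domains before it is used in access-control-allow-origin.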

Thanks for the feedback and help.

Unfortunately, there is another problem: the amp-live-list is not actually getting any data in the subsequent calls to the origin.

access-control-allow-credentials:true
access-control-allow-origin:https://amp-sportsmole-co-uk.cdn.ampproject.org
access-control-expose-headers:AMP-Access-Control-Allow-Source-Origin
amp-access-control-allow-source-origin: https://amp.sportsmole.co.uk
amp-same-origin:true

QES commented 6 years ago

@spinmar are you seeing your pages load the amp-live-list updates successfully on the google cache?

I'm still not seeing pages being loaded (and I'm not getting underlying CORS errors any more) :)

ampprojectbot commented 6 years ago

This issue doesn't have a category which makes it harder for us to keep track of it. @erwinmombay Please add an appropriate category.

ampprojectbot commented 6 years ago

This is a high priority issue but it hasn't been updated in awhile. @erwinmombay Do you have any updates?

erwinmombay commented 5 years ago

This bug is tracked internally as b/124009333, as it is most likely a cache problem.

aghassemi commented 5 years ago

assigning to @twifkak based on b/124009333

xavierleune commented 5 years ago

Hi @twifkak, any news on this issue? Thanks

twifkak commented 5 years ago

@xavierleune There's been some internal design/research, but it's still pending some work to figure out how to do it without causing some other problems. Unfortunately nothing to announce yet.

xavierleune commented 5 years ago

@twifkak thanks for your feedback. Do you have any details to share about the root causes? As far as I can see, only a small number of publishers are suffering from this issue. Maybe there is some workaround we can apply on our side to prevent it?

twifkak commented 5 years ago

@xavierleune Can't share many details. There are multiple caching tiers that are contributing to the problem, and different solutions to different tiers. You can help reduce problems from one of the tiers by making sure you specify a cache-control with a short s-maxage (or no s-maxage and a short max-age). That will not fully prevent it.
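To check what a given header asks of a shared cache, the precedence mentioned above (s-maxage if present, otherwise max-age) can be sketched as follows. `shared_cache_ttl` is a hypothetical helper, and the parsing is deliberately simplified (no quoted commas or other edge cases).

```python
from typing import Optional


def shared_cache_ttl(cache_control: str) -> Optional[int]:
    """Return the freshness lifetime in seconds that a shared cache would use.

    s-maxage takes precedence over max-age for shared caches; returns None
    if neither directive carries a numeric value.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value.strip('"')
    for key in ("s-maxage", "max-age"):
        if directives.get(key, "").isdigit():
            return int(directives[key])
    return None
```

For example, a live page sending `max-age=60, s-maxage=15` (hypothetical values) would ask shared caches for 15 seconds of freshness. As noted above, a short value helps with one caching tier but does not fully prevent the problem.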