race-condition: cache update causing app crashes

jliebrand commented 6 years ago

Hi there - this bug is a spin off from https://github.com/facebook/create-react-app/issues/3688

It's been a little hard to reproduce this problem successfully, but it's something our client side logging is showing to happen quite often (for a percentage of users every time we deploy a new app).

tl;dr; - it appears the cache/app can get updated such that network requests for JS are made that no longer exist, causing the client to attempt to parse the 404.html the server returns; making everything fall on its *rse.

background:

we run a react app (built with create-react-app) out of an S3 bucket
we deploy with "aws s3 sync --delete", meaning the old files are removed on deploy
there appear to be cases where both for the main bundle as well as for chunks, the browser falls over due to receiving 404.html responses for requests to outdated JS files.

I think what is going on is described in the below sequence diagram. But since this is a race condition, it's very hard to nail down specifically. I know that once the sw-cache is updated, we ought to show a banner to the user to refresh, but that's not necessarily the solution. As the user might not want to refresh, or more specifically, any event could trigger the loading of a chunk that coincides with us showing that banner.

Would it not be better to not "replace" the existing cache, but create a new cache; leaving the in-memory app to use the old cache until the user refreshes (at which point the old cache can get removed)? That means you can still show the banner to reload, but the app won't fail (nor make requests to the network for outdated files) ?

Then again, perhaps I'm wrong about how the cache is updated today, so please correct me if that's the case. But we (and others) do see these "unexpected token <" errors stemming from the client trying to parse 404.html files as JS...

race-condition

marc1161 commented 6 years ago

Having the same issue...

jeffposnick commented 6 years ago

By default, sw-precache will call skipWaiting(), which will be more aggressive in invalidating previously cached entries and moving new clients to the latest version of the cache. This can lead to issues in lazy-loading scenarios.

You can set skipWaiting: false in your build config to disable this.

I know that this came up from the context of create-react-app, in which you don't easily have control over the underlying sw-precache config. I've had an open PR against c-r-a for a while now which will set skipWaiting: false (among other changes), and that will hopefully get merged in for the next major c-r-a release.

marc1161 commented 6 years ago

@jeffposnick I tried to add skipWaiting to false on my webpack prod and still the same problem... I think the problem comes from the Web Sever serving the static website. In my case, Cloudfront. Anyway, no clue on how to solve it.. I tried to add no cahce flags on both service-worker.js and index.html on S3 and adding the Use Origin Cache Headers on Cloudfront and it did not work. Furthermore, I tried to customize the TTL on Cloudfront setting it to 0 and still does not work... From my perspective I believe it is something wrong with service workers and the way it is handled in different browsers.. (e.g. not having the same behaviour on Chrome and on Edge, IE).

jliebrand commented 6 years ago

@marc1161 sounds like you have a far more reproducible case than I do, which is great. Just to confirm: if you disable the SW all together: does that "solve" the issue? It'd be good to confirm it's the SW cache that's causing the problems and not the underlying browser's cache itself

jliebrand commented 6 years ago

Also: do you have one main js bundle, or also chunks?

marc1161 commented 6 years ago

What i did was to unregister the service worker from my main js file. And I have one main js bundle with a hash so every time I make a new build I get a different js filename. I am still reproducing some problems by the way...

jliebrand commented 6 years ago

You're still able to reproduce the problem even with the SW unregistered @marc1161 ? If that's true, it would appear to be a mismatch between the browser's own cache and S3?

marc1161 commented 6 years ago

Yes exactly. That's the problem. The problem is a missmatch between browsers cache and Cloudfront cache not S3. I am using Cloudfront cache with TTL=0. This way it should not cache anything... Still, do not know how to solve it. I thought unregistering the sw would make the trick but it did not.

jliebrand commented 6 years ago

@jeffposnick correct me if I'm wrong, but the SW cache doesn't change anything wrt the browser's own caching does it?

Or rather, when the browser makes the request it is first parsed by the SW, but at what point (if any?) does the browser check it's own cache? Could all of this be a problem with the normal cache rather than the SW cache?

(regardless, the skipWaiting is still a good flag to have, but I'm trying to understand if the issues I'm seeing are even related to SW at all)

jeffposnick commented 6 years ago

When lazy-loading versioned assets, you can run into problems due to those assets being missing from the remote server due to a number of circumstances. This is unfortunately a common problem that extends beyond the realm of service worker caching, and which doesn't get a great deal of attention. (I'm going to CC: @malchata here, as he's been working on some lazy-loading guidance for https://developers.google.com/web/)

Here are some lazy-loading scenarios in which things might go wrong:

You're using a cache-first service worker that calls skipWaiting(). The old HTML is loaded cache-first in response to a navigation, and it contains references to versioned assets that used to exist alongside the HTML in the service worker's cache, but which were expunged when the service worker transitioned from installed to activated (bypassing waiting). The workaround here is not to call skipWaiting(), which will keep the versioned assets that match up with the HTML available in the service worker cache (even if they've been deleted from the server).
You're not using service workers at all, but your app is long-lived, and a user has a tab open from, say, 24 hours ago. You redeploy your site in the meantime, and when the existing tab tries to lazy-load the versioned asset for a new route, it requests a resource from the server that no longer exists.
You're not using service workers at all, but you are opting in to caching of HTML content, either implicitly (i.e. GitHub pages caches all URLs, including HTML documents, for 10 minutes) or explicitly (by configuring a CDN to cache HTML). This can lead to a user being served an out-of-date HTML document which references versioned assets which no longer exist at the remote server, and which might not exist in the CDN proxy cache either.

jliebrand commented 6 years ago

Great summary @jeffposnick ... I'm going to spend some time adding more logging in our app to identify which of the various scenarios is the one that is tripping us up the most.

Ultimately, considering a mismatch simply can happen, I'm going to see how best to guard against this and tell the user to refresh... With the browser attempting to parse an 404.html as JS and barfing on unexpected tokens, this might be hard to catch :-(

voshawn commented 6 years ago

@jliebrand were you able to resolve this issue?

From AWS's docs, it looks like if your Minimum TTL is 0, but you don't set a cache-control max-age header, it will be up to the browser to handle caching.

Therefore, I have modified my deployment to aws s3 sync --delete --cache-control max-age=0 build/, thereby explicitly setting max-age to 0. This should eliminate the browser being out of sync.

Let me know if this works for you!

jliebrand commented 6 years ago

Hey @voshawn - sweet, I'll give that a go and see if that improves things.

I implemented a different work around, which basically ensures that my top level index.html handles the error event on the window and if it's anything like "unexpected token <" or "Loading chunk \d failed" then I unhide a div on the page which simply says "The app has been updated, please reload"... assuming that the sw in the background updates, then clicking the reload link "fixes things"...

Bit of a kludge, but it works and handles the other edge cases as well... Still, I'll give your solution a go and see if that reduces the number of times this happens -thanks

voshawn commented 6 years ago

Hi @jliebrand,

Yeah, that makes sense. I was thinking about doing something like that too as a backup. Would you mind sharing a snippet of how your workaround works?

I'm not sure how I would check for the "unexpected token <" in index.html without potentially loading the js twice.

jliebrand commented 6 years ago

Sure, it's a dirty hack, but it works.... Just add this in your main index.html / template / whatever you have. (It will also fire on other script errors, but hey ho)

  <body>
    <script type="text/javascript">
      window.onerror = function(err) {
        if (err === 'Script error.' ||
            /Loading chunk \d failed./.test(err) ||
            err === 'Uncaught SyntaxError: Unexpected token <') {
          const t = document.getElementById('script-failed');
          if (t) {
            t.style.display = 'block';
          }
        }
      };
    </script>
    <div id="script-failed" style="display: none" class="reload-please">The app has been updated. Please <a href="/reload">reload this page</a>.</div>
    ...

(I used /reload as a dedicated end point so that I can measure how many people go through this flow - it just redirects to /)

voshawn commented 6 years ago

Thanks! I implemented that. It appears your workaround is the best solution for now.

I am still occasionally seeing the "loading chuck" error so I'm not sure my fix worked.

plegner commented 6 years ago

I don't think browsers should parse JS files that were returned with a 404 response. Is it possible that you're intercepting the response in the service worker, and resetting the status code to 200 when forwarding it to the page?

(Of course, this does not solve the problem of requesting a file that no longer exists – but that would also happen without service workers.)

jliebrand commented 6 years ago

@plegner no the issue is that the server is returning a 404.html page with a 200 http code... Because that's the easiest way to deal with SPA being hosted on S3...

stewartmegaw commented 6 years ago

I have the same issue. Implementing the dirty fix from jliebrand above.

GoogleChromeLabs / sw-precache

race-condition: cache update causing app crashes #351