gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License

Why are page-data.json files not content hashed? #15080

Closed mmv08 closed 4 years ago

mmv08 commented 5 years ago

Hi Gatsby team :) We've just spent a few hours debugging an issue with content not appearing on the website, and it turned out to be caused by a cached page-data.json.

Why does it not use a content hash? Should caching be handled another way?

marco-th commented 5 years ago

@mikheevm The Gatsby Team describes in this blog post why they are not hashing the page-data.json files anymore.

I am also running into some issues because of this. Is there a way to enable the hashing of the page-data.json files again?

gaplo917 commented 5 years ago

There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton

Breakage happens if the new version of a .md file is incompatible with the previous one (e.g. a changed frontmatter schema), or if the new React component is incompatible with a previously cached page-data.json.

Which leads me to suggest:

Gatsby should have a config option that lets users toggle the page-data.json hashing feature in cache-friendly environments. Once a site's development is stable (only content updates, no React code changes), the user can switch the feature off for the performance gain.

For now, I have to downgrade to 2.8.8 for rapid agile development.

KyleAMathews commented 5 years ago

@gaplo917 how do you know you're seeing this error? We include a build hash in each page-data.json so if a client loads a json file and sees it came from a newer build, it'll force a full refresh to get the new component code.

If that's not working that's a bug we need to fix. More details about what's happening and reproduction instructions would be great!

KyleAMathews commented 5 years ago

Also, in general you should only be caching files that are in the static folder and .js files, as described in the caching docs: https://www.gatsbyjs.org/docs/caching/#static-files

The new JSON files shouldn't be cached. If you're running into trouble, check your cache-control settings.

gaplo917 commented 5 years ago

Also, in general you should only be caching files that are in the static folder and .js files, as described in the caching docs: https://www.gatsbyjs.org/docs/caching/#static-files

The new JSON files shouldn't be cached. If you're running into trouble, check your cache-control settings.

Thanks for your advice! I had assumed that everything could be cached, because Gatsby 2.8.8 generated a hash for every page-data.json. That guaranteed that new/old HTML would load the matching new/old/same JSON.

Will try your advice and see the result.

leonfs commented 5 years ago

Thank you, everybody, for bringing this problem to attention.

It has also been debated in the original PR thread (https://github.com/gatsbyjs/gatsby/pull/14359), with some great contributions on why the manual cache-invalidation technique is not really a great solution.

Our team has been working on a small script (run from Gatsby's onPostBuild hook) that appends a hash to all page-data.json files and updates the references. So far it has proven to work, but we haven't pushed it to our production site because we've been experiencing other problems with gatsby-plugin-sharp and the newer versions of Gatsby (a different issue).

If anyone is interested, we could share our approach.

mmv08 commented 5 years ago

@leonfs It would be really great if you shared it :)

harrygreen commented 5 years ago

@mikheevm something along the lines of...

const fs = require("fs").promises;
const glob = require("glob");
const md5 = require("md5");
const path = require("path");

exports.onPostBuild = async () => {
  const publicPath = path.join(__dirname, "public");
  const hash = md5("replace-with-your-own-hash");

  // Rename every page-data.json to a hashed filename.
  const jsonFiles = glob.sync(`${publicPath}/page-data/**/page-data.json`);
  console.log("[onPostBuild] Renaming the following files:");
  for (const file of jsonFiles) {
    console.log(file);
    const newFilename = file.replace(`page-data.json`, `page-data.${hash}.json`);
    await fs.rename(file, newFilename);
  }

  // Rewrite references to page-data.json in the generated HTML and JS.
  const htmlAndJSFiles = glob.sync(`${publicPath}/**/*.{html,js}`);
  console.log("[onPostBuild] Replacing page-data.json references in the following files:");
  for (const file of htmlAndJSFiles) {
    const stats = await fs.stat(file);
    if (!stats.isFile()) continue;
    console.log(file);
    const content = await fs.readFile(file, "utf8");
    const result = content.replace(/page-data\.json/g, `page-data.${hash}.json`);
    await fs.writeFile(file, result, "utf8");
  }
};

Disclaimer! As @leonfs points out, this seems to work for us - but it's obviously a hack to suit our caching implementation. No guarantee it'll work for anyone else.

-- Edited code, as per https://github.com/gatsbyjs/gatsby/issues/15080#issuecomment-514221990.

leonfs commented 5 years ago

If anyone else tries this approach, it would be great if you could add a comment with your results.

city41 commented 5 years ago

I have been hitting this issue with a Gatsby site that is deployed to Github Pages. So I unfortunately have no control over how the server decides to set caching headers on page-data.json.

In my case, after a new deploy, stale page-data.json data gets used because it was cached, and you see the page flicker from the new data back to the old once the page-data.json request returns.

city41 commented 5 years ago

I added the onPostBuild script that @harrygreen posted (thanks for that). But I am also finding that the root app-<sha>.js file does not update correctly.

If I:

  1. gatsby build
  2. note the name of app-sha.js inside public, currently it is app-560e4b2f43729239ce7d.js for me
  3. add some new data such that a new build would yield different html
  4. gatsby build again
  5. note the name of app-sha.js again. I expect it to be different, but it is still app-560e4b2f43729239ce7d.js

So since the file name does not update, my browser uses the old cached version, which loads the wrong page-data.json file.

city41 commented 5 years ago

btw @harrygreen, I think it should be if (!stats.isFile()) continue;, as the return exits the function.

wardpeet commented 5 years ago

If this works and it's widely used, feel free to create a plugin that changes this behaviour. We're happy to take new plugins 🎉 https://www.gatsbyjs.org/contributing/submit-to-plugin-library/

harrygreen commented 5 years ago

@city41 Glad it could help you somewhat. We're not using it in production yet because of a different but consistent failure inside gatsby-plugin-sharp (also introduced with the new updates).

As for this:

btw @harrygreen, I think it should be if (!stats.isFile()) continue;, as the return exits the function.

We do want the inverse of your suggestion though; we only want to rename files, not folders. We want to exit the function early if it's not a file.

Is the app-<sha>.js issue a result of my code, or unrelated?

@wardpeet Thanks for the tip. My code really is a hack against Gatsby, so I'm reluctant to publish anywhere.. but maybe one day. I'd much rather page-data.[hash].json :)

city41 commented 5 years ago

@harrygreen using return instead of continue means as soon as you hit a directory, you stop processing files. continue causes the for loop to move onto the next iteration, so the next file can get a chance to process.

app-<sha>.js is not a result of your code, but it is related. It's a file that ideally needs a hash in its name because otherwise it will get cached just like page-data.json files did. It used to not matter that it got cached, because it only had page-data.json embedded in it. Now with the hash fix, it has page-data.<hash>.json embedded in it, thus using a cached version is now bad.

I agree this shouldn't be a plugin. I honestly feel like this is a bug in Gatsby.

harrygreen commented 5 years ago

@city41 good spot 🤜 (it was originally inside a function, but I switched to for (let .. of)). Thanks - will update the snippet.

The issue is Gatsby's rule not to cache certain assets, e.g. HTML and now this JSON (both of which require a lockstep relationship). Now that the JSON hash is gone, some flexibility over the possible caching strategies has been lost. This comment sums up the issue if cache-clearing is required on a CDN.

wardpeet commented 5 years ago

@city41 it's not a Gatsby bug; page-data.json shouldn't be cached. It gives us the opportunity to build pages separately and only change what's needed.

I do agree that each new version of app-560e4b2f43729239ce7d.js should get a new hash if things changed. So that's definitely a bug. Do you mind opening a new one with this information?

see https://github.com/gatsbyjs/gatsby/issues/15080#issuecomment-506954877

city41 commented 5 years ago

I think app-sha.js is built by webpack? From its perspective, it hasn't changed. So that one might be trickier.

Maybe bug is too strong of a word? Maybe this should be an option the user can opt into? It makes using Gatsby when you don't have access to the server difficult. Gatsby is a great choice when you only have static hosting (like gh-pages), but this caching issue makes it unusable in those scenarios without the onPostBuild workaround.

If Gatsby fingerprinted page-data.json files, then app-sha.js would probably naturally get a change, and webpack would re-fingerprint it I imagine.

jaredsilver commented 5 years ago

I agree with @city41 and many of the other folks who have chimed in here. We had a long discussion internally, and our consensus is that without being able to host everything at the edge cache level, using Gatsby as opposed to something like Next.js serves very little purpose. The entire point of rendering at build time is that we don't need to have a server and can instead host the built files directly on the edge. If files like page-data.json are not hashed, it eliminates the ability to do that because of the cache invalidation issues referenced by @roylines and @leonfs in this thread: https://github.com/gatsbyjs/gatsby/pull/14359.

For our purposes, we already were serving index pages from an nginx server and using assetPrefix for various other reasons for the time being, so we just modified @harrygreen's script to work for that use case (see below).

Going forward, however, we will need to reevaluate the value of a tool like Gatsby compared to a tool like Next, given that build-time rendering introduces a whole host of issues we need to work around and very few benefits over traditional server side rendering if the files have to be hosted on a server instead of on the edge.

Happy to chat more about this and alternative approaches -- we love Gatsby and are very grateful for the amazing work y'all are doing!

Here's the gatsby-node.js script to enable hosting the page-data files on a server while using assetPrefix to host assets in edge cache:

const path = require('path');
const fs = require('fs').promises;
const glob = require('glob');

exports.onPostBuild = async () => {
  const publicPath = path.join(__dirname, 'public');

  // Rewrite the prefixed page-data references so they are served from the
  // origin server instead of the edge-cached asset host.
  const htmlAndJSFiles = glob.sync(`${publicPath}/**/*.{html,js}`);
  console.log(
    '[onPostBuild] Replacing page-data.json references in the following files:'
  );
  for (const file of htmlAndJSFiles) {
    const stats = await fs.stat(file);
    if (!stats.isFile()) continue;
    console.log(file);
    const content = await fs.readFile(file, 'utf8');
    const result = content.replace(
      /https:\/\/static\.datacamp\.com\/page-data/g,
      '/page-data'
    );
    await fs.writeFile(file, result, 'utf8');
  }
};

xavivars commented 5 years ago

It seems that the number of people using Gatsby in a way that needs hashing (as a much simpler way to manage CDNs and new deployments) is actually pretty big.

Release 2.9 was a huge step forward for Gatsby, increasing performance for big sites. Thanks a lot for that amazing work!

But the side effects of that release (it's still impossible to fully disable client-side navigation, and there are issues with easy deployments to S3 + CloudFront) clearly make 2.9 a non-backwards-compatible release, with quite a lot of people now locked to 2.8.x due to all those issues.

Would it make sense to introduce @jaredsilver's solution in core, or via a generic plugin (supported by the Gatsby project itself)? That would make Gatsby 2.9+ closer to a real 2.8-compatible version.

lifehome commented 5 years ago

I propose introducing the solution by @jaredsilver into the core, as it would allow easy cache invalidation on the edge and a really streamlined, rapid deployment of the site, without all the waiting on a cache that is still alive somewhere else.

Also, this would allow us to literally cache everything, decreasing the impact on the origin and maximizing the use of the CDN.

kimbaudi commented 5 years ago

Hi, I noticed that my GatsbyJS website would show a blank screen and become really unresponsive (for approximately 20-50 seconds) on some mobile browsers (Chrome and Opera) after performing a hard refresh. Inspecting the network using Chrome DevTools indicates that the delay comes from page-data.json.

On page refresh, the request for page-data.json is pending:

[screenshot: issue1]

It took about 50 seconds for the blank screen to go away and the page to refresh.

[screenshot: issue2]

I've already mentioned this in https://github.com/gatsbyjs/gatsby/issues/11006#issuecomment-517893648.

I'm not sure if adding a hash to page-data.json filename would resolve this issue, but I wanted to share an issue I am facing that is causing page-data.json to load really slowly or not at all.

KyleAMathews commented 5 years ago

@jaredsilver where do you host your site? Perhaps there's a misunderstanding here? I'm not sure what you mean by "edge host". Any CDN can work with the changes in 2.9 -- you just need to set the cache-control settings correctly. Is there something we can document for you?

KyleAMathews commented 5 years ago

Perhaps the reason we overlooked this is that most CDNs designed to serve static sites already handle this correctly, i.e. they serve a file from the edge until a new build invalidates it. Netlify/Zeit/Firebase etc. work this way. Most/all CDNs have a purge or invalidation API that can be set up to do this as well.

What CDNs are y'all using? Let's put together some docs on setups.

This does complicate setup, I agree, and can be error-prone. If you haven't had a chance to read about why we needed to remove the page manifest (and hence the hashed data files), please read the blog post: https://www.gatsbyjs.org/blog/2019-06-12-performance-improvements-for-large-sites/

Happy to dive more into the rationale behind it.

jaredsilver commented 5 years ago

I was in the middle of writing out a long response detailing why you're wrong when I realized that I am wrong 😄

If we use a combination of must-revalidate and max-age=0, CloudFront's edge nodes will hold the asset and only download a new version from the origin if the file has changed. This is the behavior we want, whereas adding a no-cache directive would have prevented the file from being held in CloudFront at all. For everyone concerned about the out of sync invalidations, I believe this should eliminate the need for any invalidations at all -- whenever the files change, they will be fetched anew from the origin server and stored at the edge level.

I have opened a PR to update the caching page of the docs with this information so hopefully other folks won't run into this issue in the future. Feel free to check it out here: https://github.com/gatsbyjs/gatsby/pull/16368

Thanks, @KyleAMathews!

And if anyone else in the thread has any questions/concerns, I'd be happy to explore this further.

lifehome commented 5 years ago

Hey @jaredsilver, mind taking a look at Cloudflare? We're using a mix of CloudFront and Cloudflare for cached bandwidth and security, though it seems that, despite the change working for CloudFront, Cloudflare still needs its cache cleared manually IMO.

jaredsilver commented 5 years ago

@lifehome It looks like Cloudflare does not cache page-data.json or index.html by default, which is good. Unless you've manually configured them to be cached, it hopefully shouldn't be a problem. If you have manually configured them to be cached, it's possible that the headers are wrong or that Cloudflare is overwriting your headers. I would load the files in your browser and verify that the must-revalidate and max-age=0 directives are there for index.html and page-data.json.

Note: if you're using a service worker, it looks like Cloudflare does cache that by default, since it ends in a .js extension. That could very likely be the issue. You can override that behavior with page rules.

kpennell commented 5 years ago

Having the same thing as @kimbaudi, it seems: really, really slow loading of data-heavy pages on mobile.

KyleAMathews commented 5 years ago

@kpennell is this consistently reproducible? What is your host?

kpennell commented 5 years ago

@KyleAMathews I'll invite you to a private repo. I think I'm just being a dummy and loading too much damn data on one page. I was hoping react-window + gatsby would magically make this possible. I'm using netlify.

kpennell commented 5 years ago

@KyleAMathews What's odd is that even when I'm loading a page that (in theory) should not be loading so much data, GTmetrix tests say the page is loading 10MB:

https://gtmetrix.com/reports/acrotagsprod.netlify.com/QvEbcg9m

[screenshot: GTmetrix report]

I thought perhaps it would only be loading so much data on my main page (with 6000 videos in a json)

edit: I also set up gatsby-awesome-pagination to see if I could load less data on paginated pages. In theory, I was hoping that by only loading 10 video cards (instead of several thousand) I could make the site much faster. The commons.js is 4MB, but the page-data.json is now 131kb. So quite a bit better.

kimbaudi commented 5 years ago

Sorry, I didn't respond sooner, but I am using Netlify to deploy and host my site. I should probably look into gatsby-plugin-netlify-cache. I will report back some time this weekend (or next week) after I install the plugin and see if it helps with this issue.

kimbaudi commented 5 years ago

Hi, I want to report back and say that installing gatsby-plugin-netlify seems to have solved the issue for me. I just installed the plugin (npm install --save gatsby-plugin-netlify) and waited for my site to deploy on Netlify. The result is immediate! GatsbyJS is extremely fast now! 🚀 🎉 😸

This issue can be closed, for me at least. I was wondering why my website was becoming unresponsive in the Chrome and Opera browsers on one of my old tablets after performing a hard refresh. Looking at the network activity, I noticed that the website came to a halt while waiting for page-data.json to load. But after installing gatsby-plugin-netlify, I no longer have this issue. And I now realize that page-data.json shouldn't be content hashed.

wardpeet commented 5 years ago

@kpennell your repository is pretty fascinating. The number of page-data.json files is amazing, in a bad way. Would you mind sharing it with me as well?

@kimbaudi is https://clever-kepler-aac0d1.netlify.com/ still the version that was slow before gatsby-plugin-netlify?

kpennell commented 5 years ago

@wardpeet Sure thing. I'm trying pagination as a way around this challenge.

tbrannam commented 5 years ago

If we use a combination of must-revalidate and max-age=0, CloudFront's edge nodes will hold the asset and only download a new version from the origin if the file has changed. [...]

Doesn't this imply at least a round trip to the origin to revalidate? It was nice that it was previously possible to skip revalidation entirely. Or am I misunderstanding what is being suggested?

gatsbot[bot] commented 5 years ago

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

jaredsilver commented 5 years ago

@tbrannam Sorry, I missed your reply!

Doesn't this imply at least a round trip to the origin to revalidate? It was a nice behavior before that it was possible to skip revalidation as well. Or am I misunderstanding what is being suggested?

Yes, I believe you are correct. However, because the files themselves are not being transferred and the roundtrip occurs between an edge node and the origin server, it should be absurdly fast. It seems highly unlikely that it would materially affect any important page-speed metric.

nihakue commented 4 years ago

If we use a combination of must-revalidate and max-age=0, CloudFront's edge nodes will hold the asset and only download a new version from the origin if the file has changed. This is the behavior we want, whereas adding a no-cache directive would have prevented the file from being held in CloudFront at all. For everyone concerned about the out of sync invalidations, I believe this should eliminate the need for any invalidations at all -- whenever the files change, they will be fetched anew from the origin server and stored at the edge level.

I believe you are confusing the 'no-cache' and 'no-store' directives. AFAICT, 'no-cache' and 'max-age=0, must-revalidate' behave exactly the same way, apart from the fact that with 'no-cache', if the origin fails to revalidate, the cache can still respond with the cached content, whereas with must-revalidate it is meant to return a 504 error. See this Stack Overflow question for more details.

Also, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching#Cache_validation

jaredsilver commented 4 years ago

@nihakue You could very well be right! I am not an expert on caching by any means. Are you certain that no-cache also allows for caching by intermediary caches? Would you be open to taking a look at the Caching part of the docs and updating anything that seems incorrect? https://github.com/gatsbyjs/gatsby/blob/master/docs/docs/caching.md

nihakue commented 4 years ago

@nihakue Are you certain that no-cache also allows for caching by intermediary caches?

Yes, pretty confident. That's why no-store exists.

No caching: The cache should not store anything about the client request or server response. A request is sent to the server and a full response is downloaded each and every time. (Cache-Control: no-store)

Cache but revalidate: A cache will send the request to the origin server for validation before releasing a cached copy. (Cache-Control: no-cache)

from: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching#Cache_validation

I've confirmed that CloudFront respects this behavior, both via the docs (https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html has a table) and by setting up my distribution with no-cache. Subsequent requests for a no-cache file respond with x-cache: RefreshHit from cloudfront.

I'll submit a PR to the docs.

sidharthachatterjee commented 4 years ago

Looks like this was answered and the (really great) discussion here also resulted in some improvements to docs.

Let's close this for now, since it doesn't look like there's anything else to do here. In case I'm wrong, please feel free to reopen to continue discussing this.

Thank you everyone for your thoughtful comments!

peabnuts123 commented 4 years ago

Apologies for posting in a long and already-closed thread, but I don't think this warrants a new issue. I'm just looking to understand why caching was chosen as the answer to this problem. It seems like a fingerprinted set of page-data.json files would completely solve it: stale and fresh page bundles would stay paired with their page data, and as people's caches expired on index.html, fresh data would be requested. Performance-wise, any files that didn't change would keep the same fingerprint, so only the data and the pages depending on it would change when the data was changed.

What problems are being caused here by having fingerprints on the data files?

lili21 commented 4 years ago

Like @peabnuts123 said, I don't understand why page-data.json files are not content-hashed either.

peabnuts123 commented 4 years ago

Since I last commented on this issue I've read more into it, and it seems that, as per the blog post linked above, the motivation for not hashing page-data.json files is that deployments made while a user is navigating the site result in a "live" data refresh (i.e. a user will not have to refresh the page to get the latest version of a page they have not already visited).

I have to say that in my opinion this doesn't seem like desirable behaviour, especially as a trade-off against the ability to cache your website in a CDN. As the structure of the JSON is very tightly coupled to the JS code bundle for your site, small copy changes may be reflected "live" (on pages the user has not yet visited), but structural changes will cause errors, meaning your site could very well crash while being navigated as a deployment occurs. Solving this problem, I would say, is the primary reason fingerprinting was brought into the mainstream to begin with.

Since this seems to be a decision that was actively made by contributors to the Gatsby project, it's unlikely to be undone any time soon, but it is probably at least worth raising a new issue to add a config option for whether a developer would like to hash page bundles or not.

KyleAMathews commented 4 years ago

Page bundles include the hash for the webpack build. If the data is from a new build, the gatsby site does a hard refresh to get the new data.

peabnuts123 commented 4 years ago

Ah, okay! Then all that's left is basically people willing to trade off this "live" feature for further performance gains, i.e. this feature is making it difficult for people to properly cache the page-data.json files.

From my not-so-intimate understanding of caching configuration, it seems that a series of headers and caching behaviours must be configured on your edge server in order to serve these files with high performance (i.e. caching, geo-replication, etc.), at the cost of a small lag time for new deployments. It seems like there is a pool of people who are okay with trading off this "live data" feature in order to have simpler deployments / hosting configurations. Is that fair to say?

christoferolaison commented 4 years ago

If anyone else tries this approach, it would be great to add comments on the results gotten.

@harrygreen thanks for sharing your solution!!

We are trying this approach and were close to deploying it, but hesitated because app-data.json is not hashed by the post-build script. We are currently investigating how we should proceed, but I was wondering whether you have pushed the post-build script to production, and whether you ran into any issues?

/C

aychtang commented 4 years ago

@KyleAMathews @peabnuts123 @sidharthachatterjee

I don't understand how you can have consistency while deploying when there is no versioning of page-data.json files. This can cause issues with page content during deployments (especially if there are many pages and you are updating everything in place).

In our deployment system we can describe it with an initial state of two variables, pageDataVersion + pages, where the possible versions are {"A", "B"}.

It doesn't matter what order you choose to make the updates: as soon as you do anything, there will be inconsistency, either with old pages loading new page data or with new pages loading old page data (see the toy model below).

The versioning hack solves this, as each page can point to the correct version at all times; I don't understand how this would work in the suggested approach.

Furthermore, no-cache always requires revalidation AFAIK, which is potentially not what some users want.

antoinerousseau commented 4 years ago

Stale page data makes your website crash if, for example, its object structure has changed. I think the files should be versioned again, as proposed (and not just in a plugin), and a try/catch mechanism should trigger a full page reload when the page data is not found anymore (see the sketch below).