gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License

[Gatsby Develop] Building a performant preview server (>10k nodes with dependent pages) #16616

Closed georgiee closed 5 years ago

georgiee commented 5 years ago

That initial description of the issue is already dated; follow the progress in the replies.


Summary

Here comes the essence of this question/issue:

In Gatsby's develop server with the refresh webhook endpoint enabled, we create all pages during the bootstrap phase to reflect the current state of our external store/CMS. How can we add, update, or remove single pages without relying on the createPages lifecycle? createPages iterates over all pages first and later runs all page queries, ignoring the fact that almost nothing changed. Even an empty webhook/refresh call can cause a rebuild time of many minutes.

Could createPagesStatefully (to create all pages once during bootstrap) together with onCreateNode (for any successive update on the nodes, for example by calling createPage) be a viable approach?
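One way to read this question is sketched below. It illustrates the idea only and is not a verified solution: createPagesStatefully, onCreateNode, getNodesByType and createPage are existing Gatsby APIs, but the FakeData node type is inferred from the allFakeData query, and slugFor and the template path are hypothetical. Whether calling createPage from onCreateNode is supported or fast enough is exactly what this issue asks.

```javascript
// Sketch only. In a real gatsby-node.js these two functions would be
// assigned to exports.createPagesStatefully and exports.onCreateNode.

// Hypothetical template path; a real project would resolve an absolute path.
const ARTICLE_TEMPLATE = "src/templates/article.js";

// Hypothetical helper: derive the page path from a node's title.
const slugFor = node => `/${node.title.toLowerCase().replace(/\s+/g, "-")}/`;

// Build the createPage argument for a single node.
const pageFor = node => ({
  path: slugFor(node),
  component: ARTICLE_TEMPLATE,
  context: { id: node.id },
});

// Create a page for every FakeData node that exists when this hook runs.
function createPagesStatefully({ actions, getNodesByType }) {
  getNodesByType("FakeData").forEach(node => actions.createPage(pageFor(node)));
}

// Called whenever a node is created; creates or refreshes only that node's page.
function onCreateNode({ node, actions }) {
  if (node.internal.type !== "FakeData") return;
  actions.createPage(pageFor(node));
}
```

The point of the split is that a later webhook call would only touch the page of the node that changed, instead of looping over all nodes again.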

What follows are some explanations around the topic, some details and, of course, an example project to showcase the problem. Everything we learned and researched comes directly from the excellent docs and source files. 90% of the experiment is based on the awesome work of @DSchau, who created almost everything in Gatsby's e2e suite 👌

This is how the experiment/demo project behaves:

You can see that all pages are recreated whenever something is received by the webhook. Here comes the full story:

Relevant information

When talking about large-scale environments, we mean Gatsby installations with a node count beyond 50,000 to 100,000 and the same number of derived pages. For this summary, imagine 50,000 news articles served by some headless CMS. Building with Gatsby in such an environment works for most people, as timing is acceptable, and the upcoming incremental build should help in cases where, for example, a single new node should only generate one additional page. It gets more interesting in Gatsby's develop mode with the page-hot-reloader and the webhook-with-payload functionality activated (ENABLE_GATSBY_REFRESH_ENDPOINT).

In such an environment, how can node updates be handled efficiently in Gatsby's develop server after the bootstrap? The bootstrap phase itself will take some time to build and reflect the current state of all data, which is fine, but how can we prevent Gatsby from (re)creating all pages and running all page queries when only a single node is updated, added, or deleted, depending on the webhook payload?

The product Gatsby Preview seems to help some people with that challenge, but it's unfortunately not an option when the client's infrastructure is located in a closed network. Hence our current challenge is to serve a custom preview based on Gatsby's develop server.

To be prepared for the technical challenges, we dug through many package sources including the core of Gatsby itself and we read all of the excellent documentation about the Gatsby Internals. Kudos for that awesome summary! Things are still not 100% clear to us but the mental image is already starting to build up.

We already achieved a working prototype by naively pinging the __refresh endpoint with no payload, with a handful of nodes and generated pages being processed. This was a really nice experience, but when we scaled things up it went south. We tried to build 5,000 pages and it already took many minutes (~20 min to rebuild all pages after a single node update). There are no images involved; it's the page processing. I'll spare you the details of that installation; I created an isolated experiment instead.

Example Project/Experiment

The experiment is based on @DSchau 's work on webhook/fake-source in Gatsby's e2e suite

Here is our example project: https://github.com/satellytes/gatsby-large-scale-preview-experiment

Run it and trigger some different webhook calls from a second session. Check the README for all of our thoughts from when we created the experiment. It somewhat overlaps with this issue description but might help clarify things.

INITIAL_NODES_TO_CREATE=1000 yarn develop

yarn webhook:full-sync
yarn webhook:new-item
yarn webhook:webhook-empty
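The implementation of these yarn scripts is not shown in this issue. As a rough sketch of what such a webhook call could look like, the following posts a JSON payload to the develop server's __refresh endpoint. It assumes the server runs on the default port 8000 with ENABLE_GATSBY_REFRESH_ENDPOINT set; buildRefreshRequest and triggerRefresh are hypothetical names, and the global fetch requires Node 18 or newer.

```javascript
// Assumption: gatsby develop listens on its default port 8000 and was
// started with ENABLE_GATSBY_REFRESH_ENDPOINT enabled.
const REFRESH_URL = "http://localhost:8000/__refresh";

// Build the POST request for the refresh endpoint.
// Calling it without arguments corresponds to an empty webhook call.
function buildRefreshRequest(payload = {}) {
  return {
    url: REFRESH_URL,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Send the request and resolve with the HTTP status code.
async function triggerRefresh(payload) {
  const { url, options } = buildRefreshRequest(payload);
  const response = await fetch(url, options);
  return response.status;
}
```

For example, triggerRefresh({ touchAll: true }) would send the flag described further down, while triggerRefresh() sends an empty payload.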

When running the example we create a set of initial nodes (INITIAL_NODES_TO_CREATE) by calling our new method api.hugeInitialSync only once in sourceNodes. The existing api.sync method is modified to accept a parameter updateAllNodes: true/false, which causes all nodes to be touched by incrementing their updated field.

When the refresh endpoint is hit we can now decide among those scenarios:

  1. add new items (through the webhook, already present in the e2e project)
  2. touch all nodes and create a new node from inside (triggered by a new flag touchAll in the webhook payload)
  3. do nothing
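The three scenarios above can be sketched as a small decision function over the webhook payload, which Gatsby passes to sourceNodes as webhookBody. The payload shape is an assumption: touchAll is the flag mentioned above, while the items array and the function name decideSyncMode are hypothetical.

```javascript
// Hypothetical payload shape: { items: [...], touchAll: boolean }.
// Returns which of the three scenarios a webhook call asks for.
function decideSyncMode(webhookBody) {
  const body = webhookBody || {};
  // Scenario 2: touch all nodes and create a new node from inside.
  if (body.touchAll) return "touch-all";
  // Scenario 1: add the new items delivered with the webhook.
  if (Array.isArray(body.items) && body.items.length > 0) return "new-items";
  // Scenario 3: empty or unknown payload, nothing to do.
  return "noop";
}
```

Inside sourceNodes, the result could then select between calling api.sync with updateAllNodes: true, creating only the delivered items, or returning early.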

The problem

Every time you post to the webhook, every single page is recreated, because we tell Gatsby to do so in createPages, which is called by the API runner if any page is dirty. It doesn't matter whether the payload is empty or filled.

  const { data } = await graphql(`
    {
      allFakeData {
        nodes {
          title
          fields {
            slug
          }
        }
      }
    }
  `)

  // Every node gets a page on every run, whether it changed or not
  data.allFakeData.nodes.forEach((node, index) => {
    createPage({})
    //...
  })

See our file gatsby-node.js for full sources.

The createPages lifecycle is the idiomatic approach which works during build time and it works for most people (including us with a few pages) also during develop time with the hot reload functionality.

As said, with the preview mode activated we trigger an update (with or without a payload), which creates every page again; in addition, all page queries are run again (this happens later in the lifecycle and also costs quite some time). As a result, any small update blocks the development server for minutes, depending on your machine and node count. We are unsure how to prevent Gatsby from doing so in the experiment with the fake API source.

Goals:

Here are some approaches:

It would be awesome if we could get a little discussion running around this topic, as this might be of interest for other people working with many pages + gatsby develop server/preview.

Source Insights

We have checked many parts of Gatsby's sources; here are some interesting files we dug through:

I'm sorry for the length of this post. I wanted to provide as much information as possible. I also joined the Discord channel, but I think the topic is worth discussing in this question issue.

Thanks for reading and I appreciate any input on this topic.

georgiee commented 5 years ago

We made some progress.

Idea: Mark every new node as new so we can query just those (instead of all nodes that already have a page). That way we have a smaller set of pages to query and rebuild.
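A rough sketch of the selection step behind this idea follows. Note the substitution: instead of a queryable marker on the node, it keeps an in-memory set of node ids that already received a page, which is simpler to show but lives only as long as the develop process. The names nodesWithPages and takeNewNodes are hypothetical, and updated or deleted nodes are not covered.

```javascript
// Ids of nodes that already received a page in this develop session.
const nodesWithPages = new Set();

// Return only the nodes that have no page yet and remember them,
// so the next call skips them.
function takeNewNodes(nodes) {
  const fresh = nodes.filter(node => !nodesWithPages.has(node.id));
  fresh.forEach(node => nodesWithPages.add(node.id));
  return fresh;
}
```

After the initial sync every node counts as new; after a webhook that adds one item, only that item is returned.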

georgiee commented 5 years ago

Well let's make this issue useful for other souls searching for a preview. I will add useful links to articles but mostly source files in Gatsby in this post:

I will try to keep this list updated.

georgiee commented 5 years ago

Can't believe this, the initial example setup is wrong:

activity.setStatus(
   `Creating ${index + 1} of ${totalPages} total pages`
);

The activity timer drastically slows down the example and gives the illusion createPages is running slow. This doesn't mean that we don't have real performance problems in our build but our whole isolation of the problem and the debugging is based on false facts.

You can mimic this behaviour by dropping this in your createPages:

const activity = reporter.activityTimer(`create pages`);
activity.start();

// Updating the status on every iteration is what makes this loop slow
for (let i = 0; i < 1000; i++) {
  activity.setStatus(
    `[DUMMY] Creating ${i + 1} of ${1000} total pages`
  );
}
activity.end();

This takes 7 seconds on my machine just to run the for loop. I found the activity timer because it is used by the page queries info spinner. The main difference: GraphQL queries are reported asynchronously, while I'm using a synchronous for loop. It might be worth raising this as an issue for the reporter/activity functionality.
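One possible workaround, sketched here under assumptions rather than measured, is to forward only every Nth status update to the activity timer. throttleStatus is a hypothetical helper, and createFakeActivity stands in for reporter.activityTimer so the snippet runs without Gatsby.

```javascript
// Wrap an activity so that setStatus is forwarded only on every `every`-th
// call, or when `force` is set (for example for the last item).
function throttleStatus(activity, every = 100) {
  let calls = 0;
  return (message, force = false) => {
    calls += 1;
    if (force || calls % every === 0) activity.setStatus(message);
  };
}

// Stand-in for reporter.activityTimer(...) that records the statuses it receives.
function createFakeActivity() {
  const statuses = [];
  return {
    statuses,
    start() {},
    setStatus(status) { statuses.push(status); },
    end() {},
  };
}

const fakeActivity = createFakeActivity();
fakeActivity.start();
const setStatus = throttleStatus(fakeActivity, 100);
const total = 1000;
for (let i = 0; i < total; i++) {
  // Force the update for the final item so the last status is always shown.
  setStatus(`Creating ${i + 1} of ${total} total pages`, i === total - 1);
}
fakeActivity.end();
```

With these numbers the timer receives 10 status updates instead of 1000; how much time that saves with the real reporter would have to be measured.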

gatsbot[bot] commented 5 years ago

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It's been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

georgiee commented 5 years ago

Let's close this until we have a more specific problem to talk about.

nadinagray commented 4 years ago

@georgiee would love your insights if you've got a functioning solution -- paying for the Gatsby Preview currently. Encountering issues re: support responsiveness and evaluating building our own solution.

sidharthachatterjee commented 4 years ago

@nadinagerlach Apologies for the issues with support responsiveness. I've gone ahead and responded to all your tickets and taken care of the issue as well! 🙂

georgiee commented 4 years ago

Hello @nadinagerlach, we currently focus on getting the actual page implementations done and postponed the work on the preview server. We have had some agile spikes to explore possibilities.

Some things we considered:

The last time I personally worked on our preview server problem was summer 2019. A lot has happened since then, and maybe more resources on building a preview server have appeared? There is an excellent documentation section about the internals of Gatsby called Gatsby Internals. Reading that together with the Gatsby source code helped a lot, but more guidance on building your own preview server would help even more, as it's such a crucial part of a Gatsby installation beyond a certain size.

I hope you have a better experience and I would be happy to hear about your preview experiences.