Closed georgiee closed 5 years ago
We made some progress.
createPagesStatefully is the way to go, at least for the initial bootstrap that generates most of the pages. These pages are indeed stateful because we usually don't change them.
When a node is updated, those pages still update, and only the pages connected with that node. I don't know what was wrong in my example.
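A minimal sketch of what such a stateful bootstrap could look like in gatsby-node.js. The node type FakeData, the fields.slug field and the template path are assumptions for illustration and are not taken from the experiment repository:

```javascript
// Sketch for gatsby-node.js. `FakeData`, `fields.slug` and the template path
// are hypothetical names used for illustration only.
const path = require("path");

const createPagesStatefully = async ({ graphql, actions }) => {
  const { createPage } = actions;

  const result = await graphql(`
    {
      allFakeData {
        nodes {
          id
          fields {
            slug
          }
        }
      }
    }
  `);
  if (result.errors) {
    throw new Error("Query for createPagesStatefully failed");
  }

  const component = path.resolve("src/templates/article.js");
  for (const node of result.data.allFakeData.nodes) {
    // Pages created in this hook are flagged as stateful, so the page hot
    // reloader does not delete them when a later refresh leaves them untouched.
    createPage({ path: node.fields.slug, component, context: { id: node.id } });
  }
};

// In gatsby-node.js: exports.createPagesStatefully = createPagesStatefully;
```

As noted further down in the issue, this avoids re-creating the pages on refresh but does not by itself stop the page queries from being re-run.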
For any new node, we can't create a page in onCreateNode, as those pages are the default/dynamic/non-stateful ones.
The hot reloading mechanism deletes every page that isn't touched, see https://github.com/gatsbyjs/gatsby/blob/0260f88a43123cfc3b17124c4aba5e11aebc28ea/packages/gatsby/src/bootstrap/page-hot-reloader.js#L43-L52
So we are basically not supposed to use createPage outside the createPages lifecycle, as those pages are deleted in the next lifecycle round.
Idea: Mark every new node as new so we can query just those nodes (instead of all nodes that already have a page). That way we have a smaller set of pages to query and rebuild.
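One way this idea might be sketched, untested against a real site: remember when the bootstrap has finished and attach a field to every node created afterwards. The node type FakeData and the field name isNew are assumptions. Note that onCreateNode also fires when an existing node is re-created with changed content, so updated nodes would be marked as well:

```javascript
// Sketch for gatsby-node.js. `FakeData` and `isNew` are hypothetical names.
let bootstrapFinished = false;

// onPostBootstrap runs once the initial bootstrap is done.
const onPostBootstrap = () => {
  bootstrapFinished = true;
};

const onCreateNode = ({ node, actions }) => {
  if (node.internal.type !== "FakeData") return;
  // Nodes created during the bootstrap get isNew: false,
  // nodes arriving later (e.g. via the refresh endpoint) get isNew: true.
  actions.createNodeField({ node, name: "isNew", value: bootstrapFinished });
};

// A query restricted to the marked nodes could then look like this.
const NEW_NODES_QUERY = `
  {
    allFakeData(filter: { fields: { isNew: { eq: true } } }) {
      nodes {
        id
      }
    }
  }
`;

// In gatsby-node.js:
// exports.onPostBootstrap = onPostBootstrap;
// exports.onCreateNode = onCreateNode;
```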
Well, let's make this issue useful for other souls searching for a preview. I will add useful links to articles, but mostly to source files in Gatsby, in this post:
I will try to keep this list updated.
Can't believe this, the initial example setup is wrong:

    activity.setStatus(
      `Creating ${index + 1} of ${totalPages} total pages`
    );

The activity timer drastically slows down the example and gives the illusion that createPages is running slowly. This doesn't mean that we don't have real performance problems in our build, but our whole isolation of the problem and the debugging was based on false assumptions.
You can mimic this behaviour by dropping this in your createPages:
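
    // `reporter` is provided by Gatsby in the createPages arguments.
    const activity = reporter.activityTimer(`create pages`);
    activity.start();
    for (let i = 0; i < 1000; i++) {
      // The status is updated synchronously on every single iteration.
      activity.setStatus(`[DUMMY] Creating ${i + 1} of ${1000} total pages`);
    }
    activity.end();
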
This takes 7 seconds on my machine just to run the for loop. I found the activity timer because it's used by the page queries info spinner. The main difference: GraphQL queries are reported asynchronously, while I'm using a synchronous for loop. It might be worth raising this as an issue for the reporter/activity functionality.
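A possible workaround, sketched under the assumption that the cost comes from calling setStatus on every iteration: only report every N pages. The helper createStatusReporter is a hypothetical name and not part of Gatsby; it only relies on the setStatus method of the activity timer used above:

```javascript
// Hypothetical helper: forwards a status update only every `every` pages
// and for the very last page, instead of on each iteration.
const createStatusReporter = (activity, total, every = 100) => (index) => {
  const done = index + 1;
  if (done % every === 0 || done === total) {
    activity.setStatus(`Creating ${done} of ${total} total pages`);
  }
};

// Possible usage inside createPages (`pages` is a hypothetical array):
//   const activity = reporter.activityTimer(`create pages`);
//   activity.start();
//   const report = createStatusReporter(activity, pages.length);
//   pages.forEach((page, index) => {
//     createPage(page);
//     report(index);
//   });
//   activity.end();
```

With 1000 pages and the default interval this reduces the number of status updates from 1000 to 10.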
Let's close this until we have a more specific problem to talk about.
@georgiee would love your insights if you've got a functioning solution -- paying for the Gatsby Preview currently. Encountering issues re: support responsiveness and evaluating building our own solution.
@nadinagerlach Apologies for the issues with support responsiveness. I've gone ahead and responded to all your tickets and taken care of the issue as well!
Hello @nadinagerlach, we are currently focusing on getting the actual page implementations done and have postponed the work on the preview server. We have had some agile spikes to explore possibilities.
Some things we considered:
The last time I personally worked on our preview server problem was summer 2019. A lot has happened since then, and maybe some more resources on building a preview server have appeared? There is an excellent documentation section about all the internals of Gatsby called Gatsby Internals. Reading that together with the Gatsby source code helped a lot, but it would help even more to have guidance on building your own preview server, as it's such a crucial part of a Gatsby installation beyond a certain size.
I hope you have a better experience and I would be happy to hear about your preview experiences.
The initial description of the issue is already dated; follow the progress in the replies.
Summary
Here comes the essence of this question/issue:
What follows are some explanations around the topic, some details and of course an example project to showcase the problem. Everything we learned and researched comes directly from the excellent docs and source files. 90% of the experiment is based on the awesome work of @DSchau, who created almost everything in Gatsby's e2e suite.
That's how the experiment/demo project behaves
You can see that all pages are recreated whenever something is received by the webhook. Here comes the full story:
Relevant information
When talking about large-scale environments, we mean Gatsby installations with a node count beyond 50,000 to 100,000 and the same number of derived pages. Imagine 50,000 news articles served by some headless CMS for this summary. Building with Gatsby in such an environment works for most people, as timing is acceptable and the upcoming incremental build should help in cases where, for example, a single new node should only generate one additional page. It gets pretty interesting in Gatsby's develop mode with the page-hot-reloader and the webhook-with-payload functionality activated (ENABLE_GATSBY_REFRESH_ENDPOINT).
In such an environment, how do we handle node updates efficiently in Gatsby's develop server after the bootstrap? The bootstrap phase itself will take some time to build and reflect the current state of all data, which is fine. But how do we prevent Gatsby from (re-)creating all pages and running all page queries when only a single node is updated/added/deleted, depending on the webhook payload?
The product Gatsby Preview seems to help some people with that challenge, but it's unfortunately not an option when the client's infrastructure is located in a closed network. Hence our current challenge is to serve a custom preview based on Gatsby's develop server.
To be prepared for the technical challenges, we dug through many package sources including the core of Gatsby itself and we read all of the excellent documentation about the Gatsby Internals. Kudos for that awesome summary! Things are still not 100% clear to us but the mental image is already starting to build up.
We already achieved a working prototype by naively pinging the __refresh endpoint with no payload, with a handful of nodes and generated pages being processed. This was a really nice experience, but when we scaled things up it went south. We tried to build 5,000 pages and it already took many minutes (~20min to rebuild all pages after a single node update). There are no images involved, it's the page processing. I'll spare you the details of that installation and created an isolated experiment instead.
Example Project/Experiment
The experiment is based on @DSchau 's work on webhook/fake-source in Gatsby's e2e suite
Here is our example project: https://github.com/satellytes/gatsby-large-scale-preview-experiment
Run it and trigger some different webhook calls from a second session. Check the README for all of our thoughts when we created the experiment. It somewhat overlaps with this issue description but might help clarify things.
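For reference, a webhook call from a second session could be sketched with Node's built-in http module as follows. It assumes the develop server runs on Gatsby's default port 8000 and was started with ENABLE_GATSBY_REFRESH_ENDPOINT set; the helper names and the example payload shape are hypothetical:

```javascript
const http = require("http");

// Builds the request for Gatsby's refresh endpoint. An empty payload
// results in a POST without a body.
const buildRefreshRequest = (payload, { host = "localhost", port = 8000 } = {}) => {
  const body = payload ? JSON.stringify(payload) : "";
  return {
    body,
    options: {
      host,
      port,
      path: "/__refresh",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
      },
    },
  };
};

// Sends the request and resolves with the HTTP status code.
const postRefresh = (payload, target) =>
  new Promise((resolve, reject) => {
    const { body, options } = buildRefreshRequest(payload, target);
    const req = http.request(options, (res) => {
      res.resume(); // drain the response so that "end" fires
      res.on("end", () => resolve(res.statusCode));
    });
    req.on("error", reject);
    req.end(body);
  });

// Possible usage (payload shape is an assumption):
//   postRefresh({ touchAll: true }).then((status) => console.log(status));
```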
When running the example, we create a set of initial nodes (INITIAL_NODES_TO_CREATE) by calling our new method api.hugeInitialSync only once in sourceNodes. The existing api.sync method is modified to accept a parameter updateAllNodes: true/false, which causes all nodes to be touched as a field updated is incremented.
When the refresh endpoint is hit, we can now decide among those scenarios:
touchAll in the webhook payload)
The problem
Every time you post to the webhook, every single page is recreated, because we tell Gatsby to do so in createPages, which is called by the api runner if any page is dirty. It doesn't matter if the payload is empty or filled.
See our file gatsby-node.js for full sources.
The createPages lifecycle is the idiomatic approach, which works during build time, and it works for most people (including us with a few pages) also during develop time with the hot reload functionality.
As said, with the preview mode activated we trigger an update (with or without a payload), which creates every page, and all page queries are run again in addition (this happens later in the lifecycle and also costs quite some time). That makes any small update block the development server for minutes, depending on your machine and node count. We are unsure how to prevent Gatsby from doing so in the experiment with the fake api source.
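To illustrate where the payload arrives: Gatsby passes the body of the refresh request to sourceNodes as webhookBody. The following sketch only creates or updates the nodes named in the payload. The payload shape (an items array with a uuid per item) and the node type FakeData are assumptions, and this alone does not stop createPages from running again:

```javascript
// Sketch for gatsby-node.js. The payload shape `{ items: [{ uuid, ... }] }`
// and the node type `FakeData` are hypothetical.
const sourceNodes = ({ actions, createNodeId, createContentDigest, webhookBody }) => {
  const { createNode } = actions;
  const items = (webhookBody && webhookBody.items) || [];

  for (const item of items) {
    // Re-creating a node with the same id and a new content digest
    // updates the existing node instead of adding a second one.
    createNode({
      ...item,
      id: createNodeId(`fake-data-${item.uuid}`),
      parent: null,
      children: [],
      internal: {
        type: "FakeData",
        contentDigest: createContentDigest(item),
      },
    });
  }

  return items.length;
};

// In gatsby-node.js: exports.sourceNodes = sourceNodes;
```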
Goals:
Here are some approaches:
Skip createPages and call createPage in onCreateNode instead, as it's available through the boundActionCreators.
Use the createPagesStatefully lifecycle hook? We tried that and indeed the pages are not re-created upon refresh as intended; however, all the page queries are re-evaluated nevertheless.
Use createPageDependency to prevent an updated node from triggering an update for all nodes of the same type?
It would be awesome if we could get a little discussion running around this topic, as this might be of interest for other people working with many pages + gatsby develop server/preview.
Source Insights
We have checked many parts of Gatsby's sources, here some interesting files we dug through:
page-hot-reloader.js: You can see that added/deleted nodes set the pagesDirty flag, which causes all pages to be created once the api runner has settled.
develop.js: That's where the refresh/preview mode (ENABLE_GATSBY_REFRESH_ENDPOINT) is activated. We can see that sourceNodes is triggered.
utils/source-nodes.js: We can see how the api runner is activated. That's when we technically understood why the webhook causes the lifecycle createPages to be called.
gatsby-source-graphql/src/gatsby-node.js: We found createPageDependency in the wild (inside a plugin/source) only in the gatsby-source-graphql plugin.
What's happening in the internal redux store can be seen in packages/gatsby/src/redux. It did not help much. We looked up the things happening around page dependencies.
I'm sorry for the length of the topic. I wanted to provide as much information as possible. I also joined the Discord channel, but I think the topic is worth discussing in this question issue.
Thanks for reading and I appreciate any input on this topic.