gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License
55.2k stars 10.33k forks

[Request] Real-world Gatsby sites (50k+ pages) #19512

Closed pvdz closed 4 years ago

pvdz commented 4 years ago

Hello kind Gatsby user 👋

My name is Peter, and I’m a Gatsby employee focused on performance and scalability.

We at Gatsby are always looking for ways to improve the performance of building your Gatsby applications: making Gatsby scale to hundreds of thousands of pages (or more!) while keeping the build process as lightning quick as the resulting Gatsby application.

To best support this endeavor, we need your help! We have benchmarks in the repo, but they tend to be quite contrived and not necessarily indicative of real-world usage of Gatsby.

Specifically, we're looking for sites that:

For this first batch, we'll be using these real-world applications to identify low-hanging fruit as it relates to performance so we can make the Gatsby build process ever faster and ever more scalable.

Does this sound like you? Please share a link to your application's source code below or e-mail any necessary details to peter at gatsbyjs dot com. We appreciate you 💜

Thanks! Onwards and upwards 📈🚀

eads commented 4 years ago

Hi there! Yes I'd very much like to help. My biggest Gatsby site has ~24k pages but I figure it's still a pretty decent one to take a look at. You can see it live at https://govbook.chicagoreporter.com/en/ and the code is open source at https://github.com/thechicagoreporter/govbook

It runs on Gatsby v2. No secrets required, and we track the source data in the repo for now, so you shouldn't even need to pull down fresh data. It does depend on having SQLite installed; I don't think anything else is required other than the standard Gatsby dependencies.

Also tagging in @artfulaction who is contracted to work on the project through at least the end of the year.

I've had a helluva time getting it to build on Netlify and AWS Amplify, so that's been a persistent issue. Thus far, to develop the site locally I just limit the size of the query manually, which isn't ideal either.
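One common way to do this kind of local cap (a hypothetical sketch, not code from the govbook repo — the query shape, `DEV_PAGE_CAP`, and template path are all assumptions) is to slice the sourced rows in `gatsby-node.js` only when running in development:

```javascript
// Hypothetical sketch: cap the number of pages created during local
// development so `gatsby develop` stays fast, while CI builds everything.
const DEV_PAGE_CAP = 500;

// Pure helper so the capping logic is easy to test in isolation.
function capForDev(rows, isDev, cap = DEV_PAGE_CAP) {
  return isDev ? rows.slice(0, cap) : rows;
}

// In gatsby-node.js this might be wired up like:
// exports.createPages = async ({ graphql, actions }) => {
//   const { data } = await graphql(`{ allContact { nodes { id slug } } }`);
//   const rows = capForDev(
//     data.allContact.nodes,
//     process.env.NODE_ENV === "development"
//   );
//   rows.forEach((node) =>
//     actions.createPage({
//       path: `/contacts/${node.slug}/`,
//       component: require.resolve("./src/templates/contact.js"),
//       context: { id: node.id },
//     })
//   );
// };

module.exports = { capForDev };
```

An environment-variable toggle instead of `NODE_ENV` would also work, and avoids surprises when someone runs a full `gatsby build` locally.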

The database that drives it is about 8k rows, and there's a row per page. How is it over 24k pages then? Because there's a Spanish version, an English version, and the redirect page. Any language we add will add another 8k pages to the build, and we're very much hoping to get a few more languages (especially Mandarin and Polish) into the site in the next year or so.

Longer term, we hope to bake out VCF files for each of the contacts in the database, so that will add tens of thousands of additional files as well and should represent an interesting use case for the Gatsby toolchain.

Thanks for doing this :clap:

pvdz commented 4 years ago

@eads this is a great example of what I'm looking for. Thank you :)

Swizec commented 4 years ago

Hi,

I don't quite fit your requirements, but I do have a Gatsby build that takes upwards of 40 minutes on my local machine, crashes Zeit, and makes Netlify choke. Both also choke when I try to upload the resulting static package of 18,000+ files.

It's been fun :D

Right now Gatsby's cloud service comes the closest to working.

Here's the repo and specific branch: https://github.com/Swizec/swizec-blog/tree/many-posts

It's about 1400 pages in all, but the image and embed processing kills everything. Even have to increase Node's heap size to make the build survive.

You can't see it live anywhere because I'm trying to avoid having to set up my own VPS+CDN and such. That was one of my original motivations for moving to Gatsby in the first place – hoping for an easy way to host and set up with modern tools.

pvdz commented 4 years ago

@Swizec thanks! Your site may not have as many pages but those images sure keep the cores spinning. Not sure how much we can improve on that since image processing is simply expensive. However, we do see some problems with the social plugin, and some room for improvement in fetching external resources. Thank you :)

pvdz commented 4 years ago

cc @brod-ie @ashtonsix You've mentioned in other issues that you have mega big sites. Any chance we could get a slice of that for benchmarking? It could ultimately help your site as well.

rjyo commented 4 years ago

Hi @pvdz, Gatsby is really nice and I would like to provide an example here.

It has around 48K pages with 200K+ records and growing.

With the queries optimized, it builds in under 5 minutes with just gatsby build on a 9700K with 32GB RAM.

success open and validate gatsby-configs - 0.161s
success load plugins - 2.368s
success onPreInit - 0.008s
success delete html and css files from previous builds - 0.007s
success initialize cache - 0.006s
success copy gatsby files - 0.038s
success onPreBootstrap - 0.824s
success loading DatoCMS content - 4.214s
success source and transform nodes - 4.593s
warn On types with the `@dontInfer` directive, or with the `infer` extension set to `false`, automatically adding fields for children types
is deprecated.
In Gatsby v3, only children fields explicitly set with the `childOf` extension will be added.
success building schema - 0.486s
success createPages - 39.880s
success createPagesStatefully - 0.052s
success onPreExtractQueries - 0.005s
success update schema - 22.360s
success extract queries from components - 0.395s
success write out requires - 0.075s
success write out redirect data - 0.001s
success Build manifest and related icons - 0.072s
success onPostBootstrap - 0.105s
info bootstrap finished - 75.080 s
success Building production JavaScript and CSS bundles - 11.084s
success Rewriting compilation hashes - 0.002s
success run queries - 72.590s - 46529/46529 640.98/s
success Building static HTML for pages - 87.670s - 46526/46526 530.70/s
info Done building in 246.738346453 sec

Everything went smoothly until this morning, when we upgraded to 2.18+ and the build speed dropped dramatically when running batched GraphQL in gatsby-node.js. Still investigating the reason. (And that's why I saw this request.)

Thank you all for the great Gatsby!

asilgag commented 4 years ago

Hi @pvdz and @sidharthachatterjee,

Coming here from https://github.com/gatsbyjs/gatsby/issues/19718. Glad to help on improving Gatsby performance!

Our case scenario:

This is the build log we get:

success open and validate gatsby-configs - 0.786s
success load plugins - 0.798s
success onPreInit - 0.059s
success delete html and css files from previous builds - 0.010s
success initialize cache - 0.008s
success copy gatsby files - 0.019s
success onPreBootstrap - 0.003s
success source and transform nodes - 2.530s
success building schema - 1.080s
success createPages - 20.736s
success createPagesStatefully - 0.053s
success onPreExtractQueries - 0.001s
success update schema - 0.025s
success extract queries from components - 0.316s
success write out requires - 0.072s
success write out redirect data - 0.002s
success Build manifest and related icons - 0.027s
success onPostBootstrap - 0.059s
info bootstrap finished - 28.844 s

success Building production JavaScript and CSS bundles - 7.128s
success Rewriting compilation hashes - 0.002s
success run queries - 34.002s - 19563/19563 575.36/s
success Building static HTML for pages - 37.959s - 19555/19555 515.15/s

info Done building in 105.479798822 sec

We would like to improve the following:

All in all, we would like some kind of flag to make Gatsby work as a "fully static site generator". I mean:

I know this could sound stupid: "why turn Gatsby into a traditional static site generator like Hugo or Jekyll?". Well, apart from solving our scaling issues with AMP, I can't imagine working without React components, even if they are only used to generate static HTML without any further JS interactivity. Hugo and Jekyll are fine, but React's simplicity and working with components are key for us (and for a lot of people, I think).

I can't publicly share any further detail here, but I'll reach you by email with more details.

Thanks!

ganapativs commented 4 years ago

I had a huge problem with scalability with Gatsby earlier.

Issue: https://github.com/gatsbyjs/gatsby/issues/17233

I had to switch to Next.js because of this. Happy to see that Gatsby team is prioritizing scalability 👍

pvdz commented 4 years ago

@rjyo the regression came with the shadowing feature that landed a few days ago. We're looking into the regression and how to best mitigate it. I don't suppose I could build your site myself for benchmarking purposes? :) Thanks for the feedback!

@asilgag we kind of need the page-data.json per page, if nothing else, for later parallelization. Each page becomes an individual job, and that way we would be able to spread the load over multiple cores, something we can't do just yet. We should be able to improve the situation, though. And if you don't save page-data.json to disk you'd have to retain it in memory, which certainly does not scale for most people (although some can certainly just throw money at it). I will take your suggestions into consideration when contemplating next steps in scaling perf and get back to you on them. Thank you!

rjyo commented 4 years ago

@pvdz I just upgraded to 2.18.4 and the performance regression is gone! createPages took about 20s more than on 2.17.x builds, but updateSchema's time went down from 20s+ to less than 1s, i.e. the sum is quite steady.

Thanks for your information!

pvdz commented 4 years ago

@ganapativs Sorry to hear that! I am definitely interested in your case and will be looking into it, regardless. Thanks for the test case :)

rjyo commented 4 years ago

@pvdz After running on 2.18.4 with dozens of hourly builds on CI, around 50% of the builds failed on createPages

...
error "gatsby-node.js" threw an error while running the createPages lifecycle:
Cannot read property 'rocket' of null
  TypeError: Cannot read property 'rocket' of null
...

where `rocket` should be returned by the GraphQL query. Note: there are a bunch of queries running in createPages, and most had already finished without any problem.

Retrying the job again gives about a 50% success rate.

Hope you guys can find the problem. Please contact me directly if there's any debug info I can provide.

Thanks!

pvdz commented 4 years ago

@rjyo that doesn't sound good. Can you open a new issue for this (if you haven't already done so)? And try it on 2.18.5? It contains a fix that may already resolve your problem.

rjyo commented 4 years ago

@pvdz Thanks! I just tried 2.18.5 and the first attempt went well. The build time is quite similar to those of 2.17.x: less time on createPages, with updateSchema taking its usual share again.

I'll let it run for some more builds and let you know the results.

Thanks again!

pvdz commented 4 years ago

Glad to hear that :) I'm working on keeping better tabs on scaling performance regressions. Please do feel free to ping when you see something regress unexpectedly. That goes for anyone.

pvdz commented 4 years ago

@eads good news! If you weren't using the CI=true flag yet, you're going to get an even better build time :D If you are using it already, well, good :) I'm changing the logger which drops the govbook build time from 210s to 140s for me locally :D ( https://github.com/gatsbyjs/gatsby/pull/19866 )

For anyone else: this PR affects the progress bar, so if you were testing large sites with default settings, you should get a perf win as well.

Note that if you're building in CI then setting CI=true is a good idea. It'll reduce log spam. After the aforementioned PR gets merged it won't matter much anymore in terms of Gatsby perf.
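For reference, the flag is just an environment variable on the build command (this is the standard way CI systems signal a non-interactive terminal, not something specific to this thread):

```shell
# Disable the interactive progress bar / fancy logger during CI builds,
# which reduces log spam (and, before the linked PR, logging overhead).
CI=true gatsby build
```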

prashant1k99 commented 4 years ago

> (quoting @rjyo's earlier comment above in full: the ~48K-page DatoCMS site, its build log, and the 2.18+ regression)

@rjyo Can you share the running site URL? I'm really curious about the website...

pvdz commented 4 years ago

@prashant1k99 that sounds like #5002 :)

muescha commented 4 years ago

@pvdz Look at #9083 there are also 2 users with large pages:

crock commented 4 years ago

I have a Gatsby site that's currently not live, as I'm still trying to work out whether Gatsby is gonna work out: I have 200k+ rows in a MySQL database, and each row would be a single page.

Is this a site you would want to use? It's relatively simple. It is a Twitch.tv clip aggregator that just embeds an iframe on each page along with a comment system.

pvdz commented 4 years ago

@crock Yeah absolutely! Can you post the build durations (for each step) you're currently getting?

disintegrator commented 4 years ago

We don't have a large number of pages but we do have a large number of nodes in our graph which is killing our build performance at the stage where Gatsby is building the GraphQL schema. I've described the problem in greater depth here: https://github.com/gatsbyjs/gatsby/issues/20197

pvdz commented 4 years ago

We were able to triage @disintegrator's problem down to "unnecessary" inference; creating a type schema for the context dropped the biggest build step (type inference) from 5 minutes down to 11 seconds. See that issue for more details.

This is something we probably want to try and automate (detect, warn, auto-create schema, win)

muescha commented 4 years ago

@pvdz Look at #5002 — there is also 1 user with large pages:

pvdz commented 4 years ago

For anyone still tracking this. I haven't forgotten! It just takes time. I'm still interested in large example sites. This helps me to uncover problems that smaller sites don't exhibit. Generally this is "big oh" stuff, but that's just the low hanging fruit.

Solving these problems helps us to improve the build pipeline in general. This in turn helps you.

It doesn't only help us, though; it might also help you in particular! Here are some examples of the direct impact of this effort so far (I might edit these into the top post for visibility);

Not reported here but related to scaling up images:

Type inference bottlenecks:

Many page site with no external deps:

With that, we can currently build a site with 200k+ pages in about 5 minutes with Gatsby. Images do bog this down, as is inherent to images. And you can always do things that throw you off the happy perf path (like passing a lot of data through context without using type schemas).

So, please, keep showing me your sites at scale. I can't promise you I'll have time for it immediately but I can promise you that I'll take a look once I have the time. And who knows what it might fix.

Please chime in if you've noticed your scaling site perf has improved (or regressed?).

If build times remain a problem, have you tried our new Builds service? :)

Things on my shortlist in no particular order;

Thanks for sharing everyone :100:

eads commented 4 years ago

@pvdz I'm still following! Many thanks for all your work on this. I'm excited to try out the improvements in the coming weeks and will report back. Maybe my site will finally build!

pvdz commented 4 years ago

@eads it's not building? I've been using that site to benchmark certain things for a while. It should build with few problems. There's plenty of room for improvement, like adding a GraphQL schema, not passing the entire data structure through the context, using static queries, etc. But as it is, it should run fine. And with my latest fix you wouldn't even need to switch to filtering by id (it would still be slightly faster, but that delta is minor now).

moroshko commented 4 years ago

Hey @pvdz,

I have a fairly small (<100 pages) site where gatsby build (both locally and in CI) fails on Node 10.16.3 with:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Repo is here and failing build is here.

Any hints to what may cause the failure?

pvdz commented 4 years ago

@moroshko that sounds interesting. 100 pages shouldn't trigger that, although it might depending on what's in there. Have you tried expanding the available memory? (node --max_old_space_size=2000 node_modules/.bin/gatsby build) I'll try to have a look at it this week.

moroshko commented 4 years ago

@pvdz Increasing the available memory helped the build to pass locally. But I couldn't find how to increase the memory in CI (GitHub Actions).
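One way to do this in CI (an assumption on my part, not something from this thread) is via the `NODE_OPTIONS` environment variable, which Node honors for `--max-old-space-size` without needing to invoke `node` directly:

```shell
# GitHub Actions workflow fragment (.github/workflows/build.yml):
#   - name: Build site
#     run: gatsby build
#     env:
#       NODE_OPTIONS: --max-old-space-size=4096
#
# Or inline, which works in any CI that runs shell steps:
NODE_OPTIONS="--max-old-space-size=4096" npx gatsby build
```

This avoids having to locate `node_modules/.bin/gatsby` and prefix it with `node` flags manually.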

pvdz commented 4 years ago

@moroshko your problem is related to webpack. In particular minification and in very particular sourcemapping. For example, if you run gatsby build --no-uglify it passes just fine and the build completes in ~60s (40s is for webpack, that's the "building production JS and CSS bundles"-step).

If I look at the public folder afterwards I see a bunch of 2.5mb JS files. Those are definitely the source of this problem. Question now is: why. The big file names seem to match the components in the siteMetadata in gatsby-config.js. My guess is webpack is somehow creating chunks which include "all the things", for whatever reason. Can you see if you can solve it that way? Please circle back to us if you think this is a problem within Gatsby (open a new ticket so we can track it properly).

-rw-rw-r--  1 2731406 Feb 13 20:07 component---src-pages-components-button-index-js-771ab1f296f72e990cbb.js
-rw-rw-r--  1 3200061 Feb 13 20:07 component---src-pages-components-button-index-js-771ab1f296f72e990cbb.js.map
-rw-rw-r--  1 6292 Feb 13 20:07 component---src-pages-components-button-resources-mdx-bb7e80fcbb5680269f34.js
-rw-rw-r--  1    3435 Feb 13 20:07 component---src-pages-components-button-resources-mdx-bb7e80fcbb5680269f34.js.map
-rw-rw-r--  1    6267 Feb 13 20:07 component---src-pages-components-button-usage-mdx-dff9888c164b4bd2d48b.js
-rw-rw-r--  1    3398 Feb 13 20:07 component---src-pages-components-button-usage-mdx-dff9888c164b4bd2d48b.js.map
-rw-rw-r--  1 2731772 Feb 13 20:07 component---src-pages-components-checkbox-index-js-2b6ff181a939e025c46c.js
-rw-rw-r--  1 3200643 Feb 13 20:07 component---src-pages-components-checkbox-index-js-2b6ff181a939e025c46c.js.map
-rw-rw-r--  1    6299 Feb 13 20:07 component---src-pages-components-checkbox-resources-mdx-a4404e129d9bcd1ac5a8.js
-rw-rw-r--  1    3443 Feb 13 20:07 component---src-pages-components-checkbox-resources-mdx-a4404e129d9bcd1ac5a8.js.map
-rw-rw-r--  1    6274 Feb 13 20:07 component---src-pages-components-checkbox-usage-mdx-73fb39d8b1900a45b135.js
-rw-rw-r--  1    3406 Feb 13 20:07 component---src-pages-components-checkbox-usage-mdx-73fb39d8b1900a45b135.js.map
-rw-rw-r--  1 2732963 Feb 13 20:07 component---src-pages-components-container-index-js-a75b142fb6ebe61f7e32.js
-rw-rw-r--  1 3198189 Feb 13 20:07 component---src-pages-components-container-index-js-a75b142fb6ebe61f7e32.js.map
-rw-rw-r--  1    6302 Feb 13 20:07 component---src-pages-components-container-resources-mdx-b4610919ab8ddbc64b94.js
-rw-rw-r--  1    3447 Feb 13 20:07 component---src-pages-components-container-resources-mdx-b4610919ab8ddbc64b94.js.map
-rw-rw-r--  1    6277 Feb 13 20:07 component---src-pages-components-container-usage-mdx-a830382f44eead740053.js
-rw-rw-r--  1    3410 Feb 13 20:07 component---src-pages-components-container-usage-mdx-a830382f44eead740053.js.map
-rw-rw-r--  1 2731169 Feb 13 20:07 component---src-pages-components-date-picker-index-js-fd3831a07f5212003dd9.js
-rw-rw-r--  1 3199955 Feb 13 20:07 component---src-pages-components-date-picker-index-js-fd3831a07f5212003dd9.js.map
etc

moroshko commented 4 years ago

@pvdz Thanks for looking into this!

Can you see if you can solve it that way?

I'm not sure what you're suggesting here. My understanding is that Gatsby uses webpack internally, and it's up to Gatsby to use webpack in the most optimized way.

jacobrienstra commented 4 years ago

Hey @pvdz, thank you so much for helping with these cases!!

I have a site I'd love to get some help with. It's only about 2k pages, and right now I'm only using 100 test pages, but it still takes ~2min build time (locally), sometimes more (on Gatsby Cloud—idk). Here's the repo. The data source is at api.satirev.org, which is a DigitalOcean droplet with the Directus API set up. It's the tiniest size they have; could that have something to do with the slow speed? UPDATE: I tested this with a local ddev setup and had the same result, so I don't think it's network problems. See below:

[Screenshot: build step timings from a local run]

I feel like at this size, I should be able to build pretty lickety splickety on re-runs, but for some reason, running queries and js/css bundling seem to be going real slow.

UPDATE: changed my page queries to eq: $id instead of the $dataID and it sped things up a bit! Though apparently that isn't supposed to be the case anymore.

pvdz commented 4 years ago

@jacobrienstra the breakdown seems to be: 30s webpack (flat cost), 90s images (they are expensive, period), and 30s queries. The rest is small change.

Though apparently that isn't supposed to be the case anymore.

You're kind of right so I checked the history of your repo:

query FullArticle($dataId: Int!) { dataArticle(dataId: { eq: $dataId }) {

Perhaps the Int part is failing to hit the heuristic. I don't know how that gets translated down the line, but the heuristic requires a plain string, number, or boolean as the eq type (these JS types may or may not map directly to graphql types though). Anyways, good to hear switching to ID helped. What is the query time now? I wouldn't expect it to make a huge impact on 100 pages.
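To make the fast-path vs. slow-path distinction concrete, here is a hedged sketch of the two query shapes being discussed (the query text and `pageArgsFor` helper are illustrative assumptions, not copied from the repo):

```javascript
// Slower shape: filter on a custom field, which may miss the fast-path
// heuristic (especially with non-string types like Int).
const byDataId = `
  query FullArticle($dataId: Int!) {
    dataArticle(dataId: { eq: $dataId }) { title }
  }
`;

// Fast path: filter on the Gatsby node id, passed through page context.
const byId = `
  query FullArticle($id: String!) {
    dataArticle(id: { eq: $id }) { title }
  }
`;

// In createPages, pass the node's own id so the template can use `byId`:
function pageArgsFor(node) {
  return {
    path: `/articles/${node.slug}/`,
    component: "./src/templates/article.js",
    context: { id: node.id },
  };
}

module.exports = { byDataId, byId, pageArgsFor };
```

The key point is that `id` is already indexed per node, so an `eq` filter on it can resolve as a direct lookup instead of a scan.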

30s for webpack is still kind of long. Can you try without minification (gatsby build --no-uglify)? If that makes a huge difference then webpack is creating huge JS files (you can see them in ./public afterwards) which may be a source of problems as well.

The images are trickier. Some preprocessing could help, like if they're huge files you can shrink them one time to the target size, that way sharp doesn't have to needlessly process megabytes of imagery every time. In the end, images are expensive. It is what it is.
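A hedged sketch of that one-time preprocessing idea (the 1600px cap, directory names, and use of the `sharp` package are assumptions for illustration):

```javascript
// Shrink source images once, down to the largest size the site will ever
// request, so the build's image pipeline never has to chew through
// multi-megabyte originals on every run.

// Pure helper: clamp width while preserving aspect ratio.
function clampWidth(width, height, maxWidth) {
  if (width <= maxWidth) return { width, height };
  return { width: maxWidth, height: Math.round(height * (maxWidth / width)) };
}

// One-time script (requires `npm i sharp`):
// const fs = require("fs");
// const sharp = require("sharp");
// for (const file of fs.readdirSync("src/images")) {
//   sharp(`src/images/${file}`)
//     .resize({ width: 1600, withoutEnlargement: true })
//     .toFile(`src/images-small/${file}`);
// }

module.exports = { clampWidth };
```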

pvdz commented 4 years ago

@jacobrienstra I'll look into the filter problem. Seems that int! should just be a number by the time it reaches the heuristic, so that should not be a reason to ignore it. One other reason I can think of is a miss, like when dataId doesn't exist or something. Anyways, I'll add it to my todo-list. (Would be great if you could turn this into a fully local repro!) Cheers

jacobrienstra commented 4 years ago

Thank you @pvdz !!! Yeah I'm not sure about the dataId thing. There shouldn't be any misses, the dataId is the id given to them in the database, so it should always be there. Perhaps there's something wonky about graphql typing, idk.

--no-uglify doesn't seem to make a difference, I remember trying that. I am going to try to reduce the number of dependencies and do as much myself as I can, but it does seem a long time to build js. The only big dep I have is material-ui

On a fresh build I got 27s/32s/76s for webpack, queries, and images respectively. On subsequent builds I got 24s/24s for webpack and queries. Which is better!

I don't actually mind the longer image build time, because they persist in the cache and I don't have to do them every time. The goal is to get it so that when a user publishes a new article and triggers a build, it'll be up on the site as quickly as possible. I think since all the images are cached it'll be fine. And I do plan to do preprocessing, I'm migrating everything from an old Drupal site so I can do it then, and I think any new images are preprocessed by Directus at least down to a certain size.

Any other ideas as to how to reduce the queries runtime? I wonder if it's something to do with the source plugin; perhaps it's not batching things it could batch? It seems slow for 100-ish queries, at least compared to the benchmarks I ran.

Oh! And yes I did move it to its own repo lol idk why it was still in that messy one https://github.com/jacobrienstra/satirev.org-gatsby

giupas commented 4 years ago

Hi @pvdz,

I manage a couple of news websites with over 500k pieces of content, and I'm looking to use Gatsby as a staticizer. Right now we're using a custom solution that staticizes a single piece of content via API. This has the advantage that a single page can be online in less than a second, a critical requirement for news websites. On the other hand, changing templates requires a very slow republish of each piece of content.

Gatsby could be a great solution, but rebuilding 500k pages on each new piece of content or template change is not an option. Even the solution for incremental data changes, as described at https://www.gatsbyjs.org/docs/page-build-optimizations-for-incremental-data-changes/, is not viable, as it requires a query over all of the content to check which pieces have changed. On news websites you could have multiple changes in one second; such a query would require a massive data exchange.

If Gatsby had something to generate single pages via API, I think it could be used for news websites, and I could also help with the testing.

Thanks a lot!

mikaelmoller commented 4 years ago

Hi @pvdz - we are building a large enterprise site on Gatsby and are experiencing incredibly long build times. We know that some of the time is due to a big chunk of content, but we are desperate to find the root cause and get it fixed. Do you have the time and are you up for this challenge? :) And what would you need from our side to initially be able to understand our setup and do an analysis?

Any help is highly appreciated! :)

pvdz commented 4 years ago

@giupas @mikaelmoller Hey, thanks for your messages. Sorry for taking so long to respond, it's been a little weird the past two weeks and some github notifications slipped through.

@giupas this is more a question for Cloud or Builds. Somebody will reach out in private about this, I think we can make this work! :)

@mikaelmoller I can triage it. First, what I need is a build output, so I can see which parts require the most time. Then an example of the gatsby-node and a template, to see what kind of queries you're running and how you're passing on data. What kind of site is it? Markdown, MDX, something else? Have you tried the usual suspects? Things like adding a GraphQL schema to prevent type inference, putting as little data in the context as possible, precomputing images, etc? Best would be if I can just look at, or even locally build, the site.

pvdz commented 4 years ago

gatsby@2.20.9 contains https://github.com/gatsbyjs/gatsby/pull/22574 which should improve performance for sites with many nodes that use queries containing multiple eq filters.

Before, this optimization was only applied to queries with a single eq filter. I'm in the process of also adding support for other operators.

gerardoboss commented 4 years ago

I have a 57K-row database in MySQL, but I only manage to create 22k pages. I tried to give Gatsby more memory, but the result is always the same. Do you think there is a limit on how many rows MySQL can return?

xmflsct commented 4 years ago

@pvdz I have a site with far fewer pages using gatsby-source-contentful, but it already crashes Zeit/Vercel. Maybe you want to take a look? #23463

crock commented 4 years ago

> I have a 57K database in Mysql, but I only manage to create 22k pages try to get more memory for gatsby, but is always the same, do you think is a limit with mysql for returning rows?

@gerardoboss I've had issues with gatsby-source-mysql in the past when dealing with very large datasets. It timeouts after a while. The best option is to write a custom source plugin and break up the sql queries into smaller ones.
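The batching idea above could look something like this in a custom source plugin (a hedged sketch: the table name, `query` helper, and `TOTAL_ROWS` are assumptions for illustration, not a real plugin's API):

```javascript
// Instead of one giant SELECT that can time out, page through the table
// in fixed-size chunks and create Gatsby nodes per chunk.

// Pure helper: compute (offset, limit) windows covering `total` rows.
function batchWindows(total, size) {
  const windows = [];
  for (let offset = 0; offset < total; offset += size) {
    windows.push({ offset, limit: Math.min(size, total - offset) });
  }
  return windows;
}

// In a custom source plugin's sourceNodes, this might drive the queries:
// exports.sourceNodes = async ({ actions, createNodeId, createContentDigest }) => {
//   for (const { offset, limit } of batchWindows(TOTAL_ROWS, 10000)) {
//     const rows = await query(
//       "SELECT id, title FROM clips ORDER BY id LIMIT ? OFFSET ?",
//       [limit, offset]
//     );
//     for (const row of rows) {
//       actions.createNode({
//         ...row,
//         id: createNodeId(`clip-${row.id}`),
//         internal: { type: "Clip", contentDigest: createContentDigest(row) },
//       });
//     }
//   }
// };

module.exports = { batchWindows };
```

Ordering by a stable key (`ORDER BY id`) matters here; without it, LIMIT/OFFSET paging can skip or duplicate rows between chunks.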

gerardoboss commented 4 years ago

@crock Thank you so much. I'm thinking of maybe going with a CSV. I tried breaking it into queries of 10K records, but the result is exactly the same. I don't know where to look for the problem; there is no error or log that tells me what is wrong, whether it was an error or a timeout.

So I will try a CSV; I'll probably need to convert it to JSON or something, need to check.

Thank you so much. I'll update if I am able to do it.

pvdz commented 4 years ago

@gerardoboss have you tried to give the nodejs process more memory? You can do something like node --max_old_space_size=4000 node_modules/.bin/gatsby build to bump the memory available to nodejs which you'll need to do for larger sites. How much you need really depends on your setup and is different for every site. Generally for 50k sites I'd expect 2gb to 4gb to be enough. If you have a public repo I can checkout I can take a look.

@xmflsct I see you were able to resolve it, great! :) Fwiw, the contentful plugin adds a lot of internal nodes (the core unit of information inside Gatsby), which results in scaling problems. I've seen sites with 15k pages rack up over a million internal nodes because it was creating a node for each piece of text in Contentful. I have no concrete way forward here, but that's been my observation.

gerardojaras commented 4 years ago

@pvdz it worked flawlessly! Thanks a lot!

[Screenshot: successful build output]

muescha commented 4 years ago

@pvdz Should a note about max_old_space_size be added to the troubleshooting page https://www.gatsbyjs.org/docs/troubleshooting-common-errors/?

pvdz commented 4 years ago

Going to close this issue. Thanks everyone who participated. Your contributions have made a great impact on the perf of Gatsby :D

Feel free to keep posting large sites (public repo, something I can build locally). The ones so far serve as excellent benchmarks.

At this point my definition of a large site is 100k to 1m pages. Although it's more accurate to speak in terms of internal node count, which is around 1 million. You can see the node counts by running gatsby build --verbose. The node counts will be printed during bootstrap. (Page nodes are printed separately shortly after.) A site with 1 million nodes builds in roughly 20 to 60 minutes, depending on sourcing, plugins, and type of website.

So a large site will have a million+ nodes internally and I'm still working on raising that ceiling :)

Be well. Reach out if you need help.

daiky00 commented 4 years ago

@pvdz I need help with my build times. My site is slow and it has just ~1200 pages, but it does contain around 12k images. Can you please help me?

KyleAMathews commented 4 years ago

@daiky00 have you tried Gatsby Cloud btw? It speeds up processing large numbers of images a lot by parallelization across cloud functions and better caching between builds