Closed — joernroeder closed this issue 2 years ago
Hi, thanks for the issue! As you already pointed out, the issue might be related to https://github.com/gatsbyjs/gatsby/issues/33868 and, as my colleague also wrote there, it might be a good idea to try streaming instead. It's not something we can prioritize at the moment, but we'd be happy to review a PR for this. Thanks!
@LekoArts streaming might be the ultimate solution, but I feel converting the loop to respect the async nature of `createNode` would be a good start. It doesn't seem to be too difficult (the code above is all we need) and would fix at least this issue here. I would open a PR to get the fix out (and get me to Gatsby 4) if you don't mind.
Also, just to make it clear: the same number of nodes (same JSON file) is currently transformed successfully in the Gatsby 3 setup, which indicates that something changed upstream and is now causing the issue.
What changed with v4 is that we now write data to LMDB, which is more costly than writing to memory. The sync forEach loop doesn't allow LMDB to flush data to disk, meaning in-progress changes accumulate. Switching to async solves this, as does using e.g. `process.nextTick` to let the event loop run through again.
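To illustrate the point, here is a minimal sketch (not Gatsby source, names are illustrative) of yielding to the event loop between iterations of a long loop so that queued background work, such as disk flushes, gets a chance to run. It uses `setImmediate`, which, unlike `process.nextTick`, lets pending I/O callbacks run before the loop continues:

```javascript
// Awaiting setImmediate between iterations lets the event loop run a full
// turn, so queued I/O work (e.g. LMDB flushing writes) can make progress.
function yieldToEventLoop() {
  return new Promise(resolve => setImmediate(resolve));
}

// Hypothetical helper: process items one by one, yielding after each so
// in-progress changes don't pile up the way they do in a sync forEach loop.
async function processAll(items, handleItem) {
  const results = [];
  for (const item of items) {
    results.push(handleItem(item));
    await yieldToEventLoop();
  }
  return results;
}
```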
@KyleAMathews ah, that makes sense. I didn't know LMDB is part of Gatsby 4 by default – that's awesome!
Preliminary Checks
Description
I'm running a 10k+ pages instance on Gatsby Cloud, and in a dedicated branch I've been upgrading to the Gatsby 4 dependencies for a while. Unfortunately I never got it running due to high memory consumption.
My local builds also consumed huge amounts of memory until I killed the process.
Long story short: today I had some time to dig into this, commented out all my plugins, and isolated the issue. The source of the problem is this line: https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-transformer-json/src/gatsby-node.js#L63
A `.forEach` loop over the array items of a JSON file creates the individual JSON nodes. What's the issue, you may ask? Gatsby's `createNode` function, which is used by `transformObject`, returns a Promise, whereas `.forEach` doesn't wait for it.
The result is that the loop kicks off all 10k transformations right away instead of running them in the serial/concurrent fashion the code suggests.
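A minimal stand-alone demonstration of the problem (the `work` function is a hypothetical stand-in for a node transformation, not the plugin's code): `.forEach` drops the Promise returned by each call, so every transformation starts immediately and none has finished by the time the loop ends.

```javascript
// A stand-in for an async transformation: logs when it starts and ends.
async function work(i, log) {
  log.push(`start ${i}`);
  await new Promise(resolve => setImmediate(resolve)); // simulated async work
  log.push(`end ${i}`);
}

async function runWithForEach(n) {
  const log = [];
  // .forEach ignores the returned Promises, so all n calls start at once.
  [...Array(n).keys()].forEach(i => { work(i, log); });
  // At this point every `start` is logged, but no `end` yet.
  const snapshot = [...log];
  await new Promise(resolve => setImmediate(resolve));
  return { snapshot, final: log };
}
```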
The fix for that seems quite simple:

- declare `transformObject` as an `async` function
- `await` the `transformObject` call

I feel you could get fancy here and use something like `.eachLimit` to add some concurrency and speed it up again.
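A sketch of the proposed change, assuming a simplified shape of the plugin (the real `transformObject`/`createNode` signatures in gatsby-transformer-json differ; the stand-ins here only model the async behavior). The key difference is an `async` function plus `await` in a plain `for` loop instead of a fire-and-forget `.forEach`:

```javascript
// Stand-in for Gatsby's createNode action, which returns a Promise
// (in Gatsby 4 the write goes to LMDB rather than in-memory state).
async function createNode(node) {
  await new Promise(resolve => setImmediate(resolve)); // simulated persistence
  return node;
}

// Declared async so callers can await the createNode call it makes.
async function transformObject(obj, id) {
  const jsonNode = { ...obj, id, internal: { type: 'Json' } };
  await createNode(jsonNode); // previously this Promise was dropped
  return jsonNode;
}

async function transformAll(parsedContent) {
  const created = [];
  // Before: parsedContent.forEach(obj => transformObject(...)) kicked off
  // every transformation at once. Awaiting each call keeps memory bounded.
  for (let i = 0; i < parsedContent.length; i++) {
    created.push(await transformObject(parsedContent[i], `node-${i}`));
  }
  return created;
}
```

As noted above, a bounded-concurrency helper like async's `.eachLimit` could restore some parallelism without reverting to the unbounded fan-out.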
Reproduction Link
https://github.com/joernroeder/gatsby-json-memory
Steps to Reproduce
gatsby start
Expected Result
no multi-GB memory usage :)
Actual Result
broken builds
Environment
Config Flags
No response