gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License
55.26k stars 10.31k forks source link

gatsby-transform-json out of memory #34081

Closed joernroeder closed 2 years ago

joernroeder commented 2 years ago

Preliminary Checks

Description

I'm running a 10k+ pages instance on gatsby cloud and in a dedicated branch I upgraded gatsby 4 dependencies for a while. Unfortunately I never got it running due to high memory consumption.

Your Gatsby build's memory consumption exceeded the limits allowed in your plan. 
For more details, see https://gatsby.dev/memory.

Also my local builds on my machine consumed huge amounts of memory until I killed the process.

Screenshot 2021-11-25 at 17 32 36

Long story short, today I had some time to dig into the issue, commented out all my plugins and isolated the issue. The source of the issue is this line https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-transformer-json/src/gatsby-node.js#L63

A .forEach loop over the array items of a json file to create the individual json nodes. Whats the issue you may ask?

Gatsbys createNode function which is used by transformObject returns a

Promise [which] resolves when all cascading onCreateNode API calls triggered by createNode have finished.

whereas .forEach

Note: forEach expects a synchronous function.

forEach does not wait for promises. Make sure you are aware of the implications while using promises (or async functions) as forEach callback.

The result is that the loop kicks off all 10k transformations right away instead of in some serial/concurrent loop which the code suggests.

The fix for that seems quite simple:

  if (_.isArray(parsedContent)) {
    for (let i = 0, l = parsedContent.length; i < l; i++) {
      const obj = parsedContent[i];

      await transformObject(
        obj,
        createNodeId(`${node.id} [${i}] >>> JSON`),
        getType({ node, object: obj, isArray: true })
      )
    }
  }

I feel you could get fancy here and use something like .eachLimit to add some concurrency and speed it up again.

Reproduction Link

https://github.com/joernroeder/gatsby-json-memory

Steps to Reproduce

  1. clone the reproduction link and rungatsby start
  2. Watch the memory consumption

Expected Result

no multi GB memory usage :)

Actual Result

broken builds

Environment

Binaries:
    Node: 14.16.1
    Yarn: 1.22.17
    npm: 7.16.0 - ~/.nvm/versions/node/v14.16.1/bin/npm
  Languages:
    Python: 2.7.16 - /usr/bin/python
  npmPackages:
    gatsby: @next => 4.3.0-next.1 
    gatsby-source-filesystem: ^4.2.0 => 4.2.0 
    gatsby-transformer-json: ^4.2.0 => 4.2.0

Config Flags

No response

LekoArts commented 2 years ago

Hi, thanks for the issue! As you already pointed out the issue might be related to https://github.com/gatsbyjs/gatsby/issues/33868 and as my colleague also wrote there it might be a good idea to try streaming instead. It's not something we can prioritize at the moment but we'd be happy to review a PR with this. Thanks!

joernroeder commented 2 years ago

@LekoArts streaming might be the ultimate solution but I feel converting the loop to respect the async nature of createNode might be a good start and doesn't seem to be to difficult (the code above is all we need) and would fix at least this issue here. I would open a PR to get the fix out (and me to gatsby 4) if you don't mind.

joernroeder commented 2 years ago

Also just to make it clear, the same amount of nodes (same json file) is currently transformed in the gatsby 3 setup successfully which indicates that something might have changed upstream and causes the issue now.

KyleAMathews commented 2 years ago

What changed with v4 is we now write data to LMDB which is more costly than writing to memory. The sync forEach loop doesn't allow LMDB to flush data to disk meaning in progress changes accumulate. Switching to async solves as does using e.g. process.nextTick to let the event loop run through again.

joernroeder commented 2 years ago

@KyleAMathews ah that makes sense, didn't know LMDB is part of gatsby 4 by default – that's awesome!