gatsbyjs / gatsby

The best React-based framework with performance, scalability and security built in.
https://www.gatsbyjs.com
MIT License

[gatsby-source-wordpress] Large WordPress site causing extremely slow build time (stuck at 'source and transform nodes') #6654

Closed dustinhorton closed 4 years ago

dustinhorton commented 6 years ago

Description

gatsby develop hangs on source and transform nodes after querying a large WordPress installation (~9000 posts, ~35 pages).

Are there any guides as to what's too big for Gatsby to handle in this regard?

Environment

  System:
    OS: macOS High Sierra 10.13.6
    CPU: x64 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 8.10.0 - ~/n/bin/node
    Yarn: 1.5.1 - ~/n/bin/yarn
    npm: 5.6.0 - ~/n/bin/npm
  Browsers:
    Chrome: 67.0.3396.99
    Safari: 11.1.2
  npmPackages:
    gatsby: ^1.9.273 => 1.9.273
    gatsby-image: ^1.0.54 => 1.0.54
    gatsby-link: ^1.6.45 => 1.6.45
    gatsby-plugin-google-analytics: ^1.0.27 => 1.0.31
    gatsby-plugin-postcss-sass: ^1.0.22 => 1.0.22
    gatsby-plugin-react-helmet: ^2.0.10 => 2.0.11
    gatsby-plugin-react-next: ^1.0.11 => 1.0.11
    gatsby-plugin-resolve-src: 1.1.3 => 1.1.3
    gatsby-plugin-sharp: ^1.6.48 => 1.6.48
    gatsby-plugin-svgr: ^1.0.1 => 1.0.1
    gatsby-source-filesystem: ^1.5.39 => 1.5.39
    gatsby-source-wordpress: ^2.0.93 => 2.0.93
    gatsby-transformer-sharp: ^1.6.27 => 1.6.27
  npmGlobalPackages:
    gatsby-cli: 1.1.58

edit: Just want to reiterate: this is not something easily fixable by deleting .cache/, node_modules, etc. If that resolves your problem, you weren't experiencing this issue.

KyleAMathews commented 6 years ago

No just the file downloading and caching part. createRemoteFileNode would then just call this package and get back a promise that'd resolve when the file was downloaded (or returned from the cache).

mpartipilo commented 6 years ago

I'm having this problem with my own cockpit source plugin as well.

njmyers commented 6 years ago

I see so it would really be more like extracting these portions of code to a separate package...

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L125-L244

This seems to be the code that deals specifically with downloading and caching; please correct me if I'm wrong. Happy to work on this! Just trying to figure out how it works in the greater ecosystem.

dustinhorton commented 5 years ago

Would a PR to only fix gatsby-source-wordpress be accepted, then extract the fix afterwards? Having trouble using @njmyers forked plugin as-is, and it seems like it's a huge improvement.

njmyers commented 5 years ago

@dustinhorton not sure if this helps any but I found that if you want to use a local plugin it's best to point gatsby directly to package.json file. I was having trouble getting gatsby to find my local plugin until I started specifying it explicitly.

https://github.com/njmyers/byalejandradesign.com/blob/d56b1938f6d1bc22c3cf2282bb3198e378fe3561/packages/web/gatsby-config.js#L91-L94

I'm still happy to work on this issue and even the new plugin as discussed. Just looking for a little guidance on how to integrate this as it seems like a disruptive change that could impact many other things that I am not aware of. @KyleAMathews any thoughts? I still feel as though the code here

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L125-L244

is the part that should be extracted out into its own package. That being said, it is one of the core functions of createRemoteFileNode and I want to make sure I go about it correctly so it can be integrated back into the ecosystem properly.

pieh commented 5 years ago

@njmyers You are mostly correct with your code selection - we would also want our current queue (which ATM is limited to 200 concurrent requests, which seems not great for local dev and apparently for wordpress) moved and probably changed.

@dustinhorton I think it's reasonable to use this in wordpress plugin first (mostly because it's practically done).

njmyers commented 5 years ago

@pieh Great thanks for your clarification! I'll start working on a new plugin.

Regarding a temporary wordpress-source fix my only other question would be what to do here

https://github.com/njmyers/byalejandradesign.com/blob/d56b1938f6d1bc22c3cf2282bb3198e378fe3561/packages/gatsby-source-wordpress/src/download-media-files.js#L169-L173

At the moment it would still be possible to have network errors and there needs to be a catch clause for the whole downloadMediaFiles function. What is the normal behavior for passing errors to gatsby? I would be happy to add that code into the wordpress plugin to properly pass the network errors up to the correct handler. Maybe we could display an error message and a reference to this issue? Thanks for your assistance!
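
One way to keep a single failed download from rejecting the whole batch is to catch each download's error individually and collect the failures for reporting at the end. The following is only a minimal sketch of that idea: `downloadAllSettled` and `downloadOne` are hypothetical names, not part of gatsby-source-wordpress, and `downloadOne` stands in for whatever wraps createRemoteFileNode.

```javascript
// Hypothetical helper: run one download per item and never reject as a whole.
// `downloadOne` is an assumed function that returns a promise for one item.
async function downloadAllSettled(items, downloadOne) {
  const results = await Promise.all(
    items.map(item =>
      Promise.resolve()
        .then(() => downloadOne(item))
        .then(
          value => ({ item, ok: true, value }),
          error => ({ item, ok: false, error })
        )
    )
  );
  return {
    // Values of the downloads that completed, in input order.
    succeeded: results.filter(r => r.ok).map(r => r.value),
    // Entries of the form { item, ok: false, error } for the ones that failed.
    failed: results.filter(r => !r.ok),
  };
}
```

The `failed` list could then be logged as warnings, with a pointer to this issue, instead of surfacing as an unhandled promise rejection. Note that this sketch still starts every download at once; it only changes how errors are handled.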

dustinhorton commented 5 years ago

@njmyers Thanks, yeah I was replicating your setup as closely as possible, aside from it being a monorepo (including referencing package.json). Running develop just gave errors as if there were no gatsby-source-wordpress. I'll give it another go here shortly.

dustinhorton commented 5 years ago

More faithfully recreated your monorepo, and oddly it's just sitting at source and transform nodes, like it was with the non-forked gatsby-source-wordpress before downgrading the got dependency.

@pieh Able to answer his inquiry @ https://github.com/gatsbyjs/gatsby/issues/6654#issuecomment-442536931 ?

njmyers commented 5 years ago

@dustinhorton Yes it should be sitting there for quite some time too if you have a lot of images. My fork will throw unhandled promise rejection if a remote file fails to download. That is why I would like to be able to have some mechanism to properly handle this scenario.

I think I read on another thread as well that there was talk of integrating some sort of progress manager as well since this would provide feedback about plugin status.

If you look in your OS file system under project-root/.cache/gatsby-source-filesystem you should be able to see all the images that are being downloaded. In my case it is almost 400 images now so it does take quite some time. However before using my forked version the plugin would silently fail on an error and then never progress causing the issue where source and transform would take for hours...

Do you have a repo? I would love to be able to try it on another site as so far I have only tested it in a real life situation on my site.

dustinhorton commented 5 years ago

@njmyers That'd rule—if you don't mind, shoot me an email: d@dustinhorton.com, or just look out for an invite. I'll get something prepped this evening.

LekoArts commented 5 years ago

Updating got solved all issues for me, too.

pieh commented 5 years ago

The problem with got@9 is that it requires Node 8 (https://github.com/sindresorhus/got/releases/tag/v9.0.0), so we can't upgrade ATM :(

We should be able to upgrade at least to got@8, but I'm not sure if this will fix the issues

edgarnansen commented 5 years ago

got@8 seems to implement RFC 7234 compliant HTTP caching, so gatsby-source-filesystem could supply its own file system cache adapter. That should at least reduce the time spent in source and transform nodes the second time around, provided the resource is cacheable.
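
To make the cache adapter idea concrete, here is a rough sketch of a Map-like store that keeps each entry in its own file. It rests on assumptions: that got's `cache` option accepts any object exposing `get`/`set`/`delete`/`clear`, and that it hands the store string values. `FileCache` is a hypothetical name and nothing like it exists in gatsby-source-filesystem.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical Map-like store backed by one file per key.
// Assumption: the HTTP client only calls get/set/delete/clear on it
// and passes string values.
class FileCache {
  constructor(dir) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
  }

  // Keys are typically URLs, so hash them into safe file names.
  fileFor(key) {
    const name = crypto.createHash('sha1').update(String(key)).digest('hex');
    return path.join(this.dir, name);
  }

  get(key) {
    const file = this.fileFor(key);
    return fs.existsSync(file) ? fs.readFileSync(file, 'utf8') : undefined;
  }

  set(key, value) {
    fs.writeFileSync(this.fileFor(key), String(value));
    return this;
  }

  delete(key) {
    const file = this.fileFor(key);
    if (!fs.existsSync(file)) return false;
    fs.unlinkSync(file);
    return true;
  }

  clear() {
    for (const name of fs.readdirSync(this.dir)) {
      fs.unlinkSync(path.join(this.dir, name));
    }
  }
}
```

If the assumption about the `cache` option holds, an instance would be passed along the lines of `got(url, { cache: new FileCache(someDir) })`. Whether this helps in practice depends on the WordPress host sending cacheable response headers.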

gatsbot[bot] commented 5 years ago

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

Thanks for being a part of the Gatsby community! 💪💜

dustinhorton commented 5 years ago

@gatsbot Still an issue.

twhite96 commented 5 years ago

Was asked to contribute a blog post for y'all. Can't do it as it is stuck on source and transform nodes. Saw the other issue, but I am not seeing where there is a fix for this. It is a fork of gatsbyjs, latest upstream. I only got this to run once. It is always stuck transforming nodes.

twhite96 commented 5 years ago

It's failing to grab screenshots from a few sites while building. I'll add the offending sites in the morning.

c0d3d commented 5 years ago

@twhite96 I just ran into the issue and what worked for me was removing temporary files that I still had open (from emacs), not sure if that will help you or not, but it allowed my build to move forward.

johngrasty commented 5 years ago

So it's looking like this is still a problem…

tito300 commented 5 years ago

Facing the same problem when using gatsby-source-s3 to pull 100 photos and transform them through sharp. Has anyone figured out a fix?

tito300 commented 5 years ago

Somehow my problem was fixed (randomly?). These are the steps I took: I created a new s3 bucket with fewer pictures (for testing), then tried building, and it built successfully very quickly. Then I decided to go back and try to pull from the original bucket, and now all of a sudden it built successfully in 49s when originally it would go on for hours. I don't know how the mere switch in bucket links fixed the stall, but I hope this helps someone figure it out. Maybe it has to do with the cache?

njmyers commented 5 years ago

Hi All. I updated my local plugin version that I was using for a site that had this issue. I think it’s a better implementation as it uses ‘better-queue’ before ‘createRemoteFileNode’ and passes in the ‘concurrentRequests’ parameter. It’s a little bit redundant as ‘createRemoteFileNode’ already uses a queue, but regardless this version seems to be working well with the recent gatsby upgrades and gives feedback on the progress of the files. I will try to get a PR together for this. Sorry for the delays, I know I said I would work on this earlier but I have been quite busy!

https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js

johngrasty commented 5 years ago

@njmyers

Thanks so much. Your version solved some problems that I was having. I combined that with a line or two to filter out downloading 25 GB of mp3s, and I am now set!
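
The exact lines used to skip the mp3s are not shown in the thread. As a rough illustration of the idea only, a filter on the media URL's extension could look like the sketch below; `shouldDownload` and the extension list are assumptions, not code from the fork.

```javascript
// Hypothetical filter: decide from the URL's extension whether to download.
// The list of skipped extensions is an assumption for illustration.
const SKIPPED_EXTENSIONS = ['.mp3', '.wav', '.mp4'];

function shouldDownload(sourceUrl, skipped = SKIPPED_EXTENSIONS) {
  // Drop any query string or fragment before looking at the extension.
  const pathname = sourceUrl.split(/[?#]/)[0].toLowerCase();
  return !skipped.some(ext => pathname.endsWith(ext));
}
```

A media list would then be narrowed with something like `urls.filter(url => shouldDownload(url))` before any download is queued.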

lucassilvagc commented 5 years ago

Definitely still an issue. I've been trying to compile my project for the last 24 hours. From approximately 12 tries, 3 succeeded with outputs and actual WP connection. Is there any fix to this? BTW, I've tried to use @njmyers version of the plugin (awesome job, actually!), but results were mixed. Sometimes it would complain about wordpress_parent or Date and eventually crash, but couldn't figure out what's actually going on with these errors. In other builds, different errors (but they do compile, which is interesting), which actually causes issues on GraphQL.

njmyers commented 5 years ago

@lucassilvagc can you post some outputs? I’m glad people are trying and testing the branch. Let’s get it working better so we can open the PR!

lucassilvagc commented 5 years ago

@njmyers Sure!

A quick overview of what's going on:

My website currently runs with ~1940 image files, maybe WordPress's fault by creating multiple image files multiple times. If I use a vanilla gatsby-source-wordpress, the issue appears as expected (there's a "vanilla" build I made yesterday evening on another build environment, which shows the same issue we're discussing here. This build works and compiles when all the image files are returned). By using your plugin (replacing all the files inside node_modules/gatsby-source-wordpress (correct me if I'm wrong on this)), gatsby develop returns me the following:

TypeError: Cannot read property 'wordpress_parent' of undefined

  - normalize.js:287 entities.map.e
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:287:11

  - Array.map

  - normalize.js:286 Object.exports.mapElementsToParent.entities [as mapElementsToParent]
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:286:12

  - gatsby-node.js:134 Object.exports.sourceNodes
    [amazingtec]/[gatsby-source-wordpress]/gatsby-node.js:134:24

warning The gatsby-source-wordpress plugin has generated no Gatsby nodes. Do you need it?
success source and transform nodes — 299.757 s
success building schema — 10.192 s

After a quick while, it outputs:

'Cannot query field "allWordpressPage" on type "Query". Did you mean "allSitePage"?',
    locations: [ [Object] ] } ]
error UNHANDLED REJECTION

  TypeError: Cannot read property 'allWordpressPage' of undefined

  - gatsby-node.js:54 graphql.then.result
    C:/Projects/amztec-gtby/amazingtec/gatsby-node.js:54:36

PS: this was a vanilla build of gatsby-source-wordpress that was "converted" to yours by replacing the files, as I said above. I think the fact that it can't query all the pages is related to no nodes being generated. Also want to note that this build is identical to my vanilla one, which works when this issue doesn't appear.

Also want to note that adding routes appears to cause the same initial problem for me (I wanted to exclude pages that aren't related or that will return errors due to multiple protection layers over WordPress). I just don't know if the routes I've listed are correct, or if I'm missing something.

I'm very happy with your reply, this issue is currently being a huge setback to my project and I'm glad that you're still up on this issue. Thanks a lot!

MWalid commented 5 years ago

Having the same issue with 400+ custom posts with ACF fields and 4000 images.

MWalid commented 5 years ago

I updated got and was able to build in 35 minutes

MWalid commented 5 years ago

Unable to build again after I updated got

lucassilvagc commented 5 years ago

As expected, since this bug still exists in gatsby-wordpress. 35 minutes to download and process all the images keeps being a very long time considering all the factors (avg internet speed, processing power, total of files and so on). You can try adapting @njmyers version to your specific use, it'll work like a charm on downloading every image file you have.

humbertqz commented 5 years ago

> As expected, since this bug still exists in gatsby-wordpress. 35 minutes to download and process all the images keeps being a very long time considering all the factors (avg internet speed, processing power, total of files and so on). You can try adapting @njmyers version to your specific use, it'll work like a charm on downloading every image file you have.

My site was working fine when I had a small number of images, but when I started adding more this also happened.

@MWalid how can I update got? Thanks.

nratter commented 5 years ago

been trying to build all day with no success. have around 1450 images.

nratter commented 5 years ago

We haven't been able to deploy for 2 days now. Can someone help point me in the right direction as to where this is occurring in the code so I can try and find a solution?

anagstef commented 5 years ago

> We haven't been able to deploy for 2 days now. Can someone help point me in the right direction as to where this is occurring in the code so I can try and find a solution?

Have you upgraded your got nested dependency of the gatsby-source-filesystem to use at least version 9.4.0?

If not, you should add:

  "resolutions": {
    "gatsby-source-filesystem/got": "9.4.0"
  }

in your Gatsby project's package.json. Then remove node_modules and your yarn.lock file and install again.

Note: This resolutions feature only works for yarn. npm has not implemented this yet.
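
After reinstalling, it can be worth confirming which got version gatsby-source-filesystem actually resolves. The sketch below is one possible check, run from the project root; `nestedVersion` is a hypothetical helper, and it returns null if either package cannot be found.

```javascript
const fs = require('fs');
const path = require('path');

// Return the version of `dep` that `parent` would load, or null if either
// package cannot be resolved starting from `fromDir`.
function nestedVersion(parent, dep, fromDir = process.cwd()) {
  try {
    const parentPkg = require.resolve(`${parent}/package.json`, {
      paths: [fromDir],
    });
    // Resolve the dependency from the parent's own directory so a nested
    // node_modules copy wins over a hoisted one, as it would at runtime.
    const depPkg = require.resolve(`${dep}/package.json`, {
      paths: [path.dirname(parentPkg)],
    });
    return JSON.parse(fs.readFileSync(depPkg, 'utf8')).version;
  } catch (err) {
    return null;
  }
}

console.log(nestedVersion('gatsby-source-filesystem', 'got'));
```

If the printed version is still older than the one in "resolutions", the lockfile was probably not regenerated.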

nratter commented 5 years ago

@anagstef thanks very much for the tip! I'll try this and report back.

renoke commented 5 years ago

When running gatsby develop, is there a way to keep the local cache instead of fetching remote data each time the command is launched?

nratter commented 5 years ago

@anagstef looks to be working much better! Thanks for the tip!

The output is very verbose when building with this version of got. Do you know if there's any way to remove this?

anagstef commented 5 years ago

@nratter I'm glad it worked for you!

Yes, I know that, it is very verbose and it cannot be turned off. Ruins all the useful console output.

After some investigation I have done, I think it is caused here: https://github.com/gatsbyjs/gatsby/blob/80c7023a8bc23886939205fe52e305277294e6af/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L155

As you can see it calls a console.log with the progress of the download of each file every time the downloadProgress event emits which happens too many times per second. This was not a problem before, because the old got version does not implement the downloadProgress event.
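
If the progress output is worth keeping at all, one option short of deleting the console.log is to throttle it. This is only a sketch of that idea, not what the plugin does: `throttle` is a hypothetical helper, and the `now` parameter exists purely so the behavior can be tested without waiting.

```javascript
// Wrap a logger so it runs at most once per `intervalMs`.
// Calls that arrive too soon after the last logged one are dropped.
function throttle(log, intervalMs, now = Date.now) {
  let last = -Infinity;
  return (...args) => {
    const t = now();
    if (t - last >= intervalMs) {
      last = t;
      log(...args);
    }
  };
}
```

The handler passed to the downloadProgress listener would then be wrapped, for example `throttle(progress => console.log(progress), 1000)`, so each file logs at most once per second.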

Maybe we can fix it with a PR? Looks like debugging leftover code.

fedort commented 5 years ago

I had the same issue, stuck on "source and transform nodes". After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn't the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.

I bypassed cloudflare, went straight to the vps and I'm no longer seeing "source and transform nodes", and my build finishes.
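
Where bypassing the proxy is not an option, retrying a timed-out request after a pause is a different technique that may help with rate limiting. It is not what was done above; the following is a minimal sketch with a hypothetical `withRetry` helper and arbitrary default values.

```javascript
// Retry an async function with exponential backoff between attempts.
// `fn` receives the zero-based attempt number.
async function withRetry(fn, { retries = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Wait 1x, 2x, 4x ... the base delay before trying again.
        await new Promise(resolve =>
          setTimeout(resolve, delayMs * 2 ** attempt)
        );
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Each media download would be wrapped as `withRetry(() => download(url))`, where `download` is whatever function performs the request.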

humbertqz commented 5 years ago

My workaround was to have a local wordpress for testing, the live site is in netlify, while deploying it did not cause any issue.

ancashoria commented 5 years ago

Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.

Here's the function I'm using:

/**
 * Map over items array using the fn function but wait for each step to finish before moving to the next one
 */
exports.serialMap = async (items, fn) => {
  const results = []
  for (const item of items) {
    const result = await fn(item)
    results.push(result)
  }
  return results
}

and here's how I'm using it:

const imageNodes = await serialMap(node.___originalImages, imgUrl => {
  return createRemoteFileNode({
    url: imgUrl,
    parentNodeId: node.id,
    store,
    cache,
    createNode,
    createNodeId,
  })
})

After the images are downloaded, here's how my source and transform step looks

Downloading remote files [==============================] 1063/1063 156.1 secs 100%
Downloading remote files [==============================] 1064/1064 157.2 secs 100%
Downloading remote files [==============================] 1065/1065 158.4 secs 100%
Downloading remote files [==============================] 1066/1066 159.5 secs 100%
Downloading remote files [==============================] 1067/1067 160.5 secs 100%
Downloading remote files [==============================] 1068/1068 161.5 secs 100%
Downloading remote files [==============================] 1069/1069 162.6 secs 100%
Downloading remote files [==============================] 1070/1070 163.7 secs 100%
Downloading remote files [==============================] 1071/1071 164.9 secs 100%
Downloading remote files [==============================] 1072/1072 166.0 secs 100%
Downloading remote files [==============================] 1073/1073 167.5 secs 100%
Downloading remote files [==============================] 1074/1074 169.2 secs 100%
Downloading remote files [==============================] 1075/1075 171.0 secs 100%
success source and transform nodes — 175.271 s

Hope it solves your problems too. Cheers

IftekherSunny commented 5 years ago

@ancashoria where should I put this code?

jacobsilver2 commented 5 years ago

@ancashoria yes, I'm also unclear on where to place this code.

ancashoria commented 5 years ago

This is somewhat unrelated to the gatsby-source-wordpress plugin. I have the code above in my gatsby-node.js. The idea is that firing all those requests in parallel caused them to fail, so I wrote that helper function to fire them one after another.

I'm guessing there's a similar issue in gatsby-source-wordpress too, but I'm not that familiar with it. Sorry I can't be of more assistance.
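
A middle ground between running one request at a time and firing them all at once is to cap how many are in flight. The sketch below is a variation on serialMap under that assumption; `mapWithLimit` is a hypothetical name, and `limit` is expected to be at least 1.

```javascript
// Like serialMap, but keeps up to `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

It is called like serialMap with one extra argument, for example `mapWithLimit(node.___originalImages, 5, imgUrl => createRemoteFileNode({ ... }))`. With a limit of 1 it behaves like the serial version.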

tombunn commented 5 years ago

It seems to be related to massive images and slow internet connections. Netlify was able to build the site but my local connection was not as it is only 1MB/s download which caused it to timeout after 30s and fail on the large image.

dustinhorton commented 5 years ago

I have 1gb fiber and no 'massive' images.

nratter commented 5 years ago

I am not transforming blog images locally after downloading them from WordPress; I simply use their URL. It would be nice if there was a setting that disables the downloading of these images in this case.

njmyers commented 5 years ago

> Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.

Yeah the issue is really based on the fact that createRemoteFileNode uses concurrency of 200 which is too much for most WP servers. I have my images on CloudFront and was hitting some rate limits there.

I tried fixing the issue with a branched version of the source-plugin for a while but the issue really isn't in gatsby-source-wordpress it is in gatsby-source-filesystem. Ideally consumers of the createRemoteFileNode function would be able to pass in concurrency there. Then plugins could make the concurrency option available in their configs. I still would like to do a PR to address this issue!

The solution I have been using is just a simple script to modify the code inside node_modules. Really quite fragile and not ideal but it is a simple hack to modify the concurrency directly. Uses shelljs so it is supposed to work for windows users as well (haven't tried).

#!/usr/bin/env node
const path = require('path');
const shell = require('shelljs');

const FILE_PATH = path.resolve(
  __dirname,
  // add path to your root dir here,
  'node_modules',
  'gatsby-source-filesystem/create-remote-file-node.js'
);

shell.sed('-i', 'concurrent: 200', 'concurrent: 20', FILE_PATH);

amcc commented 5 years ago

> I had the same issue, stuck on "source and transform nodes". After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn't the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.
>
> I bypassed cloudflare, went straight to the vps and I'm no longer seeing "source and transform nodes", and my build finishes.

This was exactly my issue. Netlify was building very fast, in less than 2 minutes, with only about 30 posts and around 500 source images. Locally the build wasn't ever completing; simply unticking the CloudFlare status so that it is DNS only solved the issue immediately.