gatsbyjs / gatsby

[gatsby-source-wordpress] Large WordPress site causing extremely slow build time (stuck at 'source and transform nodes') #6654

Closed dustinhorton closed 4 years ago

dustinhorton commented 6 years ago

Description

gatsby develop hangs on source and transform nodes after querying a large WordPress installation (~9000 posts, ~35 pages).

Are there any guidelines on what's too big for Gatsby to handle in this regard?

Environment

  System:
    OS: macOS High Sierra 10.13.6
    CPU: x64 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 8.10.0 - ~/n/bin/node
    Yarn: 1.5.1 - ~/n/bin/yarn
    npm: 5.6.0 - ~/n/bin/npm
  Browsers:
    Chrome: 67.0.3396.99
    Safari: 11.1.2
  npmPackages:
    gatsby: ^1.9.273 => 1.9.273
    gatsby-image: ^1.0.54 => 1.0.54
    gatsby-link: ^1.6.45 => 1.6.45
    gatsby-plugin-google-analytics: ^1.0.27 => 1.0.31
    gatsby-plugin-postcss-sass: ^1.0.22 => 1.0.22
    gatsby-plugin-react-helmet: ^2.0.10 => 2.0.11
    gatsby-plugin-react-next: ^1.0.11 => 1.0.11
    gatsby-plugin-resolve-src: 1.1.3 => 1.1.3
    gatsby-plugin-sharp: ^1.6.48 => 1.6.48
    gatsby-plugin-svgr: ^1.0.1 => 1.0.1
    gatsby-source-filesystem: ^1.5.39 => 1.5.39
    gatsby-source-wordpress: ^2.0.93 => 2.0.93
    gatsby-transformer-sharp: ^1.6.27 => 1.6.27
  npmGlobalPackages:
    gatsby-cli: 1.1.58

edit: Just want to reiterate: this is not something easily fixed by deleting .cache/, node_modules/, etc. If that resolves your problem, you weren't experiencing this issue.

pieh commented 6 years ago

Can you prepare a reproduction repo? The number of posts shouldn't be a problem (at least at this step). v1 might run into memory problems, but that would be in a later build step and it shouldn't get stuck.

dustinhorton commented 6 years ago

Was curious whether it was an issue with Local by Flywheel, and I was able to build the site when serving WordPress via MAMP Pro.

But, I'm not even building post pages yet (am building the pages), and the execution time for that problematic step is 636.41s (just shy of 11 minutes).

const path = require('path')

exports.createPages = ({ boundActionCreators, graphql }) => {
  const { createPage } = boundActionCreators

  const postTemplate = path.resolve('./src/templates/Post/Post.js')

  // Return the promise so Gatsby waits for the query before moving on
  return graphql(
    `
      {
        allWordpressPost {
          edges {
            node {
              id
              slug
            }
          }
        }
      }
    `
  )
    .then((result) => {
      console.log('posts')
      // const { data, errors } = result

      // if (errors) console.log(errors)

      // if (!data) return

      //data.allWordpressPost.edges.forEach(({ node }) => {
      //  const { id, slug } = node

      //  createPage({
      //    component: postTemplate,
      //    context: {
      //      id,
      //    },
      //    path: slug,
      //  })
      //})
    })
}

edit: just enabled createPage for posts and execution of that item rose to 14 minutes. Brutal, but also interesting that it's only 3 minutes longer for ~9000 more items. It's been sitting on ⠁ run graphql queries for a long time now.

edit: that ran for 419.470 s, or 7 minutes.

dustinhorton commented 6 years ago

@pieh Whoops, posted that before I saw you'd just replied. I can try to get this site up remotely tomorrow.

dustinhorton commented 6 years ago

And meant to include, this last line is where it hangs via Local, and takes forever via MAMP.

$ gatsby develop
success delete html and css files from previous builds — 0.017 s
success open and validate gatsby-config — 0.226 s
info One or more of your plugins have changed since the last time you ran Gatsby. As
a precaution, we're deleting your site's cache to ensure there's not any stale
data
success copy gatsby files — 0.013 s
success onPreBootstrap — 0.159 s
⠁ source and transform nodes -> wordpress__acf_posts fetched : 100
⠁ source and transform nodes -> wordpress__acf_pages fetched : 34
⠂ source and transform nodes -> wordpress__acf_media fetched : 100
⠈ source and transform nodes -> wordpress__acf_categories fetched : 13
⢀ source and transform nodes -> wordpress__acf_tags fetched : 0
⠄ source and transform nodes -> wordpress__acf_users fetched : 11
⢀ source and transform nodes -> wordpress__POST fetched : 9092
⢀ source and transform nodes -> wordpress__PAGE fetched : 34
⠐ source and transform nodes -> wordpress__wp_media fetched : 7483
⡀ source and transform nodes -> wordpress__wp_types fetched : 1
⠁ source and transform nodes -> wordpress__wp_statuses fetched : 1
⢀ source and transform nodes -> wordpress__wp_taxonomies fetched : 1
⠄ source and transform nodes -> wordpress__CATEGORY fetched : 14
⠈ source and transform nodes -> wordpress__TAG fetched : 19
⠐ source and transform nodes -> wordpress__wp_users fetched : 11
⡀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "You are not currently logged in."
⠈ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
⡀ source and transform nodesThe server response was "404 Not Found"
Inner exception message : "No route was found matching the URL and request method"
success source and transform nodes — 636.410 s

dustinhorton commented 6 years ago

@pieh Haven't confirmed this will successfully build (now with the WordPress remote, it's taking hours), but it certainly reveals the issue: https://github.com/dustinhorton/gatsby-issue

Should be able to just clone that and build.

dustinhorton commented 6 years ago

Just ran twice for over 10 hours without the site finishing building. Please let me know what else I can provide for help debugging.

KyleAMathews commented 6 years ago

Could you try upgrading to v2? We've made a ton of speed improvements to different gatsby subsystems which should dramatically speed up large sites like this.

dustinhorton commented 6 years ago

@KyleAMathews I'll give that a shot tonight—thanks.

dustinhorton commented 6 years ago

@KyleAMathews v2 version @ https://github.com/dustinhorton/gatsby-v2-issue. Been building for about 50 minutes at this point.

dustinhorton commented 6 years ago

Killing it now. Site still hasn't built.

KyleAMathews commented 6 years ago

Another thing you can try is to enable tracing https://next.gatsbyjs.org/docs/performance-tracing/

We haven't added tracing support yet to gatsby-source-wordpress but the tracing reports might help you figure out where it's stalling.

If anyone else is interested in looking into this, a great PR would be to add tracing support to gatsby-source-wordpress. Lemme know if you're interested!

dustinhorton commented 6 years ago

Going to need to bail out on this unfortunately, as I need to spend all time I have porting over to a traditional theme—kind of crushed to not be able to use Gatsby. Everything else feels so backwards.

KyleAMathews commented 6 years ago

Sorry we haven't had a chance to look into this :-( Sprinting right now to get v2 out.

Is there a chance you could leave the WP site running? It definitely seems like there's a bug here that should be fixed.

KyleAMathews commented 6 years ago

I tweeted out asking for help so hopefully someone will jump on this soon :-)

https://twitter.com/gatsbyjs/status/1027079401287102465

dustinhorton commented 6 years ago

Wow, that's rad—thanks so much. Site isn't going anywhere for the time being (and I'll migrate a copy and update repro repo if it needs to).

Khristophor commented 6 years ago

@dustinhorton for what it's worth I've also noticed issues building a larger (~1,000 post) project on Local by Flywheel compared to our production environment with a CDN in front of it.

REST responses for Gatsby are 10-20x longer from Local than from production, so the site takes forever to build. I haven't spent time debugging the issue in Local yet, but it's on my to-do list :)

@KyleAMathews I could take a look at adding tracing to source-wordpress.

KyleAMathews commented 6 years ago

@Khristophor that'd be great!

Khristophor commented 6 years ago

@dustinhorton I'm seeing 404s for the images on your sample site (https://dustinhorton.com/gatsby-wp/wp-content/uploads/2018/07/IMG_9906.jpg, for example) that might be inflating the build time. Any chance you could look into the paths for those?

dustinhorton commented 6 years ago

The WP_MEDIA requests run fairly quickly with results so figured I was in the clear, but I can take a look at that later this week if you think it may be the case.

Khristophor commented 6 years ago

That's true, but part of the source and transform step is to download all the media items it finds in the REST response: https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-wordpress/src/normalize.js#L434

Getting 404's on 7504 images might be causing some problems ;)

dustinhorton commented 6 years ago

Believe I've cleaned up all the 404s. Will try to build tonight. Thanks all.

dustinhorton commented 6 years ago

Seemingly no change:

~/Sites/gatsby-issue-v2 (master)
→yarn build
yarn run v1.5.1
$ gatsby build
success open and validate gatsby-config — 0.009 s
success load plugins — 0.277 s
success onPreInit — 0.257 s
success delete html and css files from previous builds — 0.008 s
success initialize cache — 0.245 s
success copy gatsby files — 0.079 s
success onPreBootstrap — 0.001 s
⠁
=START PLUGIN=====================================

Site URL: http://dustinhorton.com/gatsby-wp
Site hosted on Wordpress.com: false
Using ACF: true
Using Auth: undefined undefined
Verbose output: true

Mama Route URL: http://dustinhorton.com/gatsby-wp/wp-json

⠁ source and transform nodesRoute discovered : /
Invalid route.
Route discovered : /oembed/1.0
Invalid route.
Route discovered : /oembed/1.0/embed
Invalid route.
Route discovered : /oembed/1.0/proxy
Invalid route.
Route discovered : /yoast/v1
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/configurator
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/reindex_posts
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/ryte
Valid route found. Will try to fetch.
Route discovered : /yoast/v1/indexables/(?P<object_type>.*)/(?P<object_id>\d+)
Invalid route.
Route discovered : /yoast/v1/statistics
Valid route found. Will try to fetch.
Route discovered : /acf/v3
Invalid route.
Route discovered : /acf/v3/posts/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/posts
Valid route found. Will try to fetch.
Route discovered : /acf/v3/pages/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/pages
Valid route found. Will try to fetch.
Route discovered : /acf/v3/media/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/media
Valid route found. Will try to fetch.
Route discovered : /acf/v3/categories/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/categories
Valid route found. Will try to fetch.
Route discovered : /acf/v3/tags/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/tags
Valid route found. Will try to fetch.
Route discovered : /acf/v3/comments/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/comments
Valid route found. Will try to fetch.
Route discovered : /acf/v3/options/(?P<id>[\w\-\_]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/users/(?P<id>[\d]+)/?(?P<field>[\w\-\_]+)?
Invalid route.
Route discovered : /acf/v3/users
Valid route found. Will try to fetch.
Route discovered : /wp/v2
Invalid route.
Route discovered : /wp/v2/posts
Valid route found. Will try to fetch.
Route discovered : /wp/v2/posts/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/posts/(?P<parent>[\d]+)/revisions
Invalid route.
Route discovered : /wp/v2/posts/(?P<parent>[\d]+)/revisions/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/pages
Valid route found. Will try to fetch.
Route discovered : /wp/v2/pages/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/pages/(?P<parent>[\d]+)/revisions
Invalid route.
Route discovered : /wp/v2/pages/(?P<parent>[\d]+)/revisions/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/media
Valid route found. Will try to fetch.
Route discovered : /wp/v2/media/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/types
Valid route found. Will try to fetch.
Route discovered : /wp/v2/types/(?P<type>[\w-]+)
Invalid route.
Route discovered : /wp/v2/statuses
Valid route found. Will try to fetch.
Route discovered : /wp/v2/statuses/(?P<status>[\w-]+)
Invalid route.
Route discovered : /wp/v2/taxonomies
Valid route found. Will try to fetch.
Route discovered : /wp/v2/taxonomies/(?P<taxonomy>[\w-]+)
Invalid route.
Route discovered : /wp/v2/categories
Valid route found. Will try to fetch.
Route discovered : /wp/v2/categories/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/tags
Valid route found. Will try to fetch.
Route discovered : /wp/v2/tags/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/users
Valid route found. Will try to fetch.
Route discovered : /wp/v2/users/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/users/me
Valid route found. Will try to fetch.
Route discovered : /wp/v2/comments
Valid route found. Will try to fetch.
Route discovered : /wp/v2/comments/(?P<id>[\d]+)
Invalid route.
Route discovered : /wp/v2/settings
Valid route found. Will try to fetch.
Added ACF Options route.

Fetching the JSON data from 25 valid API Routes...

=== [ Fetching wordpress__yoast_v1 ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1
⠈ source and transform nodes -> wordpress__yoast_v1 fetched : 1
Fetching the wordpress__yoast_v1 took: 936.166ms

=== [ Fetching wordpress__yoast_configurator ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/configurator
⢀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_configurator took: 846.014ms

=== [ Fetching wordpress__yoast_reindex_posts ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/reindex_posts
⢀ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_reindex_posts took: 1010.589ms

=== [ Fetching wordpress__yoast_ryte ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/ryte
⠠ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_ryte took: 1022.977ms

=== [ Fetching wordpress__yoast_statistics ] === https://dustinhorton.com/gatsby-wp/wp-json/yoast/v1/statistics
⠄ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__yoast_statistics took: 820.827ms

=== [ Fetching wordpress__acf_posts ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/posts
⠈ source and transform nodes -> wordpress__acf_posts fetched : 100
Fetching the wordpress__acf_posts took: 6352.670ms

=== [ Fetching wordpress__acf_pages ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/pages
⡀ source and transform nodes -> wordpress__acf_pages fetched : 34
Fetching the wordpress__acf_pages took: 2760.048ms

=== [ Fetching wordpress__acf_media ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/media
⠈ source and transform nodes -> wordpress__acf_media fetched : 100
Fetching the wordpress__acf_media took: 4273.250ms

=== [ Fetching wordpress__acf_categories ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/categories
⠁ source and transform nodes -> wordpress__acf_categories fetched : 13
Fetching the wordpress__acf_categories took: 1029.029ms

=== [ Fetching wordpress__acf_tags ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/tags
⠈ source and transform nodes -> wordpress__acf_tags fetched : 0
Fetching the wordpress__acf_tags took: 941.066ms

=== [ Fetching wordpress__acf_comments ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/comments
⢀ source and transform nodes -> wordpress__acf_comments fetched : 9
Fetching the wordpress__acf_comments took: 2868.036ms

=== [ Fetching wordpress__acf_users ] === https://dustinhorton.com/gatsby-wp/wp-json/acf/v3/users
⠠ source and transform nodes -> wordpress__acf_users fetched : 11
Fetching the wordpress__acf_users took: 2049.181ms

=== [ Fetching wordpress__POST ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/posts
⠁ source and transform nodes
Total entities : 9094
Pages to be requested : 91
⠁ source and transform nodes -> wordpress__POST fetched : 9094
Fetching the wordpress__POST took: 152767.807ms

=== [ Fetching wordpress__PAGE ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/pages
⢀ source and transform nodes -> wordpress__PAGE fetched : 34
Fetching the wordpress__PAGE took: 2194.895ms

=== [ Fetching wordpress__wp_media ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/media
⢀ source and transform nodes
Total entities : 7504
Pages to be requested : 76
⢀ source and transform nodes -> wordpress__wp_media fetched : 7485
Fetching the wordpress__wp_media took: 132029.996ms

=== [ Fetching wordpress__wp_types ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/types
⢀ source and transform nodes -> wordpress__wp_types fetched : 1
Fetching the wordpress__wp_types took: 956.603ms

=== [ Fetching wordpress__wp_statuses ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/statuses
⢀ source and transform nodes -> wordpress__wp_statuses fetched : 1
Fetching the wordpress__wp_statuses took: 1017.845ms

=== [ Fetching wordpress__wp_taxonomies ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/taxonomies
⠠ source and transform nodes -> wordpress__wp_taxonomies fetched : 1
Fetching the wordpress__wp_taxonomies took: 1029.885ms

=== [ Fetching wordpress__CATEGORY ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/categories
⢀ source and transform nodes -> wordpress__CATEGORY fetched : 14
Fetching the wordpress__CATEGORY took: 943.710ms

=== [ Fetching wordpress__TAG ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/tags
⠠ source and transform nodes -> wordpress__TAG fetched : 19
Fetching the wordpress__TAG took: 1104.454ms

=== [ Fetching wordpress__wp_users ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/users
⡀ source and transform nodes -> wordpress__wp_users fetched : 11
Fetching the wordpress__wp_users took: 1325.604ms

=== [ Fetching wordpress__wp_me ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/users/me
⠂ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "You are not currently logged in."
Fetching the wordpress__wp_me took: 926.146ms

=== [ Fetching wordpress__wp_comments ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/comments
⠂ source and transform nodes
Total entities : 9410
Pages to be requested : 95
⡀ source and transform nodes -> wordpress__wp_comments fetched : 9397
Fetching the wordpress__wp_comments took: 85370.673ms

=== [ Fetching wordpress__wp_settings ] === https://dustinhorton.com/gatsby-wp/wp-json/wp/v2/settings
⠁ source and transform nodesThe server response was "401 Unauthorized"
Inner exception message : "Sorry, you are not allowed to do that."
Fetching the wordpress__wp_settings took: 808.396ms

=== [ Fetching wordpress__acf_options ] === http://dustinhorton.com/gatsby-wp/wp-json/acf/v2/options
⠂ source and transform nodesThe server response was "404 Not Found"
Inner exception message : "No route was found matching the URL and request method"
Fetching the wordpress__acf_options took: 1059.276ms

=END PLUGIN=====================================: 412457.896ms
⠁ source and transform nodes

And it's been sitting there for about 8 hours.

Khristophor commented 6 years ago

@dustinhorton what kind of hosting are you using? I think it's just killing your production box with the number of requests. I believe I got it to finish (after quite some time, not eight hours) by setting concurrent connections to something low, like 1 or 2.

dustinhorton commented 6 years ago

It's a decent VPS on Linode. I can get settings tweaked on it if that'd help. But the issue happens locally too.

pieh commented 6 years ago

https://github.com/gatsbyjs/gatsby/blob/46290c2b0e7894fca036bdcc658a5d1936c4221f/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L133-L159 This sometimes doesn't work correctly when we pull a larger number of files: the network request gets resolved, but the file write stream never finishes (or errors out). I think it would be great to add some kind of timeout after responseStream finishes to wait for fsWriteStream to finish. If it doesn't finish, we would destroy all resources and try to write the file again (possibly with a few retries), and actually error out when that still doesn't work.
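
The timeout-and-retry idea described above could be sketched roughly as follows. This is not the code in create-remote-file-node.js and not a published fix; `writeOnce`, `writeWithRetry`, `makeSource`, `timeoutMs` and `retries` are hypothetical names chosen for illustration, and only Node's built-in `fs` module is used.

```javascript
const fs = require('fs')

// Pipe `source` (any readable stream, e.g. an HTTP response) into a file.
// Rejects if the write stream has not emitted 'finish' within `timeoutMs`
// after the source ended, or if either stream errors.
function writeOnce(source, destPath, timeoutMs) {
  return new Promise((resolve, reject) => {
    const fsWriteStream = fs.createWriteStream(destPath)
    let timer = null
    let settled = false

    const fail = (err) => {
      if (settled) return
      settled = true
      clearTimeout(timer)
      // Destroy both streams so no file handle is left dangling
      source.destroy()
      fsWriteStream.destroy()
      reject(err)
    }

    source.on('error', fail)
    fsWriteStream.on('error', fail)

    // Start the watchdog only once the source has delivered everything
    source.on('end', () => {
      if (settled) return
      timer = setTimeout(
        () => fail(new Error(`Write to ${destPath} did not finish in time`)),
        timeoutMs
      )
    })

    fsWriteStream.on('finish', () => {
      if (settled) return
      settled = true
      clearTimeout(timer)
      resolve(destPath)
    })

    source.pipe(fsWriteStream)
  })
}

// Hypothetical retry wrapper: `makeSource` must return a fresh readable
// stream on every call, because a consumed stream cannot be replayed.
async function writeWithRetry(makeSource, destPath, { timeoutMs = 5000, retries = 3 } = {}) {
  let lastError
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await writeOnce(makeSource(), destPath, timeoutMs)
    } catch (err) {
      lastError = err // remember the failure and try again
    }
  }
  throw lastError // give up loudly instead of hanging forever
}
```

In a real download, `makeSource` would presumably re-issue the HTTP request on each attempt. Whether a watchdog like this actually fixes the stall is untested here, since the stall itself is hard to reproduce.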

aman-developer commented 6 years ago

@pieh can you please send updated code for this file ?

/packages/gatsby-source-filesystem/src/create-remote-file-node.js

pieh commented 6 years ago

@aman-developer there is no fix for this yet; otherwise it would be published. The problem with this issue is that there is no reliable way to reproduce it, so any fixes are guesses. In some cases (possibly hardware and/or OS specific) the filesystem writeStream doesn't finish and gets stuck without throwing errors, so any fix here is really a workaround for the fs package, hardware, or OS not being reliable :/

dustinhorton commented 6 years ago

Have you had issues reproing with my repo & site? It's consistent for me.

n3v3rf411 commented 6 years ago

I use createRemoteFileNode to fetch remote images and I experience this same problem: download gets stuck at around 680/780ish.

In createRemoteFileNode, there is a listener to downloadProgress event that was added in https://github.com/sindresorhus/got/releases/tag/v8.0.0 but gatsby-source-filesystem uses got 7.1.0.

I tried upgrading got to the latest version 9.2.2 and could now successfully download all images.

Add this in package.json:

  "resolutions": {
    "got": "^9.2.2"
  }

n3v3rf411 commented 6 years ago

Also, there seem to be some critical fixes in got after 7.1.0, like stream errors not being correctly forwarded, etc. (https://github.com/sindresorhus/got/releases/tag/v8.0.1)

pieh commented 6 years ago

I tried updating got, but it still sometimes gets stuck; it's worth doing anyway. Just note that the downloadProgress stuff will either need disabling or some nicer output, because the terminal/console gets spammed with progress when using it.

n3v3rf411 commented 6 years ago

I was able to run gatsby develop after ~25 minutes, but I had to reduce concurrency in create-remote-file-node.js from 200 to 20. After putting logs in that empty catch in processRemoteNode, I did get 22 TimeoutErrors (those files were redownloaded when executing gatsby develop again).

Not sure if it's because of got but maybe can experiment with other http clients...
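
To illustrate what reducing concurrency means here, a rough sketch of a promise pool follows. It is not the queue that create-remote-file-node.js actually uses; `mapWithConcurrency`, `limit` and `worker` are hypothetical names, and `worker` stands in for whatever downloads a single file.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0 // index of the next item nobody has claimed yet

  // Each runner keeps claiming items until none are left
  async function run() {
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i], i)
    }
  }

  // Start `limit` runners (fewer if there are not that many items)
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run())
  await Promise.all(runners)
  return results
}
```

With a limit of 20 instead of 200, the server sees far fewer simultaneous requests, which is a plausible reason the timeouts became rare, though that is a guess rather than something measured in this thread.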

...
success source and transform nodes — 1407.531 s
success building schema — 3.315 s
success createPages — 0.571 s
success createPagesStatefully — 2.797 s
success onPreExtractQueries — 0.012 s
success update schema — 3.268 s
warning There are conflicting field types in your data. GraphQL schema will omit those fields.
wordpress__wp_media.media_details.width:
 - type: number
   value: 916
 - type: string
   value: '224'
wordpress__wp_media.media_details.height:
 - type: number
   value: 916
 - type: string
   value: '225'
wordpress__wp_media.media_details.sizes.thumbnail.width:
 - type: number
   value: 150
 - type: string
   value: '150'
wordpress__wp_media.media_details.sizes.thumbnail.height:
 - type: number
   value: 150
 - type: string
   value: '150'
wordpress__wp_media.media_details.sizes.medium.width:
 - type: number
   value: 300
 - type: string
   value: '300'
wordpress__wp_media.media_details.sizes.medium.height:
 - type: number
   value: 300
 - type: string
   value: '200'
wordpress__wp_media.media_details.sizes.large.width:
 - type: number
   value: 768
 - type: string
   value: '1024'
wordpress__wp_media.media_details.sizes.large.height:
 - type: number
   value: 1024
 - type: string
   value: '682'
wordpress__wp_media.media_details.image_meta.aperture:
 - type: number
   value: 2.2
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.created_timestamp:
 - type: boolean
   value: false
 - type: number
   value: 1433226914
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.focal_length:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.iso:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.shutter_speed:
 - type: number
   value: 0
 - type: string
   value: '0'
wordpress__wp_media.media_details.image_meta.orientation:
 - type: number
   value: 1
 - type: string
   value: '1'
warning Using the global `graphql` tag is deprecated, and will not be supported in v3.
Import it instead like:  import { graphql } from 'gatsby' in file:
/Users/tandingan.wlb/Projects/gatsby/gatsby-issue/src/templates/Post/Post.js
success extract queries from components — 0.120 s
success run graphql queries — 223.335 s — 9121/9121 40.84 queries/second
success write out page data — 0.119 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.027 s

info bootstrap finished - 1643.854 s
{ TimeoutError: Timeout awaiting 'request' for 30000ms
    at Immediate.timeoutHandler [as _onImmediate] (/Users/tandingan.wlb/Projects/gatsby/gatsby-issue/node_modules/got/source/timed-out.js:39:25)
    at runCallback (timers.js:694:11)
    at tryOnImmediate (timers.js:664:5)
    at processImmediate (timers.js:646:5)
  name: 'TimeoutError',
  code: 'ETIMEDOUT',
  host: 'dustinhorton.com',
  hostname: 'dustinhorton.com',
  method: 'GET',
  path: '/gatsby-wp/wp-content/uploads/2015/05/20150302_061259.jpg',
  socketPath: undefined,
  protocol: 'https:',
  url:
   'https://dustinhorton.com/gatsby-wp/wp-content/uploads/2015/05/20150302_061259.jpg',
  event: 'request' }

RobinHerzog commented 6 years ago

I'm getting the same errors with prismic

RobinHerzog commented 6 years ago

I upgraded to "got": "^9.2.2" and now it's working, hooray!

pieh commented 6 years ago

Definitely need to look into upgrading our got version. This is an intermittent problem, so it might be a coincidence that it worked. @RobinHerzog please let us know if you run into similar problems with the upgraded version of got.

dustinhorton commented 6 years ago

Updating got significantly reduced build time for my repro repo, but still consistently took nearly an hour last I tried.

pieh commented 6 years ago

@dustinhorton what portion of the build was pulling images (or source and transform data, as we don't show explicitly how long downloading files takes)?

RobinHerzog commented 6 years ago

I have 150MB of images with a 1GB internet connection. Now it's working. It takes about 30 seconds to download and continue building.

nratter commented 6 years ago

I'm also having this issue consistently. Upgrading got did not solve this for me. Any success with adding additional tracing to source-wordpress so we can try and debug what the problem is?

hdoro commented 6 years ago

Tried changing concurrentRequests and perPage, as well as upgrading got to the latest version, but none worked. Right now I can fetch categories, posts, pages and tags, but when I include users or media, right after =END PLUGIN===, the plugin returns with an error: TypeError: Cannot read property 'id' of undefined.

If I include all routes and blacklist the ones I don't have access to, I get =END PLUGIN=== but it never finishes... This goes for tons of websites I tested, so I figure it might be my system somehow. If anyone wants to test this, here's the config:

    {
      resolve: 'gatsby-source-wordpress',
      options: {
        // Other URLs I tried:
        // https://clubedovalor.com.br
        // http://rivainvestimentos.com.br
        // http://queroinvestiragora.com/
        // https://www.clubedospoupadores.com/
        baseUrl: "aprenda.guiainvest.com.br",
        protocol: "https",
        hostingWPCOM: false,
        useACF: false,
        concurrentRequests: 10,
        perPage: 50,
        // Going with the excluded routes path
        // excludedRoutes: [
        //   '/*/*/plugins',
        //   '/rock-convert/**',
        //   '/yoast/**',
        //   '/wp-super-cache/**',
        //   '/*/*/users/me',
        //   '/*/*/settings',
        // ],
        verboseOutput: true,
        includedRoutes: [
          "/*/*/categories",
          "/*/*/posts",
          "/*/*/pages",
          "/*/*/tags",
          // You can toggle between media and users (or both)
          // All 3 scenarios will fail with the `'id' of undefined`
          // problem
          // "/*/*/media",
          "/*/*/users",
        ],
      },

PS: One URL that I did manage to fetch was https://wesbos.com/

HAPPY UPDATE: I managed to make it work (for smaller sites) with includedRoutes, even with users and/or media by including taxonomies in the query. Now I don't get the 'id' of undefined error :D

@pieh I believe the users and media types are dependent upon taxonomies, so maybe taxonomies should be included by default whenever the config contains either of these types? Let me know if I can help with further troubleshooting! As a closing note, this taxonomies bug seems unrelated to the infinite build process. With sites larger than ~500 media files, I still can't finish the build process!

UPDATE Number 2: So, I've managed to make it work for queroinvestiragora.com, which has 600 media files but only 70 posts, it takes roughly 15 seconds after =END PLUGIN=== , but it works. However, www.clubedospoupadores.com has 702 media files and 336 posts and it won't compile.

PS: My config in these experiments is:

    {
      resolve: 'gatsby-source-wordpress',
      options: {
        baseUrl: "queroinvestiragora.com",
        protocol: "http",
        hostingWPCOM: false,
        useACF: false,
        concurrentRequests: 10, // I've also tried removing it and going with the default, it's the same result
        verboseOutput: true,
        includedRoutes: [
          "/*/*/categories",
          "/*/*/posts",
          "/*/*/pages",
          "/*/*/tags",
          "/*/*/media",
          "/*/*/users",
          "/*/*/taxonomies",
        ],
      },
    },

njmyers commented 6 years ago

Hello,

I managed to add tracing using the steps outlined here https://www.gatsbyjs.org/docs/performance-tracing/. Unfortunately it did not provide much info as it simply told me that indeed source and transform nodes is taking quite long.

I have however done some of my own debugging on the issue after having some non-deterministic behavior involving images. When running either develop or build script I would get a case where not all of the images would be downloaded and the localFile nodes would not complete. After digging into the code I have determined that there seems to be an issue here

https://github.com/gatsbyjs/gatsby/blob/ad142af473fc8dc8555a5cf23a0dfca42fcbbe90/packages/gatsby-source-wordpress/src/normalize.js#L483-L506

For me, createRemoteFileNode was failing due to server timeout errors, and it defaults to returning null. I had to add some logging to createRemoteFileNode to determine this and see the actual server responses. Since these nodes don't complete and don't have IDs, they don't get registered in the cache. The tmp files are deleted, leaving the gatsby-source-filesystem cache incomplete. For whatever reason (I haven't looked that far yet), upon running the build script again the gatsby-source-filesystem cache was then deleted, probably because the script detects that it is invalid or incomplete. For me, this process was creating a loop and causing errors on future builds, as the cache never completes.

I'm working on a fix that seems to alleviate some of the issues at least regarding large amounts of images. When the develop or build script is successful in downloading all of the images the first time, it subsequently is not deleted and then the build process happens quite rapidly as the images are properly cached by gatsby-source-filesystem! My build went from 15 minutes down to 1 minute.

I'm not sure whether this is related to builds that have large amounts of posts. My issue was directly related to downloading 1.6 GB of image data.

This is my first time working with source plugins for Gatsby, so if anyone has any thoughts or advice regarding this I would appreciate it! I should be able to post my repo later today. I am working on getting it to use my local version of gatsby-source-filesystem without complications.

njmyers commented 6 years ago

Hello,

Following up on my comment from a few days ago. Here is my repo.

https://github.com/njmyers/byalejandradesign.com.git

I am using a monorepo in this project so here are some steps if you want to run the repository locally.

  1. Ensure you have the latest version of Yarn (1.12.3)
  2. Clone the plugin branch git clone https://github.com/njmyers/byalejandradesign.com.git -b wordpress-plugin
  3. Run yarn && yarn bootstrap
  4. Navigate to the gatsby folder so you can look just at that folder cd packages/web
  5. Run yarn develop or yarn build-web. It should complete successfully the first time, and subsequent runs of the same command will result in much quicker builds! Source and transform nodes takes 222s for me, whereas it was taking 3 times that earlier and/or not completing.
  6. If you want to see what is actually happening during source and transform, look in your file browser at /packages/web/.cache/gatsby-source-filesystem. You will see that the files are being created there.

I rewrote the downloadMediaFiles function completely. You can see that file at this link https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js

It is probably more verbose than it needs to be, but I had to do this in order to figure out everything that is happening. The functionality I changed is adding a promise rejection when createRemoteFileNode returns null. I then use a function, downloadRunner, to throttle the requests so that they don't all hit the server at once, and to retry on promise rejections. I added a 200ms throttle between each createRemoteFileNode request. I'm sure this value could be tweaked, and some of this might be better suited to adding to createRemoteFileNode directly.
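For readers following along, the throttle-and-retry idea described above can be sketched roughly as follows. This is not the code from the linked file: `downloadOne`, the `createNode` parameter, and the `throttleMs` and `retries` options are hypothetical names chosen for illustration, and `createNode` stands in for any function that behaves like createRemoteFileNode (resolving to a node, or to null on failure).

```javascript
// Illustrative sketch only. `downloadOne`, `createNode`, `throttleMs`, and
// `retries` are hypothetical names, not part of gatsby-source-wordpress.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Turn a silent null result into a rejection so the caller can react to it.
async function downloadOne(createNode, url) {
  const node = await createNode(url);
  if (!node) {
    throw new Error(`No file node created for ${url}`);
  }
  return node;
}

// Process URLs one at a time, pausing `throttleMs` after every request and
// retrying each failed download up to `retries` additional times.
async function downloadRunner(createNode, urls, { throttleMs = 200, retries = 3 } = {}) {
  const nodes = [];
  for (const url of urls) {
    let node = null;
    let lastError = null;
    for (let attempt = 0; attempt <= retries && node === null; attempt++) {
      try {
        node = await downloadOne(createNode, url);
      } catch (error) {
        lastError = error;
      }
      // Pause after every request, successful or not, to spread out the load.
      await sleep(throttleMs);
    }
    if (node === null) {
      // All attempts failed: surface the error instead of returning null.
      throw lastError;
    }
    nodes.push(node);
  }
  return nodes;
}
```

Running requests strictly one at a time is the simplest way to keep the server from being hit all at once; it trades speed for reliability, which is why the delay and retry count are left as options here.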

If anyone is curious, the WP install is on an EC2 micro instance, while the images are behind a CloudFront distribution. Personally, I never had any issues with getting posts. My issue was with getting images, and I believe that most of the issues people are having are due to this.

njmyers commented 6 years ago

Also if anyone wants to trace or debug their own site I suggest starting here...

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L240-L244

I added logging to the catch clause and was able to determine that the image nodes were not being processed correctly as I was getting timeout errors and then returning null.

pieh commented 6 years ago

@njmyers I just took a very brief look at that, and I'm thinking that if this works, we should use a similar approach in createRemoteFileNode directly. We are using a queue there, so consumers of this function (gatsby-source-wordpress in this case) shouldn't need to worry about this. One thing that is potentially problematic is that 200ms throttle. Maybe we could start without it and, when we start to see problems, automatically apply throttling (per hostname).
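A minimal sketch of what "no throttle until problems appear, then throttle per hostname" might look like. Everything here is an assumption for illustration: `throttledDownload`, `hostDelays`, `stepMs`, and `maxMs` are hypothetical names, and `download` is a caller-supplied function rather than any real Gatsby API.

```javascript
// Illustrative sketch only; none of these names exist in any Gatsby package.

// Current delay (in ms) applied before requests to each hostname.
const hostDelays = new Map();

const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch `url` via the caller-supplied `download` function. Hosts start with
// no delay; a failure raises that host's delay, a success lowers it again.
async function throttledDownload(download, url, { stepMs = 200, maxMs = 2000 } = {}) {
  const host = new URL(url).hostname;
  const delay = hostDelays.get(host) || 0;
  if (delay > 0) {
    await pause(delay);
  }
  try {
    const result = await download(url);
    hostDelays.set(host, Math.max(0, delay - stepMs));
    return result;
  } catch (error) {
    hostDelays.set(host, Math.min(maxMs, delay + stepMs));
    throw error;
  }
}
```

Keying the delay on the hostname means a slow WordPress server does not slow down downloads from an unrelated CDN. The sketch ignores coordination between concurrent requests to the same host, which a real queue would need to handle.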

njmyers commented 6 years ago

@pieh Yes, that would probably be the place to apply this logic. The throttling for me was a way to approach this and diagnose the issue, so I agree that createRemoteFileNode should be able to handle this on its own.

Particularly problematic however is the current behavior of silently failing the errors and returning null. In my opinion there should be some communication about either the failure or success of the operation. I think createRemoteFileNode could be made more robust with the following functionality.

  1. Eagerly create connections
  2. If there are errors from the server, begin to throttle and/or retry if needed
  3. Set some sane defaults for throttling/retrying
  4. Create an entry point for adjusting throttling/retrying
  5. Reject a promise if for some reason the node is unable to be processed
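The behavior listed above could be sketched along these lines. This is an assumption-laden illustration, not the gatsby-source-filesystem implementation: `runWithBackoff`, `DEFAULTS`, and the option names `maxRetries`, `baseDelayMs`, and `maxDelayMs` are all hypothetical, and exponential backoff is just one possible throttling strategy.

```javascript
// Illustrative sketch only; these names and options are hypothetical and do
// not exist in gatsby-source-filesystem.
const DEFAULTS = { maxRetries: 3, baseDelayMs: 500, maxDelayMs: 8000 };

const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` immediately (eagerly). Only after a failure, wait before the
// next try, doubling the delay each time. Reject once retries are used up.
async function runWithBackoff(task, options = {}) {
  // Callers can override any default, which is the "entry point" for tuning.
  const { maxRetries, baseDelayMs, maxDelayMs } = { ...DEFAULTS, ...options };
  let attempt = 0;
  while (true) {
    try {
      const result = await task();
      if (result == null) {
        // Treat an empty result as a failure rather than passing it along.
        throw new Error("task returned no node");
      }
      return result;
    } catch (error) {
      if (attempt >= maxRetries) {
        throw new Error(`Failed after ${attempt + 1} attempts: ${error.message}`);
      }
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      attempt += 1;
      await wait(delay);
    }
  }
}
```

A caller would pass a function that performs one download, for example `runWithBackoff(() => download(url), { maxRetries: 5 })`, where `download` is whatever function fetches the file.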

I can also say that I played around with timeout values here https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L135-L141. Although that increased the probability of a successful response I still had to add handling in order to ensure a successful response.

Most likely the correct entry point for this logic would be here.

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L259-L269

That is where failing tasks could be retried and, if they keep failing, finally rejected.

njmyers commented 6 years ago

Also, I just briefly read the queue docs. I see what you are saying about the queue being able to manage this on its own.

KyleAMathews commented 6 years ago

@njmyers nice investigation work! Definitely agree that the file downloading needs to be a lot smarter!

KyleAMathews commented 6 years ago

It could be nice actually to extract out the file downloading piece to its own package that focuses on this problem of downloading and caching remote files.

KyleAMathews commented 6 years ago

There's a good chance we'll need to use the functionality in multiple places in Gatsby in the future, and it's something other folks on the internet would want to use as well.

njmyers commented 6 years ago

@KyleAMathews you mean extracting createRemoteFileNode to a separate package?