Yoast / wordpress-seo

Yoast SEO for WordPress
https://yoast.com/wordpress/plugins/seo/
Other
1.76k stars 891 forks source link

Performance fix for Yoast sitemaps #12161

Closed dhilditch closed 4 years ago

dhilditch commented 5 years ago

Please give us a description of what happened.

The performance of Yoast sitemaps is hurting servers on larger websites.

With a client of mine, they have 1.4 million products, and Yoast is causing table-scans to happen on every segment of data loaded. It's good that Yoast builds the sitemap in chunks, but it's bad that table scans happen for each of these chunks.

The table scans are causing sitemaps to not work on large websites, and worse, they're hurting other user queries.

Please describe what you expected to happen and why.

I expect index-seeks and good performance regardless of how large websites get.

I expect the sitemaps to work.

How can we reproduce this behavior?

  1. Enable XML sitemaps on Yoast settings page on million+ item store
  2. View sitemap.xml page
  3. View slow query log

Technical info

The problem is two-fold:

  1. There is a post_date != '00-00-00 00:00:00' query added to the where clause. This prevents use of a sorted index column which is required for the pagination.

  2. There is no supporting index to avoid sorting operations.

I've posted a full analysis here:

https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/

But basically:

  1. Remove this post_date != '00-00-00 00:00:00' part of the query from this file: https://github.com/Yoast/wordpress-seo/blob/a1628d1570a8e8cca896312509ac986182836c30/inc/sitemaps/class-post-type-sitemap-provider.php

  2. Add a supporting index to wp_posts so sorts are not required on every windowed-page request:

create index wpi_scalability_pro_sitemaps on wp_posts (post_status, post_password, post_type, post_modified)

Used versions

Rarst commented 5 years ago

Continuing our discussion from your post, in the past I worked on refactoring of sitemaps module and specifically paid close attention to feedback on performance at the scale you are talking about.

For the record I am not a DB specialist, so my experience with this is combination of my own observation, extensive feedback from plugin's users, and some reaching out to people who are specialists.

To reiterate your claim from the post is that main query for the sitemap post page is unoptimized and doesn't scale for large amounts of posts.

The query in question is (you chose to run a subquery part of it in your post):

SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt
FROM (
    SELECT wp_posts.ID
    FROM wp_posts
    WHERE wp_posts.post_status = 'publish' AND wp_posts.post_type = 'test' AND wp_posts.post_password = '' AND wp_posts.post_date != '0000-00-00 00:00:00'
    ORDER BY wp_posts.post_modified ASC
    LIMIT 100 OFFSET 0
) o
JOIN wp_posts l ON l.ID = o.ID

The main challenge of SQL for sitemaps is that MySQL doesn't scale well for OFFSET operations. The more rows you have the more rows it walks through to reach your offset. The queries in the plugin had been very specifically written and audited to account for that.

I didn't have cool 1M posts on hand, but I generated 92K which is more than sufficient to illustrate the rate of performance drop off:

So with a considerable offset performance dropped of by 50 milliseconds, while your post claims performance drop off of multiple minutes (at larger sizes, but certainly not consistent with the pace I observe).

Further, removing AND wp_posts.post_date != '0000-00-00 00:00:00' part of query, you claim to be problematic and have major impact on performance of the query, makes zero difference to the run time for me.

Do I believe that you observe poor performance at a large site? Sure, that is certainly possible.

Do I believe that this query is inherently problematic? No, not from my experience or any feedback I've gotten before.

I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.

dhilditch commented 5 years ago

Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?

dhilditch commented 5 years ago

For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.

dhilditch commented 5 years ago

And as for the 'dropoff' - I think you'll find both those queries are just doing full table scans and hence no difference in speed.

dhilditch commented 5 years ago

Finally -

I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.

I optimised the server. There's nothing now wrong with the server stack. There were issues with it before, but not now.

Rarst commented 5 years ago

Hi again - you keep using the word 'claim'. Should I record a video for you so you can see these are not claims, they are facts?

I didn't mean this dismissively, I apologize for it coming across like that. I believe your observed results, though I disagree on conclusions so far (the need to change the query and add custom index).

For the removal of that part of the query - you need to remove that AND add the index. If you only add the index, you won't see the boost, if you only remove that part of the query you won't see the boost. You need to do both.

Gotcha. From point of view of plugin in general that would have two problems:

  1. The query is no longer functionally equivalent (which would need to be addressed otherwise).
  2. When I was working on this the priority was using native WP indexes for queries. Custom index would be challenging to reliably ship for general use case.

I optimised the server. There's nothing now wrong with the server stack.

I can't speak for your server stack, I can only say that there is extensive feedback on record from large sites on sitemaps to the contrary and it performing sufficiently well. As such I would suspect individual issue with environment or otherwise, not a systematic one with implementation.

Since I am no longer involved with development, I think I'll shut up now. :) Thank you for taking the time to research and report and I (always) hope there are opportunities to improve sitemap performance for general case, it was my pet module. :)

matrixpoland commented 5 years ago

I have problem too with high CPU usage. Maybe because of sitemaps. I have aproxx 250 pages

joelkarunungan commented 5 years ago

This issue occurs for me and can be observed in large sites with several hundred thousand pages simply by visiting the sitemap_index.xml page. It is loading very slow. Its trivial to insert 1M+ records in the wp_posts table or using the WP API to replicate this. There is a workaround I found: https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/

redvivi commented 5 years ago

Hello @joelkarunungan,

I have the same issue and highly penalising regarding loading time (15 minutes per sitemap).

Have you recently tested https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/ solution ?

I tried to implement it, however the command line is failing, executing: cd /srv/www/example.com && /usr/local/bin/wp eval '$sm = new WPSEO_Sitemaps;$sm->build_root_map();$sm->output();' > /srv/www/example.com/wp-content/uploads/sitemap_index.xml Throws PHP Notice: Undefined index: HTTP_X_FORWARDED_PROTO in phar:///home/wp/vendor/wp-cli/wp-cli/php/WP_CLI/Runner.php(1169) : eval()'d code on line 72 PHP Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388 Warning: Invalid argument supplied for foreach() in /var/www/wordpress/wp-content/plugins/wordpress-seo/inc/sitemaps/class-sitemaps.php on line 388

Am I missing something?

joelkarunungan commented 5 years ago

@redvivi first solution I ended up using is @dhilditch solution:

https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/ > Yoast SEO XML Sitemaps performance (this is same as what he mentioned above in this issue thread)

I eventually dropped Yoast's sitemap and used Google XML Sitemaps as mentioned in the link as well. This was the best solution for me despite the bugs with the 3 year old plugin.

Hoping though that someone would adopt the Google XML Sitemaps plugin through WP official so the plugin gets updated and its bugs fixed since its performance really outperforms Yoast's sitemap generator.

redvivi commented 5 years ago

@joelkarunungan, thanks for your feedback. Unfortunately, @dhilditch does not change the performance issue I am facing, I guess the data I want to sitemap does not apply in my case.

It is quite disappointing that there is not a solution from Yoast.

Uranbold commented 5 years ago

Also got this problem with 500k+ posts. I think it's Google fetching the posts and related to XML sitemap 100%.

SELECT l.ID, post_title, post_content, post_name, post_parent, post_author, post_modified_gmt, post_date, post_date_gmt FROM ( SELECT wp_posts.ID FROM wp_posts WHERE wp_posts.post_status = 'publish' AND wp_posts.post_type = 'post' AND wp_posts.post_password = '' AND wp_posts.post_date != '0000-00-00 00:00:00' ORDER BY wp_posts.post_modified ASC LIMIT 100 OFFSET 381500 ) o JOIN wp_posts l ON l.ID = o.ID

SergeyBiryukov commented 5 years ago

There is a post_date != '00-00-00 00:00:00' query added to the where clause.

For reference, post_date != '0000-00-00 00:00:00' and post_author != 0 were added in 3cdcfc9.

post_author != 0 was later removed in 0845fc9.

im1981 commented 5 years ago

Hello, i also had massive performance issues with the yoast sitemaps (site over 350.000 posts), adding the index and removing the post_date != '0000-00-00 00:00:00' improved performance a lot. Anyway i just build my own static file cache.

roryroryro commented 5 years ago

Hello Me too adding index in database and removing post_date improved speed a lot. However i still have the problem for custom taxonomy. I try to find the post_date in inc/sitemaps/class-taxonomy-sitemap-provider.php but cannot find. Any idea of how I can fix that in a similar way for custom taxonomy? Thank you

amboutwe commented 4 years ago

Please inform the customer of conversation # 555596 when this conversation has been closed.

amboutwe commented 4 years ago

Please inform the customer of conversation # 557857 when this conversation has been closed.

simplenotezy commented 4 years ago

Why in the world would this be prioritized as "a minor"? We have a 6 core server and are too experiencing huge CPU spikes possibly related to constant generation of sitemaps. This should be fixed and prioritized as an important task.

In general, from what I have experienced, Yoast is slow at a lot of things, including slowing our backend down 1 second per. request in our WP Admin backend and also generating sitemaps.

Slowly thinking towards going with another SEO plugin.

monbauza commented 4 years ago

Please inform the customer of conversation # 566981 when this conversation has been closed.

im1981 commented 4 years ago

Hi, i have created a multi column index which helped to improve the performance a lot: ALTER TABLEwp_postsADD INDEXpost_status_post_type_post_password_post_date_post_modified(post_status,post_type,post_password,post_date,post_modified) USING BTREE;

Time to create one post-sitemapXXX.xml sitemap went down from 3! Minutes to 10 Seconds (still much too long).

Drew-Dix commented 4 years ago

We are having the same issue with the sitemap killing the site. I see above this is marked a "minor" issue? That makes absolutely no sense. This should be a priority bug fix as it is relatively simple to address. What is the status of this bug - one that has been known since last February???

Djennez commented 4 years ago

Hi @Drew-Dix , severity currently encompasses both severity and priority. Though I agree that the severity, just looking at functionality, is very high, the priority is pretty low (based on the time this issue has been open). Not many people experience this issue, as far as I can see in the comments. It looks to be very site-specific and no details as to what causes this specifically have been shared. As @Rarst stated at the end of his post:

I suspect the question to explore here is why that specific server runs this query slowly, not what's wrong with the query.

So any detailed information that anyone experiencing this issue can share, would be very welcome.

Drew-Dix commented 4 years ago

As the fix is relatively simple, I would think it would ratchet up on the priority ladder unless Yoast has made a determination to only service very small sites. If that is the case, you should note that on your plugin page for potential users of Yoast to correctly set expectations. While I am not a developer, I did previously manage a development team. A fix like this would have been out the door in days. Not months.

stodorovic commented 4 years ago

These indexes should be added into WP core because it affects feeds, other plugins, ...

@Drew-Dix Please comment https://core.trac.wordpress.org/ticket/15499 too. Maybe it'll help to WP adds proper indexes.

Drew-Dix commented 4 years ago

Thanks @stodorovic will review. Also, to add - this is a chronic issue regarding performance that is progressive. Thus it impacts all Yoast users over time. As Google takes site speed into consideration regarding SEO, by using Yoast you are, ironically, undermining your SEO if your site grows. So this is an issue that is impacting all Yoast users. They just do not know it (yet).

jdevalk commented 4 years ago

We will rebuild the entire XML sitemaps logic once we have the Indexables table we're working on. This will make it possible to generate XML sitemaps simply by querying one table, that'll already contain the permalinks, so generation of an XML sitemap will become an utter breeze. Because we already plan to do that, this issue is no longer of importance as far as I'm concerned, and I'm thus closing. But please do re-open and / or comment if you think I'm missing something :)

For more context on the indexables stuff, it's all in this branch: https://github.com/Yoast/wordpress-seo/tree/feature/indexables-frontend

Drew-Dix commented 4 years ago

@jdevalk Thanks for the update. Do you have timing on when this update will be released?

jdevalk commented 4 years ago

Indexables will probably be released in January, no timing yet on when we'll rebuild XML sitemaps on top of it but I hope to do that soon(ish).

redvivi commented 4 years ago

Would you mind to provide guidance on information (including diag reports) you would require to take this further ? I assume that generating a 100k+ WordPress posts on any website would be sufficient to reproduce this issue...

joelkarunungan commented 4 years ago

@redvivi

What I do is first create a source table containing unique randomly generated content with 1M records. Just copy the wp_posts table and write a script that would churn out the unique random data.

Once that's done its just trivial to feed it into the WP posts table. Pseudocode below for my timezone GMT+8.

INSERT INTO db.destination (ID, post_author, post_date, post_date_gmt, post_content, post_title, post_excerpt, post_status, comment_status, ping_status, post_password, post_name, to_ping, pinged, post_modified, post_modified_gmt, post_content_filtered, post_parent, guid, menu_order, post_type, post_mime_type, comment_count)
SELECT 
    post_id AS ID,
    0 AS post_author,
    DATE_ADD(date_field, INTERVAL post_id SECOND) AS post_date,
    DATE_ADD(DATE_ADD(date_field, INTERVAL post_id SECOND), INTERVAL - 8 HOUR) AS post_date_gmt,
    post_content,
    post_title,
    '' AS post_excerpt,
    'publish' AS post_status,
    'open' AS comment_status,
    'open' AS ping_status,
    '' AS post_password,
    post_slug AS post_name,
    '' AS to_ping,
    '' AS pinged,
    NOW() AS post_modified,
    UTC_TIMESTAMP() AS post_modified_gmt,
    '' AS post_content_filtered,
    0 AS post_parent,
    CONCAT('http://example.com/blog/',
            post_slug, '/') AS guid,
    0 AS menu_order,
    'post' AS post_type,
    '' AS post_mime_type,
    0 AS comment_count
FROM
    db.source ORDER BY ID;
carl-alberto commented 4 years ago

Hi Folks, just checking if this is the tree that contains the fix for this? https://github.com/Yoast/wordpress-seo/tree/indexable-fron and should now be working fine in the 12.6.2 version? It seems it is still slow when querying a million rows and generating a sitemap? Just checking if we should open up a new Issue for this or just reopen this one

Djennez commented 4 years ago

Indexables is the 14.0 release, you can update to 14.0.4 (current latest) to see the improvements. If this is not fixed, please report back here with details.

redvivi commented 4 years ago

On my end, I have switched to https://wordpress.org/support/plugin/bwp-google-xml-sitemaps/, which provides a much more resilient implementation and cache priming mechanism - though less reactive, than Yoast one for such large indexes.

Perhaps if it is not solved, it would be wise to get inspiration from this implementation.

On Fri, 1 May 2020 at 08:58, Djennez notifications@github.com wrote:

Indexables is the 14.0 release, you can update to 14.0.4 (current latest) to see the improvements. If this is not fixed, please report back here with details.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Yoast/wordpress-seo/issues/12161#issuecomment-622275684, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMDK7BO62ZRJAPJGH6PJARTRPJXJJANCNFSM4GUMZMQQ .

jdevalk commented 4 years ago

We haven't rebuild the XML sitemaps on top of indexables yet. That'll happen soon though :)

carl-alberto commented 4 years ago

Thanks for the feedback @Djennez @jdevalk :)

JeremieBALDY commented 4 years ago

@Djennez in v14.2 always the same problem. No improvement noted unfortunately.

@jdevalk We all look forward to the generation performance improvement! Especially on project-sitemap.xml for me (~35 seconds for 236 URLs), with divi (theme builder), php 7.4, powerful machine

stodorovic commented 4 years ago

@JeremieBALDY Please look #12304. It isn't Yoast SEO issue because Divi runs all shortcodes for all pages/posts/projects (which are created by Divi builder) to extract images. If you disable images in sitemap then you will see normal loading time. It'll be fixed when sitemaps get images on different way. Please report it to Divi too.

dhilditch commented 3 years ago

The main challenge of SQL for sitemaps is that MySQL doesn't scale well for OFFSET operations. The more rows you have the more rows it walks through to reach your offset. The queries in the plugin had been very specifically written and audited to account for that.

This is not true if you have an index which covers your WHERE statement followed by the same index covering your ORDER BY statement with the index columns in this order (WHERE columns followed by ORDER BY columns in the index).

See my article for proof where the SQL query is using OFFSET 483300 and runs in the same time regardless of which offset you're using.

If you're averse to adding indexes to help your slow queries, maybe populate a separate table under your control? Or maybe use the existing indexes on wp_posts if you can find a suitable one that covers the WHERE and ORDER BY statement.

jdevalk commented 3 years ago

@dhilditch a table under our own control is exactly what we already have in indexables, which we're planning to base the next iteration of our XML sitemaps on :)

laurent-le-graverend commented 3 years ago

@jdevalk do you have any ETA for this by any chance?

We're using Yoast Premium, for sites containing more than a decade of posts, some have nearly one million posts. Sitemaps are necessary but seem to be the root cause of recurring DB performances issues, even using dedicated DB instances with 512GB of memory.

Otherwise, do you have any recommendations for the time being? Maybe using a new index as it was proposed in this thread? 🙏

J-Rey commented 3 years ago

It looks like we can follow issue #15302 on this but doesn't look to have made much progress in the last year

wpsumo commented 3 years ago

@jdevalk Any update oin the rebuild of XML sitemap?

stmamick commented 2 years ago

I have observed the following with my 2 million pages (products). The sitemaps for 1000 took at least 1 minute to load each. When I had 50000 it was nearly an hour per sitemap. Now I am at 10 seconds with 50000 per sitemap, so perfectly fine. First change the links per sitemap to a bigger one if you have too many sitemaps, like so:

function max_entries_per_sitemap() { 
return 50000;
}
add_filter( 'wpseo_sitemap_entries_per_page', 'max_entries_per_sitemap' );

in: /wp-content/plugins/wordpress-seo/inc/sitemaps line 173 change the steps from 100 to the maximum. So if you set 50000 links per sitemap change it to this line:
$steps = min( 50000, $max_entries ); Because I observed that with the default one of 100 it would actually do get_posts() 500 times and also call the slow mysql query 500 times, which each took 10 seconds because of offset/limit. If you set it to the 50000 it will only do a single one that takes the same time as before the change. Then the loop in line 207-239 comment the things out below and write your own sitemap entry instead of calling the apply_filters function:

foreach ( $posts as $post ) {
    if ( in_array( $post->ID, $posts_to_exclude, true ) ) {
        continue;
    }
    if ( WPSEO_Meta::get_value( 'meta-robots-noindex', $post->ID ) === '1' ) {
        continue;
    }
    //Building the url array without calling anying fancy. Change the BASE_URL to the base url of your page.
    //Here you could also take the guid or build something else if not all your pages are linked directly after the base url
        $url = array("loc"=>"BASE_URL/".$post->post_name."/","mod"=>$post->post_modified_gmt,"chf"=>"daily","pri"=>0);
    if ( ! isset( $url['loc'] ) ) {
        continue; 
    }
         //This ones takes ages
         //$url = $this->get_url( $post );
     //$url = apply_filters( 'wpseo_sitemap_entry', $url, 'post', $post );

    if ( ! empty( $url ) ) {
        $links[] = $url;
        }
}

The line: $url = apply_filters( 'wpseo_sitemap_entry', $url, 'post', $post ); is extremely slow. I guessing for 50000 entires it would spend 20 minutes only with this line. After the change it does not really take any time. I also added indexes in the wp_posts table for all relevant columns (post_modified_gmt, post_type,...) as others have said. Altough the other changes were enough for the speed already.

jdevalk commented 2 years ago

This is the very first (very rough) iteration of code to retrieve the sitemap data + images from the indexables and links tables (for post, page and custom post type sitemaps only):

https://github.com/Yoast/wordpress-seo/pull/17580

If you can test this on a staging environment for a large site, i'd love to hear whether this works for you. I'm attaching a zip here of that pull for you to test with.

wordpress-seo-sitemaps-indexables.zip

void-zoop commented 9 months ago

So, old thread (seems to be the only one still up, did everyone just decide to walk away from Yoast?), same problems. Enormous loadspikes on our site despite trying every caching trick in the book, when asked to put in an xml file for the caching to set priorities, it became clear what the problem was. Trying to open the xml file takes tens of minutes (why isnt this cached? is it generated on each request? thats insane!), not something you want to show to search engines, it would be better for seo to actually not have a sitemap.xml, if its THAT slow.

Seeing that its now 2024 and these issues still have not been resolved (despite receiving an update for Yoast every other week or so), I tried another plugin for sitemap generation, XML Sitemap Generator for Google by Auctollo, and all server performance issues suddenly went away. Unbelievable that its been a SEO tool making our site this slow all this time (anytime something requested the sitemap, the site would grind to a halt).

Unbelievable to read that the Yoast team never took this seriously. Will be removing Yoast from 100+ hosted websites and cancelling any pro versions.

fliespl commented 8 months ago

Can't it be sorted using post_date instead of post_modified?