Yoast / wordpress-seo

Yoast SEO for WordPress
https://yoast.com/wordpress/plugins/seo/
Other
1.75k stars 885 forks source link

XML Sitemaps integration #15302

Open jdevalk opened 4 years ago

jdevalk commented 4 years ago

We want to let the Yoast SEO XML sitemaps run on indexables as this will perform much better. Some of the work needed for this is already done in the jdv/xml-sitemaps branch, see its changes compared to trunk.

As soon as a site has completed indexing, the current XML sitemaps code (henceforth known as “classic XML sitemaps”) should no longer be loaded. Instead the new XML sitemaps code (henceforth: "XML sitemaps integration") should be loaded.

The XML sitemaps for Yoast SEO should remain functionally the same regardless of which of them runs.

We'll also add a feature toggle that allows switching to classic XML sitemaps, so users can force the old implementation with multilingual solutions that are not compatible.

Code parts of the XML sitemaps integration

All of the code in classic XML sitemaps should have a counterpart in the XML sitemaps integration.

Images

Images in the XML sitemaps are taken from the internal linking table. Therefore we no longer need to parse every post. All the images and all the URLs for a sitemap should be grabbed in one find_many call.

Tasks

omarreiss commented 4 years ago

Let's also make sure the Router has a method of registering routes, so we can register routes for the sitemap similar to the way we register WP REST routes. This way we enable the router to actually do the routing and we keep the interface as consistent as possible.

In a normal web application, a router would also take care of redirects. I was thinking the Router could act as middleware for the redirect handler in Premium and the redirects integration in free as well. But since redirects need to be done as soon as possible in the loading order and the Router would only route on template_redirect this doesn't seem like the best idea.

mmikhan commented 3 years ago

Please inform the customer of conversation # 734729 when this conversation has been closed.

suascat commented 3 years ago

Please inform the customer of conversation # 744280 when this conversation has been closed.

wpsumo commented 3 years ago

@jdevalk @iamazik Any update oin the rebuild of XML sitemap?

archon810 commented 3 years ago

We run a busy server with lots of data and visitors, and started hitting this issue more and more lately. When it happens, the same query is stuck by the hundreds with various offsets in MySQL, and slave load shoots up to 300-400. We have 3 32GB 8-CPU slave servers, and this is the only situation currently that brings them to their knees.

Here's an example: image

And a sample of stuck queries: https://gist.github.com/e55ae2677ac44c7c3c3cf07004e5121e

These queries not only bog down the db servers and reduce the number of available connections, they also make the slaves lag.

From what I can tell, it is indeed sitemap generation that causes these. Once one query, which is as noted in https://github.com/Yoast/wordpress-seo/issues/12161 quite expensive, takes a long time, it snowballs when more sitemaps are accessed and eventually destroys everything.

Is generating sitemaps from indexables the solution here? If so, can we please get an update on the status of this task? It's coming up on 2 years now. With Yoast's sale, will there be more resources dedicated to resolving such deal-breaking issues?

Thank you.

archon810 commented 3 years ago

By the way, in the queries the LIMIT is 100, meaning 100 posts at a time, but the XML sitemaps contain 1000 entries, which is the default. Why is each query only getting 100 at a time instead of 1000?

Changing 100 to 1000 in the query doesn't really affect the run time negatively because the majority of the time is spent on searching and ordering, not returning data.

https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-post-type-sitemap-provider.php#L545 https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-sitemaps.php#L580

wpsumo commented 3 years ago

https://www.wpintense.com/2019/02/04/performance-optimisation-for-various-xml-sitemap-plugins/

archon810 commented 3 years ago

We found where the step is being reset to 100 for some reason https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-post-type-sitemap-provider.php#L175. As a temporary workaround, can you provide a filter to override this behavior perhaps?

archon810 commented 2 years ago

Even with the index described here https://github.com/Yoast/wordpress-seo/issues/12161#issuecomment-564685645, everything exploded again just now.

image

How can we get this under control? @jdevalk

archon810 commented 2 years ago

I think I can explain why this happens sometimes and brings our dbs to their knees - bots.

Specifically, when bots, whether malicious or not (like MailRu) start requesting dozens of sitemaps almost at the same time. I'm not sure if there's no cache for those, or if the cache expires, but all of them start generating at the same time and requests start piling up, each request making the issue worse and worse.

Here's an example of such an attack: image image

The evil IP is definitely a bot https://www.abuseipdb.com/check/91.213.50.11.

The solution is either moving sitemap regeneration to cron or always serving cached sitemaps while regenerating them in the background with another method. Whatever the solution, it can't involve running really expensive queries triggered by site visitors.

jdevalk commented 2 years ago

@archon810 if you have time, testing this (still in draft) pull request would be appreciated: https://github.com/Yoast/wordpress-seo/pull/17580

it's a rebuild of the XML sitemaps on indexables which should be loads faster.