Yoast / wordpress-seo

Yoast SEO for WordPress
https://yoast.com/wordpress/plugins/seo/
Other
1.75k stars 885 forks source link

Indexables CLI command never seems to finish #15741

Open rmens opened 4 years ago

rmens commented 4 years ago

Please give us a description of what happened.

I tried to build the indexables database from the command line. But the command never seems to finish.

root@prod-web:/var/www/www.xxxxx.nl# wp yoast index --reindex This will clear all previously indexed objects. Are you certain you wish to proceed? [y/n] y Indexing posts 100% [===========================================] 36:13 / 35:44

Please describe what you expected to happen and why.

To finish or to give an accurate indication of the time left.

How can we reproduce this behavior?

  1. Run 'wp yoast index --reindex' from the terminal
  2. Watch the process reaching 100% but not finish

Technical info

Used versions

Djennez commented 4 years ago

@rmens can you check your database for the yoast_indexables table and see if there is a list of duplicate rows in there (I suspect somewhere at the end). If so, what type of objects are these?

rmens commented 4 years ago

@Djennez I checked, there are no duplicate rows as far a I can see (sorted by ID). There are still new IDs being created while the CLI shows 100%. These new entries are not duplicates.

Djennez commented 4 years ago

Interesting. So it isn't actually finished, it just shows the 100%-mark too early.

How many objects does your site roughly possess (users, terms, tax's, posts)? Any idea if something might be interfering with the indexation output?

rmens commented 4 years ago

The site has one custom taxonomy and has wordfence enabled. It makes use of a lot of ACF fields (about 15 per post). I can't think of anything that is interfering with the output, but it seems very inaccurate on this site.

rmens commented 4 years ago

Another observation. The process seems to slow down a lot after reaching 100%.

Couning the 'created at' values, it starts with about 3400 rows per minute but slows down a lot to 6 rows per minute.

Djennez commented 4 years ago

@rmens I've discussed this with one of the architects. The amount of objects to be indexed is calculated beforehand, so if new objects get created after that calculation, they are not included in the percentage, but they are still indexed. Can it be that your website is continuously generating indexable objects? Are all posts handwritten?

rmens commented 4 years ago

@Djennez All posts are handwritten. There are about 6 articles published per day. The postmeta is not growing very rapidly.

The indexation is now 1131 minutes in and still processing articles. The most recent one was an article from 2018 according to the permalink. The permalink was not a duplicate. This seem to be legitimate articles that are not 'seen' on the first calculation.

rmens commented 4 years ago

We are now 2580 minutes in and the process is still running and filling the Indexables table. It seems to slow down even further with sometimes 3 entries per minute, sometimes 1 entry per minute. I wonder if it will ever finish.

On another site of ours the same problem occurs. The process is way beyond the original quoted 37 minutes and slows down a lot over time.

Schermafbeelding 2020-07-30 om 11 38 32

Something that might be of interest: the database server for this sites is not on localhost, but is on the LAN on a separate server.

rmens commented 4 years ago

After more than a week I killed the process and updated to Yoast SEO Premium 14.7 to try again.

Not much seems to have changed. The indexable command indicated an ETA of ~30minutes, but keeps running after 100%. Funny thing is when I kill it (ctrl-c) and re-start the indexation without the --reindex option, it always finds new values to index and gives an ETA of 11 minutes, to not finish after that. There are no new posts published at that point.

To experiment further I created a copy of the site on a different server and switched our custom theme to twenty-twenty. The theme switch made the indexation work! Same database, same plugins, just activating another theme.

Is there any more information I can give you guys to help solving this mystery, or insert some analyzer/debugger/verbose mode in the process @Djennez? Would be more than happy to.

Djennez commented 4 years ago

That is weiiiird. There isn't a whole lot related to themes that would be able to have any significant impact on this WP-CLI command. Does the theme contain any yoast/wpseo-related filters? Like wpseo_post_type_archive_indexation_limit?

Besides the theme difference, it's probably also the case that the live version has visitors, while your test environment does not. So CPU / memory-wise, there may also be complications. Some servers may have some sort of priority system that would lower the priority of a commandline process in favor of the webserver. While that could explain the slowdown, it would not explain the fact that the script is indexing past 100%...

rmens commented 4 years ago

We use a filter to hook in some schema.org data for reviews, but not other things like limits. The theme uses ACF on posts for some business logic and editorial extra's. That's the main things I know.

I can reproduce the problems on our test environment without any live traffic. It has the exact same setup. So live traffic is not the issue here.

If it helps I can send you a copy of the theme via a more secure way, or obtain a list of all yoast/wp-seo filters we use.

Djennez commented 4 years ago

Hmm, a test version of your theme might be helpful. If that reproduces the issue for me as well we can more definitively say that the issue is theme-related (albeit: weird) and I can dig a bit further.

If you have a stripped version of your theme (please make sure there are no secrets in there) you can upload / send it via / to ***. I can't guarantee I'll immediately take a crack at it, but it is interesting to know what can cause this.

rmens commented 4 years ago

FYI I just tried this with Yoast 14.8 and that didn't fix the issue.

Djennez commented 4 years ago

I got around to checking it, but with the theme installed I can still index without issues. So we're looking at a somewhat more complex chain of events that is causing this.

Testing this did spark a possible workaround idea. If the theme is indeed the cause in your case, can you run the CLI command with the --skip-themes parameter? This should skip theme initialization and possibly skip the code that is causing this loop.

rmens commented 4 years ago

With wp yoast index --reindex --skip-themes parameter it worked without issues.

Schermafbeelding 2020-08-31 om 13 58 30

When I did a wp yoast index --reindex it got stuck again.

So there seems to be something in the database that the theme calls that it's chewing on. I'm not sure how to proceed.

Djennez commented 3 years ago

Looks like a similar issue can be reproduced by setting WP_ENVIRONMENT_TYPE to something else than production. This is either a completely different issue, or the same issue with different causes. Because this thread was created with WP 5.4, which was not using WP_ENVIRONMENT_TYPE I believe.

rmens commented 3 years ago

That's correct. We do not use WP_ENVIRONMENT_TYPE (but should probably do). I looked trough the commits and there seems to be a fix for the WP_ENVIRONMENT_TYPE problems in 15.1. Should I test again after that is released?

rmens commented 3 years ago

Just checked again with Yoast 15.2.1 and the behavior is unchanged. It keeps on hanging on 100%. With --skip-themes it does finish, but there is another problem coming up. The "analyze internal links" tool never worked for this site either. Now that it's grouped under the same button, the whole Optimize SEO Data in the wp-admin show that it's never complete.

Djennez commented 3 years ago

@rmens I may have a new lead on a possible cause. If you're still having this issue, can you check if your yoast_indexables table has any entries with post_status = 'unindexed'? If so, can you compile a list of counts of every object_type that matches that?

rmens commented 3 years ago

@Djennez Yes, this is still an issue. I ran the query. There are 1351 rows matching the post_status = 'unindexed' and they all have the object_type post. I ran this on production with Yoast 15.5.

I compared some of the object_id values in the list to ID in wp_posts. Having checked five, all the object ID's of the rows with post_status = 'unindexed' don't exist as ID in the wp_posts table. That's weird!

Djennez commented 3 years ago

@rmens cool! Well, not cool, but it confirms that we're on the right track. I've recently opened an internal bugreport about this. These unindexed entries are created as a failsafe when an object can't be successfully built as an indexable. And it is set to use the posttype and ID of the object that is failing to build. However, when there are more than 25 of these entries of a single posttype, it will loop when the indexable building starts.

Based on the ID's, do you have any idea if there could be any reason in your site why these IDs are being considered for indexable building?

And, this will probably not fix your issue, but are these unindexed entries recreated if you delete them from the database and rerun the indexing process?

rmens commented 3 years ago

@Djennez After deleting the unindexed entries and running the process again, it created 3 unindexed entries. When I delete these again and run the index process again, it immediately makes the 3 same records again. They have the same object_id each time. These do not exist as posts.

The 3 unindexed entries have an object_id of 42171, 43080 and 44579. Which we don't have in the database. There is also no post meta referencing these post IDs.

I noticed the database has some interesting hops in the IDs. It goes from 41k to 51k and then up to 100k. The higher IDs have an older date too.

Schermafbeelding 2021-01-06 om 13 53 02 Schermafbeelding 2021-01-06 om 13 52 51
Djennez commented 3 years ago

With 3 unindexed entries the loop should not happen I believe, unless it encounters more of those entries while the index is running or the limit is set from 25 to 3 or lower. Apparently there is something in the site that comes up with those ID's. The fallback code for erroneous entries is literally saving the ID of the object that is being processed at that moment. Maybe there is more information to be found when that catch is triggered.

The jump in ID's is strange, but I'm not sure if its related. Those missing ones may be indexables that have been removed due to content that was deleted. I believe ID's are not reused once removed (not sure about that).

tony47 commented 3 years ago

Hello, I just experienced this problem today and found that it was caused by a post (attachment) having post_parent value the same as the ID.

mikeyarce commented 3 years ago

I'm digging into a very similar issue and it looks like the indexing is in an infinite loop due to postmeta values that contain Devanagari unicode characters. For example this attachment with the _wp_attatched_file meta of 2018/03/গোরেমে-ওপেন-এয়ার-মিউজিয়ম-২.jpg will loop infinitely, but if I change the meta value to test.jpg it indexes properly.

When the amount of indexables that are unable to be processed reaches the limit, it just loops endlessly on those objects and the indexing never completes.

rmens commented 3 years ago

Can you tell us if the issue with the loop when there are 25 or more unindexed items is fixed in a newer version of the plug-in @Djennez? That would be a good moment to start testing again.

mikeyarce commented 3 years ago

Update to https://github.com/Yoast/wordpress-seo/issues/15741#issuecomment-772255415 My problem had to do with my testing site and it's locale which was breaking the output of parse_url due to different unicode characters. This problem was not present on our production servers.

However, I found another problem causing an infinite loop. There's a query that counts how many posts still need to be indexed, and it's a no limit query: https://github.com/Yoast/wordpress-seo/blob/38465e0488e40223bad8917b1e0fa55389a27c0c/src/actions/indexing/indexable-post-indexation-action.php#L145-L149 Our production environments have a hard limit of 1000, so in this case, the same 1000 IDs are always queried so the indexing never stops because there's always more 😄

ovidiul commented 3 years ago

I've done some testing of the code from the get_query method and one of our clients has over 600K posts in the datababase, the original one always spends over 28 seconds to perform just to get the count of posts to index.

I did adjust the query with a LEFT JOIN option as below

protected function get_query( $count, $limit = 1 ) {
        $indexable_table = Model::get_table_name( 'Indexable' );
        $post_types      = $this->get_post_types();
        $replacements    = $post_types;

        $select = 'P.ID';
        if ( $count ) {
            $select = 'COUNT(P.ID)';
        }
        $limit_query = '';
        if ( ! $count ) {
            $limit_query    = 'LIMIT %d';
            $replacements[] = $limit;
        }

        return $this->wpdb->prepare(
            "
            SELECT $select
            FROM {$this->wpdb->posts} AS P
            LEFT JOIN $indexable_table AS I ON P.ID = I.object_id AND I.object_type = 'post'
            AND I.permalink_hash IS NOT NULL
            WHERE  I.object_id is Null AND P.post_type IN (" . \implode( ', ', \array_fill( 0, \count( $post_types ), '%s' ) ) . ")
            $limit_query",
            $replacements
        );
    }

and that one seems to take a little over 4 seconds to execute,

Screenshot 2021-02-08 at 16 25 10

It goes the same for the non-count query, the original query takes 4 seconds to fetch 25 ID results, while the LEFT JOIN takes 0.28 sec.

Happy to propose a pull request if needed.

sacerro commented 3 years ago

I had some similar issue. In my case for some unknown unrelated issue with Yoast, some plugin deletes some terms from tems table but not deleted from term_taxonomy table.

When yoast indexation process start I could see that Yoast generates some null values in table and set as "unindexed" and it starts to run looped (in frontent or cli I could see is looped with the same "object_id" and the same ID in wp_yoast_indexable table).

I run a query to "fix" this deleted terms (for this example, WP prefix is "wp_") by run a DELETE from term_taxonomy table:

DELETE wptt FROM wp_term_taxonomy AS wptt LEFT JOIN wp_terms AS wpt ON wpt.term_id = wptt.term_id WHERE wpt.name IS NULL; Later I truncate wp_yoast_indexable table, and run process again. And now all works without problems.

Maybe this deleted terms are related to this issue?

rmens commented 3 years ago

Just ran the indexable command again with Yoast 16.1.1 since it included some fixes for the indexable proces.

Indexing posts 49 % [=====================================================> ] 253:31 / 42895:48

The behavior is different now. The time keeps creeping up and it never goes beyond 49%.

There are still many entries with post_status unindexed in wp_yoast_indexable. 727 of them are of the type post and one is of the type term.

Schermafbeelding 2021-04-08 om 18 24 29
Djennez commented 3 years ago

Yeah, unfortunately it's still an issue. There are some (architectural) issues open on our internal tracker to get these solved.

rmens commented 3 years ago

While we are approaching one year of problems with the indexable, I was hopeful to find "Fixes a bug where the SEO optimization could run indefinitely when the database contained at least 25 faulty indexables without a permalink." in the 16.7 release notes.

Unfortunately things didn't change. There is some other behavior now. While running a re-index from the command line, the progress gets to 7% and comes to a halt after that.

When checking the wp_yoast_indexable table it shows about 400 unindexed entries with the object_type post. The newest unindexed post had object_id 42200 while the highest post_id in my wp_posts is 42195. The wp_postmeta table also doesn't refer to any post with ID 42200.

metodiew commented 1 year ago

hey @Djennez out of curiosity, is there any movement in this issue or a related one? We have a project where the wp_yoast_indexable is 300GB, where wp_posts and wp_postmeta are less than 13GB combined

amboutwe commented 1 month ago

Please inform the customer of conversation # 1154785 when this conversation has been closed.