alleyinteractive / wp-bulk-task

A library to assist with running performant bulk tasks against WordPress database objects.
GNU General Public License v2.0

Bug: bulk task always loops over all posts (from all post types) #30

Closed: renatonascalves closed this issue 3 weeks ago

renatonascalves commented 1 month ago

Description of the bug

@srtfisher brought to my attention one key flaw of the current implementation.

Currently, in large databases with millions of posts across different post types, the while loop keeps running, and WP_Query objects keep being created and queried, even after all of the posts from the specific post type targeted by the Bulk_Task have been handled.

The source of the issue seems to be the way we are getting the max_id here: https://github.com/alleyinteractive/wp-bulk-task/blob/6772f7ab6abd5e1209ba5b80fb37cf6a6cf1e33d/src/class-bulk-task.php#L512

In really large databases, this value is the max ID across all posts in the database, not only the ones targeted by a task like this:

( new Bulk_Task( 'test_filtered_run' ) )->run(
    [
        'post_type' => 'cars',
    ],
    function ( WP_Post $post ): void {
        wp_update_post( $post );
    }
);

So, in a scenario where the cars post type might have 100 posts but the posts table has 30,000k posts from other post types, the bulk task will continue to loop over the rest of the IDs (in batches, of course) until it reaches the end of the table, long after the 100 cars posts were processed, instead of only looping over the 100 posts from the cars post type.
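To make the difference concrete, here is a rough sketch using $wpdb (my own illustration; the library's actual max_id query may be structured differently, and the post-type-constrained variant is only a hypothetical fix, with 'cars' as a placeholder):

global $wpdb;

// Roughly what an unconstrained upper bound looks like: the highest ID in the
// entire posts table, regardless of post type.
$max_id_all = (int) $wpdb->get_var( "SELECT MAX(ID) FROM {$wpdb->posts}" );

// A post-type-aware upper bound (hypothetical) would let the loop stop as soon
// as the last cars post has been passed.
$max_id_cars = (int) $wpdb->get_var(
    $wpdb->prepare( "SELECT MAX(ID) FROM {$wpdb->posts} WHERE post_type = %s", 'cars' )
);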

The issue affects all callbacks (user/term/post), except for the CSV one, since it uses the maximum number of rows available in the CSV file.

--

I have some ideas for a solution, but I'm opening this issue first to get more feedback.

Steps To Reproduce

Here is some code you can use to test this. I tested it using the unit tests; besides the logged output below, the time it takes to loop over everything also makes the problem visible.

diff --git src/class-bulk-task.php src/class-bulk-task.php
index 7e37a19..5bb2423 100644
--- src/class-bulk-task.php
+++ src/class-bulk-task.php
@@ -521,7 +521,7 @@ class Bulk_Task {
                $this->before_run();

                // All systems go.
-               while ( $this->min_id < $this->max_id ) {
+               while ( $this->min_id < 30629476 ) { // <- put a high number.
                        // Build the query object, but don't run it without the object hash.
                        $this->query = new WP_Query();

@@ -531,6 +531,10 @@ class Bulk_Task {
                        // Run the query.
                        $this->query->query( $args );

+                       error_log( wp_json_encode( 'Log the output' ) );
+                       error_log( wp_json_encode( $args ) );
+                       error_log( wp_json_encode( $this->query->post_count ) );
+
                        // Fork for results vs. not.
                        if ( $this->query->have_posts() ) {
                                // Invoke the callable over every post.

Additional Information

I also double-checked other implementations, like the one VIP uses: https://docs.wpvip.com/vip-cli/wp-cli-with-vip-cli/cli-commands-at-scale/

They use a different strategy than this one: they loop using the paged argument. The VIP implementation makes sure that only the posts matching the specific query are looped over.
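For reference, a minimal sketch of that paged strategy (my own illustration of the general pattern, not code from VIP or from this library; the 'cars' post type and the page size are placeholders):

$paged = 1;

do {
    // Each page only contains posts matching the query, so the loop stops as
    // soon as the matching posts are exhausted.
    $query = new WP_Query(
        [
            'post_type'      => 'cars',
            'posts_per_page' => 100,
            'paged'          => $paged,
        ]
    );

    foreach ( $query->posts as $post ) {
        wp_update_post( $post );
    }

    $paged++;
} while ( $paged <= $query->max_num_pages );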

kevinfodness commented 1 month ago

This behavior is intentional. VIP's implementation uses a standard get_posts, which runs WP_Query under the hood, and which as we know can have very bad performance for particular types of queries (for example, a bulk task that is looking for a partial match in the post_content field - think targeting links to an old domain or http:// where a straightforward search-replace isn't sufficient because we have to do additional processing).

This library instead uses a "seek" methodology, where it looks for posts matching its criteria within a smaller slice of the database (10,000 IDs by default). Those queries are very efficient. In your use case, where you have a database with millions of rows, the bulk task is able to skip over the rows with no match very quickly, because it processes them in batches of 10,000 and the queries that return no results complete in a time approaching zero.

Yes, you could micro-optimize this library to select the max post ID that matches the particular post type(s) you are looking for, but it's also possible to run a task without specifying post types at all (i.e., run on all post types), and it would be (imo) a small optimization for a narrow use case.
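Roughly, the seek approach described above works like this (a simplified sketch of the idea, not the library's actual implementation; the batch size, query shape, and the 'cars'/'publish' criteria are assumptions based on the description):

global $wpdb;

$batch_size = 10000;
$min_id     = 0;
$max_id     = (int) $wpdb->get_var( "SELECT MAX(ID) FROM {$wpdb->posts}" );

while ( $min_id < $max_id ) {
    // Scan one 10,000-ID-wide slice of the table per iteration. Slices with no
    // matching rows come back empty almost instantly, because the range scan on
    // the primary key is cheap.
    $ids = $wpdb->get_col(
        $wpdb->prepare(
            "SELECT ID FROM {$wpdb->posts} WHERE ID > %d AND ID <= %d AND post_type = %s AND post_status = %s",
            $min_id,
            $min_id + $batch_size,
            'cars',
            'publish'
        )
    );

    foreach ( $ids as $id ) {
        // ... invoke the callback on each matching post ...
    }

    // Advance the window by the batch size whether or not anything matched.
    $min_id += $batch_size;
}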

Putting some numbers here - I created a test bulk task on a database with about 7.5 million records in the posts table and set it up to find all published posts of a nonexistent post type. It was able to scan through all 7.5 million records in 4.4 seconds. I'd rather not add a micro-optimization to shave off those few seconds if it adds complexity to the library that could itself be the source of other bugs in the future.

kevinfodness commented 4 weeks ago

After discussing with a few folks, we've determined that the current method of constraining by a min and max ID may not be more performant than using a min ID with a LIMIT clause in current versions of MySQL, and for queries where there aren't a lot of matches, removing the max ID constraint could result in a significant speedup, so that's the approach I'm looking at here.
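In rough terms, that means replacing the fixed ID window with keyset pagination. A sketch of the idea (again my own illustration under the same placeholder assumptions, not the final implementation):

global $wpdb;

$batch_size = 10000;
$min_id     = 0;

do {
    // Let MySQL walk the primary key and return the next batch of matching rows
    // directly, instead of bounding each query by a max ID.
    $ids = $wpdb->get_col(
        $wpdb->prepare(
            "SELECT ID FROM {$wpdb->posts} WHERE ID > %d AND post_type = %s ORDER BY ID ASC LIMIT %d",
            $min_id,
            'cars',
            $batch_size
        )
    );

    foreach ( $ids as $id ) {
        // ... invoke the callback on each matching post ...
    }

    // The loop ends as soon as a batch comes back empty, so it never scans past
    // the last matching post.
    if ( ! empty( $ids ) ) {
        $min_id = (int) max( $ids );
    }
} while ( ! empty( $ids ) );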