cabrerahector / wordpress-popular-posts

WordPress Popular Posts - A highly customizable WordPress widget that displays your most popular posts.
https://wordpress.org/plugins/wordpress-popular-posts/
GNU General Public License v2.0

Heavy impact of large amounts of SELECT queries #248

Closed: partyspy closed this issue 4 years ago

partyspy commented 4 years ago

I have encountered some performance impact while using this plugin, and I caught a lot of performance-consuming queries like this:

SELECT p.ID AS id, p.post_title AS title, p.post_author AS uid, v.pageviews 
    FROM `wp_posts` p 
    INNER JOIN (SELECT SUM(pageviews) AS pageviews, postid 
    FROM `wp_popularpostssummary` 
    WHERE view_datetime >= '2020-02-15 10:59:42' 
    GROUP BY postid) v 
    ON p.ID = v.postid 
    WHERE 1 = 1 
    AND p.post_type IN('post') 
    AND p.post_password = '' 
    AND p.post_status = 'publish'  
    ORDER BY pageviews DESC 
    LIMIT 10 OFFSET 0

Having traced the code back, I found that this is where it comes from. The get_items callback appears to execute this query every time it receives a POST REST request.

I am wondering why this route is registered? It looks unnecessary.

public function register_routes()
{
    $version = '1';
    $namespace = 'wordpress-popular-posts/v' . $version;

    register_rest_route($namespace, '/popular-posts', [
        [
            'methods'             => \WP_REST_Server::READABLE,
            'callback'            => [$this, 'get_items'],
            'permission_callback' => [$this, 'get_items_permissions_check'],
            'args'                => $this->get_collection_params()
        ],
        [
            'methods'             => \WP_REST_Server::CREATABLE,
            'callback'            => [$this, 'update_views_count'],
            'args'                => $this->get_tracking_params(),
        ]
    ]);
    // ...

cabrerahector commented 4 years ago

Hi @partyspy,

The get_items callback appears to execute this query every time it receives a POST REST request.

That's incorrect. POST requests are handled by the update_views_count method (see 'methods' => \WP_REST_Server::CREATABLE), which is in charge of tracking post/page views. It has nothing to do with the query you shared.

Regarding the performance issue itself, please read this and if you have any questions don't hesitate to ask.

cabrerahector commented 4 years ago

I am wondering why this route is registered? It looks unnecessary.

Almost forgot about this one. Here: the /popular-posts GET endpoint.

partyspy commented 4 years ago

Hi @cabrerahector Thanks for replying.

I'm not an expert in WordPress, sorry for the misunderstanding.

I did implement the persistent cache (with Memcached), yet I still caught a lot of performance bottlenecks on my server, regarding the following:

script_filename = /var/www/html/website/index.php
[0x00007f97b301cea0] mysqli_query() /var/www/html/website/wp-includes/wp-db.php:2007
[0x00007f97b301ce20] _do_query() /var/www/html/website/wp-includes/wp-db.php:1895
[0x00007f97b301cd40] query() /var/www/html/website/wp-includes/wp-db.php:2579
[0x00007f97b301cc50] get_results() /var/www/html/website/wp-content/plugins/wordpress-popular-posts/src/Query.php:488
[0x00007f97b301cbe0] run_query() /var/www/html/website/wp-content/plugins/wordpress-popular-posts/src/Query.php:54
[0x00007f97b301cb80] __construct() /var/www/html/website/wp-content/plugins/wordpress-popular-posts/src/Rest/Controller.php:100
[0x00007f97b301cab0] get_items() /var/www/html/website/wp-includes/rest-api/class-wp-rest-server.php:946
[0x00007f97b301c8b0] dispatch() /var/www/html/website/wp-includes/rest-api/class-wp-rest-server.php:329
[0x00007f97b301c740] serve_request() /var/www/html/website/wp-includes/rest-api.php:309
[0x00007f97b301c6a0] rest_api_loaded() /var/www/html/website/wp-includes/class-wp-hook.php:286
[0x00007f97b301c5c0] apply_filters() /var/www/html/website/wp-includes/class-wp-hook.php:310
[0x00007f97b301c550] do_action() /var/www/html/website/wp-includes/plugin.php:531
[0x00007f97b301c490] do_action_ref_array() /var/www/html/website/wp-includes/class-wp.php:387
[0x00007f97b301c230] parse_request() /var/www/html/website/wp-includes/class-wp.php:737
[0x00007f97b301c1b0] main() /var/www/html/website/wp-includes/functions.php:1105
[0x00007f97b301c110] wp() /var/www/html/website/wp-blog-header.php:16
[0x00007f97b301c080] [INCLUDE_OR_EVAL]() /var/www/html/website/index.php:17

According to PHP's slow log, I traced it back to the get_items callback. Line 100 in Controller.php, where Query() is instantiated inside get_items, appears to be consuming too much, and that is usually followed by a database breakdown.

Was that triggered by the /popular-posts GET endpoint? If so, under what scenario would there be a GET request?

FYI, my site traffic is around 150k visitors per day.

partyspy commented 4 years ago

UPDATE:

GET /wp-json/wordpress-popular-posts/v1/popular-posts: I checked my server log, and this request occurs a few times per minute on average. The frequency fluctuates, sometimes higher and sometimes lower. If a lot of these requests arrive in a short period of time, the database will probably break down.

The thing is, I have not used this GET API in my implementation, so I wonder where those requests come from.

cabrerahector commented 4 years ago

Was that triggered by the /popular-posts GET endpoint?

I'd say so. This endpoint uses the same class as the widget, the [wpp] shortcode and the wpp_get_mostpopular() template tag to fetch popular posts/pages/custom post types from the database. If PHP's log says the performance issue is being caused by calls to the get_items method then it's likely to be true.

If so, under what scenario would there be a GET request?

The plugin itself doesn't use this endpoint at all at the moment so the scenarios I can think of are:

My guess is that someone/something is repeatedly accessing your /wp-json/wordpress-popular-posts/v1/popular-posts endpoint for some reason which in consequence is making the plugin run the aforementioned database query so the endpoint can return the requested data.

Two possible solutions to prevent the abuse of this endpoint:

  1. Cache requests to this endpoint (which I'm planning to do after this discussion, but if you can't wait for me to implement this -which might take a while- maybe this plugin can help: WP REST Cache). A rough sketch of this approach is included at the end of this comment.
  2. The nuclear option - completely remove this specific endpoint:
/**
 * Removes the /wordpress-popular-posts/v1/popular-posts REST API GET endpoint.
 *
 * @param   array   $endpoints
 * @return  array   $endpoints
 */
function wp6514_remove_wpp_popular_posts_get_endpoint($endpoints)
{
    $base_endpoint = "/wordpress-popular-posts/v1/popular-posts";

    foreach( $endpoints as $maybe_endpoint => $object ) {
        if ( false !== strpos($maybe_endpoint, $base_endpoint) ) {
            foreach( $endpoints[$maybe_endpoint] as $index => $endpoint ) {
                if ( 'namespace' === $index )
                    continue;

                if (
                    isset($endpoint['methods']) 
                    && 'GET' == $endpoint['methods'] 
                    && 'get_items' == $endpoint['callback'][1]
                ) {
                    unset($endpoints[$maybe_endpoint][$index]);
                }
            }
        }
    }

    return $endpoints;
}
add_filter('rest_endpoints', 'wp6514_remove_wpp_popular_posts_get_endpoint');

Note that the code above is basically a "hack", so it might stop working in future versions of WordPress/WordPress Popular Posts. If you decide to use it, do so with caution.
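
For option 1, here's a rough sketch of what endpoint-level caching could look like, using WordPress' rest_pre_dispatch / rest_post_dispatch filters and a transient. This is an illustration rather than the plugin's own implementation: the function names, the cache key scheme and the 5-minute TTL are arbitrary choices.

/**
 * Illustration only: serves GET /wordpress-popular-posts/v1/popular-posts responses
 * from a transient, refreshing the cached copy at most every 5 minutes.
 */
function wp6514_wpp_rest_cache_key($request)
{
    // One transient per route + query string combination.
    return 'wp6514_wpp_rest_' . md5($request->get_route() . serialize($request->get_query_params()));
}

function wp6514_serve_cached_popular_posts($result, $server, $request)
{
    if (
        'GET' === $request->get_method()
        && false !== strpos($request->get_route(), '/wordpress-popular-posts/v1/popular-posts')
    ) {
        $cached = get_transient(wp6514_wpp_rest_cache_key($request));

        if ( false !== $cached ) {
            // Short-circuits dispatching, so get_items() (and its query) never runs.
            return new \WP_REST_Response($cached, 200);
        }
    }

    return $result;
}
add_filter('rest_pre_dispatch', 'wp6514_serve_cached_popular_posts', 10, 3);

function wp6514_cache_popular_posts_response($response, $server, $request)
{
    if (
        'GET' === $request->get_method()
        && false !== strpos($request->get_route(), '/wordpress-popular-posts/v1/popular-posts')
        && 200 === $response->get_status()
    ) {
        set_transient(wp6514_wpp_rest_cache_key($request), $response->get_data(), 5 * MINUTE_IN_SECONDS);
    }

    return $response;
}
add_filter('rest_post_dispatch', 'wp6514_cache_popular_posts_response', 10, 3);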

partyspy commented 4 years ago

@cabrerahector Thank you for explaining this. I think I'm going to go with the nuclear option for now.

partyspy commented 4 years ago

BTW, just asking if I may :) Is there any method you know of to check whether Memcached has already successfully cached the plugin's queries? Since Memcached is applied to my whole site, I find it a bit hard to tell.

cabrerahector commented 4 years ago

That's a difficult one haha. I'll try to explain the best I can.

TLDR;

Unfortunately no, currently there's no easy way to tie data stored in Memcached to a specific SELECT query generated by WPP. You can, however, verify that the pageviews cache is working by checking whether a key containing _wpp_cache exists in Memcached.

The long version:

Currently, only the pageviews cache uses a key that clearly identifies itself as belonging to WPP (the key contains this string: _wpp_cache). This was so I could actually see whether the pageviews were being stored in Redis or not when I was testing the in-memory caching integration feature nearly 2 years ago (OMG, time flies!)

Before said feature was implemented, data from SELECT queries generated by WPP widgets, [wpp] shortcodes and the wpp_get_mostpopular() template tag were (and still are) stored in transients. Because transients are supposed to delete themselves after a fixed period of time I didn't see a need to give them a meaningful key name (eg. _wpp_some_transient_id) so they could be easily identified. You'll need to check the actual keys and their values to find the ones related to each popular post instance on the public-facing part of your website (I might change this in the future though so each instance has its own WPP key name.)

To see the actual data that's currently stored on Memcached you can try this SO answer. Based on the number of upvotes it has I'm confident that it'll do the trick, although I haven't tried it myself.

partyspy commented 4 years ago

Thanks for elaborating on this, @cabrerahector. I dumped all the Memcached keys using the method you recommended (this SO answer). However, I didn't find any keys containing _wpp_cache, which is weird. I did enable the pageviews cache following the steps in your documentation.

cabrerahector commented 4 years ago

Hmm, no idea to be honest. I'll try to set up a Docker container with Memcached enabled sometime this week to do some testing. I'll keep you posted!

partyspy commented 4 years ago

Ok, Thank you anyway. I'll keep troubleshooting it.

partyspy commented 4 years ago

Hey, I've done some digging and checked: Memcached did successfully cache the key on the views update part. No idea why the key couldn't be found using that dumping method.

Yet I have another question. Is there a way to get the popular list without using your shortcode feature while still being able to use the cache facilities? It just seems that wpp_get_mostpopular() doesn't suit me, since I want to do my own custom HTML rendering.

I was using it like this:

$args = array(
    'limit'    => 20,
    'range'    => 'last24hours',
    'order_by' => 'views',
    'offset'   => ($paged - 1) * 20
);

if ( is_plugin_active('wordpress-popular-posts/wordpress-popular-posts.php') ) {
    $popular_posts = new WordPressPopularPosts\Query($args);
    $results = $popular_posts->get_posts();

    foreach ( $results as $popular_post ) {
        $post = get_post($popular_post->id);
        setup_postdata($post);
        get_template_part('loop', 'post'); // deal with the $post in a template, custom rendering...
    }

    wp_reset_postdata();
}

But this way there seems to be no cache access during the query. Is there an easier API for this kind of purpose?

cabrerahector commented 4 years ago

Memcached did successfully cache the key on the views update part. No idea why the key couldn't be found using that dumping method.

Oh, glad that you were able to figure it out. Thanks for the update, saves me the trouble of setting up Memcached on a Docker container.

Is there an easier API for this kind of purpose?

Yep, the Transients API. Here's a good example of its implementation (which is basically what the plugin does to cache stuff.)
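
For reference, here's a minimal sketch of that pattern applied to the snippet you posted above. The transient key and the 5-minute expiration are arbitrary choices; adjust them to your needs.

// Illustration only: wrap the Query call from your snippet in a transient so the
// heavy SELECT runs only when the cached copy has expired.
$args = array(
    'limit'    => 20,
    'range'    => 'last24hours',
    'order_by' => 'views',
    'offset'   => ($paged - 1) * 20
);

$cache_key = 'my_wpp_popular_' . md5(serialize($args));
$results = get_transient($cache_key);

if ( false === $results ) {
    $popular_posts = new WordPressPopularPosts\Query($args);
    $results = $popular_posts->get_posts();
    set_transient($cache_key, $results, 5 * MINUTE_IN_SECONDS);
}

// $results can now be looped over for your custom rendering, as in your snippet.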

partyspy commented 4 years ago

Thanks. That seems like a smart approach. I'll give it a try :)

partyspy commented 4 years ago

Hey, I have managed to add caching to the popular list query. But I have some other feedback for you.

As time goes by the wp_popularpostssummary table becomes huge. I've run the plugin for a week or so and the table has gained millions of records, still growing by more than a million records per day. And the execution time of the query mentioned at the opening of this issue keeps getting longer. FYI, when running without cache (the first time, or when the cache expires), the query to get the popular list for the last24hours range currently takes more than 30 seconds, and it appears to cost more as the dataset grows.

Subqueries like this one seem to be the performance bottleneck:


SELECT SUM(pageviews) AS pageviews, postid 
    FROM `wp_popularpostssummary` 
    WHERE view_datetime >= '2020-02-19 10:59:42' 
    GROUP BY postid

cabrerahector commented 4 years ago

I've run the plugin for a week or so and the table has gained millions of records, still growing by more than a million records per day.

Yep, this is mentioned in the Performance section of the Wiki. That's what the pageviews cache seeks to mitigate. Of course this won't prevent your database tables from continuing to grow over time (it'll only slow the growth down), so:

  1. Improve the hardware/software behind your website so it can manage this amount of traffic (load balancers, master-slave/master-master database configurations, etc), or
  2. Have WPP periodically trim its summary database table so it doesn't grow too much for your database server to handle (see the sketch right after this list), or
  3. Both of the above.
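
To illustrate option 2, here's a rough sketch of a daily WP-Cron task that deletes summary rows older than 24 hours. This is not a built-in WPP feature: the hook name is arbitrary, and the retention window and the way the cutoff datetime is computed should be adjusted to your setup.

// Illustration only: trim the popularpostssummary table once a day, keeping just
// the last 24 hours of pageview records.
add_action('wp6514_trim_wpp_summary', function () {
    global $wpdb;

    $wpdb->query(
        $wpdb->prepare(
            "DELETE FROM {$wpdb->prefix}popularpostssummary WHERE view_datetime < %s",
            // Assumes view_datetime is stored in the site's local time; adjust if needed.
            date('Y-m-d H:i:s', current_time('timestamp') - DAY_IN_SECONDS)
        )
    );
});

add_action('init', function () {
    if ( ! wp_next_scheduled('wp6514_trim_wpp_summary') ) {
        wp_schedule_event(time(), 'daily', 'wp6514_trim_wpp_summary');
    }
});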

The plugin has been tweaked for performance over the years but there's only so much it can do on its own. At some point -and your site seems to be a good example of this- you will need additional measures to make your website scalable so performance doesn't become an issue later on.

Clauses like this seem to be the bottleneck of performance.

That's needed so the plugin can fetch the actual views count for the selected time range, but if you have suggestions on how to further improve the queries generated by WPP I'm all ears.

You can also try limiting the execution time of the query and see if that helps.
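
For example, one way to do that (an assumption on my part, not necessarily what the linked article describes) is MySQL's max_execution_time session variable, available since MySQL 5.7.8. It's expressed in milliseconds and only affects read-only SELECT statements; MariaDB uses max_statement_time (in seconds) instead.

// Illustration only: cap SELECT statements at 5 seconds for the current connection.
global $wpdb;
$wpdb->query('SET SESSION max_execution_time = 5000');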

partyspy commented 4 years ago

Thanks for the reply. Correct me if I'm mistaken about what follows.

My site's hardware/software architecture is a good fit for now, and it seems like too much to upgrade it for a single feature. Anyway, I suppose trimming the summary table might not mitigate my problem, since I gain around a million records every day and that alone is already a heavy dataset. Currently the pageview count records are generated and grouped on the basis of timestamps (seconds), and even when the pageviews cache is applied it only reduces the number of write operations, not the number of records.

Say some users (like me) don't need the interval granularity to be as fine as seconds or hours and only need it at the day level: a table without the view_datetime column, with pageviews accumulated per date rather than per datetime, would drastically lighten the dataset. Maybe that's an option that could be provided?
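
To make the idea concrete, this is roughly what I mean (just an illustration on my side; the wp_popularpostsdaily table is hypothetical and would need a unique key on postid + view_date):

// Illustration only: one row per post per day instead of one row per view timestamp.
global $wpdb;
$wpdb->query(
    $wpdb->prepare(
        "INSERT INTO {$wpdb->prefix}popularpostsdaily (postid, view_date, pageviews)
         VALUES (%d, %s, 1)
         ON DUPLICATE KEY UPDATE pageviews = pageviews + 1",
        $post_id,              // ID of the post being viewed (hypothetical variable)
        current_time('Y-m-d')  // accumulate on the date, not the full datetime
    )
);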

cabrerahector commented 4 years ago

a table without the view_datetime column, with pageviews accumulated per date rather than per datetime, would drastically lighten the dataset

This won't solve the issue IMO. That new table would also grow continuously as time goes on -and I know because people complained about this very same issue way before the custom time range feature was added to the plugin- with the only difference being that it'd happen at a slower rate. Eventually this new table, if left alone, could get large enough to affect your site's performance, and then you'd have two huge tables sitting in your database.

If upgrading to a more robust server setup is not an option for you right now, then caching + limiting the amount of data WPP stores in your database (eg. keep only the last 24 hours of pageviews, which is what you're using, and discard everything else) are the only things you can do -at this point in time, at least- to improve your site's performance.

partyspy commented 4 years ago

That new table would also grow continuously as time goes on -and I know because people complained about this very same issue way before the custom time range feature was added to the plugin- with the only difference being that it'd happen at a slower rate.

If the records were not being written at basically one pageview per timestamp, I think the growth rate would be much slower, no?

keep only the last 24 hours of pageviews, which is what you're using, and discard everything else

In regard to my situation, the problem is that the dataset accumulated within 24 hours is already large (millions of records), and query operations on such a dataset are heavy enough. I don't think trimming data older than 24 hours would help much.

cabrerahector commented 4 years ago

I think the growth rate would be much slower, no?

Yep, as I stated before. Then, in 1 month's time for example, you'd have the same problem: the database table has grown too big and you'd be back asking yourself what to do about it.

Note that this isn't an issue specific to WordPress Popular Posts. Any other similar plugin/script that stores a large amount of data on your database will give you the same challenges you're having now because you're introducing a new, resource-intensive feature to your site. It's up to you how to handle that.

partyspy commented 4 years ago

Then, in 1 month's time for example, you'd have the same problem: the database table has grown too big and you'd be back asking yourself what to do about it.

Sure it will grow. But here we're talking about a dataset that could be thousands of times smaller than it is now. I took a random post to check, and in my situation the pageview records for a single post in one day can number in the thousands, since they're keyed by timestamp; in cases like the one I mentioned above they could have been grouped into a single row. That's why I assume that. Anyway, whether I keep using this plugin or not, I'm happy to give you this feedback. Keep up the good work, mate!

cabrerahector commented 4 years ago

After some thought, there is a quick & dirty way to reduce the number of rows stored in the database while retaining mostly the same level of detail that the plugin currently provides.

As you probably know, when object caching is available and configured, WPP will store post/page views in memory (Redis, Memcached, etc) for a fixed period of time (right now that's 2+ minutes) and then write them to the database in a batch. Maybe increasing the pageviews cache's TTL (eg. 5 minutes) might help a bit? Here's the relevant code if you're willing to experiment.

Do note that increasing the TTL too much (what counts as "too much" largely depends on your server specs/resources, current traffic, etc, so be cautious with this) might cause other performance issues: if you set it too high you may cause your server to run out of memory (the pageviews cache would then be using too much RAM), or the INSERT/UPDATE query may become too long for your database server to parse and execute. So, again, be cautious.

Anyway, whether I keep using this plugin or not, I'm happy to give you this feedback. Keep up the good work, mate!

Thanks! I am truly grateful for the constructive feedback you have provided so far. It can only help make WPP better for everyone :)