cabrerahector / wordpress-popular-posts

WordPress Popular Posts - A highly customizable WordPress widget that displays your most popular posts.
https://wordpress.org/plugins/wordpress-popular-posts/
GNU General Public License v2.0
279 stars 83 forks

Low request per second rate for REST endpoint #289

Closed unnamedfeeling closed 3 years ago

unnamedfeeling commented 3 years ago

The problem: POST /wp-json/wordpress-popular-posts/v1/popular-posts holds up until ~35-40 requests per second, then stops working correctly. As the load continues, the TTFB grows until I start getting 5xx errors.

Description: When I load test the site, the endpoint mentioned above becomes very laggy. TTFB without any traffic is 600-900 ms; under relatively heavy load it climbs to 60000 ms. I have read the FAQ, especially the performance section, and implemented everything I could find there (see screenshots). I can confirm that the Redis cache works: there are get and set requests being sent to the cache backend. When I start testing, my containers peak at the maximum of 20 running constantly and the TTFB starts growing. Also there are many 5xx errors on the endpoint.

Stack description: The infrastructure is based on AWS and consists of scalable Docker containers (2-20, based on php:7.3-fpm-alpine), RDS as the MySQL provider, and ElastiCache as the Redis cache provider. All of this lives behind a load balancer and CloudFront. I have installed the W3 Total Cache plugin for static page, database, and object caching.

Question: What did I do wrong? How can I fix anything to get more performance?

cabrerahector commented 3 years ago

Hi @unnamedfeeling,

If you're already using the Pageviews Cache and the Data Caching option + page/database/object caching then there's nothing else the plugin can do to help with performance.

Also there are many 5xx errors on the endpoint.

What errors specifically?

unnamedfeeling commented 3 years ago

502 and 504 errors are present.

cabrerahector commented 3 years ago

Yeah, those error codes are usually server-related (e.g. timeouts), which makes sense considering that you're running stress tests on it.

unnamedfeeling commented 3 years ago

We have double-checked - this endpoint will not hold up even if we give it more MySQL and CPU power.

I have checked that endpoint code here - https://github.com/cabrerahector/wordpress-popular-posts/blob/master/src/Rest/ViewLoggerEndpoint.php#L36-L201

Here are some propositions:

  1. change the endpoint from POST to GET and give it a variable TTL (say 10-20 seconds, or a user-defined value) - in our case this would give true sampling, and for other users it would avoid counting the same view twice
  2. maybe rewrite this method to make a single write to MySQL, because even with caching it makes 2 × (number of views per 2 minutes) writes to the database - in my test case that is about 300-500 writes, and it locks PHP for a significant amount of time
  3. maybe defer writes to wp-cron, say every 15 minutes
unnamedfeeling commented 3 years ago

Some additions: If I am not mistaken, there are actually:

cabrerahector commented 3 years ago

Yes, if there's an object cache available then the plugin will first write views data to memory (Redis in your case) and will not write to the database until at least two minutes have passed since the last batch update.

Once two (or more) minutes have passed, WPP will fetch the view data from memory (Redis), do a batch insert into the database (to _popularpostsdata and _popularpostssummary), then reset its memory cache to start caching new views data for the next 2 minutes or so.
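The batching flow described above could be sketched roughly like this (a simplified illustration with made-up names, not the plugin's actual code):

```php
<?php
// Simplified sketch of the batching flow described above. Names are
// illustrative; this is not the plugin's actual code.
class ViewsBuffer
{
    private $cache = [];   // stands in for the object cache (Redis)
    private $lastFlush;
    private $interval;     // batch window in seconds (~120 in WPP)

    public function __construct($intervalSeconds = 120)
    {
        $this->interval  = $intervalSeconds;
        $this->lastFlush = time();
    }

    // Called on every tracked view: write to memory only.
    public function logView($postId)
    {
        $this->cache[$postId] = ($this->cache[$postId] ?? 0) + 1;

        // Once the window has elapsed, batch-write and reset the buffer.
        if (time() - $this->lastFlush >= $this->interval) {
            $this->flush();
        }
    }

    // Batch-insert buffered views into the database, then reset the buffer.
    public function flush()
    {
        foreach ($this->cache as $postId => $views) {
            // In the real plugin this is two queries, against
            // _popularpostsdata and _popularpostssummary.
            $this->writeToDatabase($postId, $views);
        }
        $this->cache     = [];
        $this->lastFlush = time();
    }

    protected function writeToDatabase($postId, $views)
    {
        // Placeholder for the actual $wpdb batch insert.
    }

    public function buffered()
    {
        return $this->cache;
    }
}
```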

Not sure what your point is though :P

cabrerahector commented 3 years ago

Oh, I missed the comment above. Sorry.

change the endpoint from POST to GET and give it a variable TTL (say 10-20 seconds, or a user-defined value) - in our case this would give true sampling, and for other users it would avoid counting the same view twice

Not going to happen. GET requests should be used to retrieve data from the server, not to create it.

maybe rewrite this method to make a single write to MySQL, because even with caching it makes 2 × (number of views per 2 minutes) writes to the database - in my test case that is about 300-500 writes, and it locks PHP for a significant amount of time

The code executes two queries, one to update the _popularpostsdata table and another to update the _popularpostssummary table (as explained above).

maybe defer writes to wp-cron, say every 15 minutes

This I could consider. What worries me, though, is that the longer the wait between updates, the longer the batch insert queries will be, and that can cause issues on some server setups (keep in mind that not everyone is using dedicated servers like you are) because there's a limit on the size of the query that the server can handle (see MySQL's Packet Too Large error). The current two-minute approach exists for this very reason.

I could add a filter hook to modify the wait time from 2 minutes to, for example, 15, but then you'd need to watch out for Packet Too Large errors.
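For illustration, such a filter might look like this (the hook name wpp_views_cache_time is hypothetical, not an existing plugin hook):

```php
// Hypothetical hook name - WPP does not currently expose this filter.
// Raise the batch window from 2 minutes to 15.
add_filter('wpp_views_cache_time', function ($seconds) {
    return 15 * MINUTE_IN_SECONDS; // 900 seconds; watch for Packet Too Large
});
```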

unnamedfeeling commented 3 years ago

Okay. Maybe then allow developers to override this method entirely (that's what I am considering now). Here's the plan:

  1. implement a new action like wpp_pre_handle_views_update_request that fires before the main method code - here https://github.com/cabrerahector/wordpress-popular-posts/blob/master/src/Rest/ViewLoggerEndpoint.php#L38
  2. add a filter to return the response early - if a developer overrides the method, they need to be able to disable the default behaviour
  3. maybe pass $request to wpp_pre_update_views

Personally, I plan to hook into the wpp_pre_update_views action and send a JSON response from there early (but this is a dirty hack).
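A minimal sketch of that dirty hack, assuming wpp_pre_update_views receives the post ID (the exact argument list should be checked against the plugin source; cache key and group names are made up here):

```php
// Hook the existing wpp_pre_update_views action, buffer the view in the
// object cache ourselves, and end the request early so the default
// database writes never run.
add_action('wpp_pre_update_views', function ($post_id) {
    // Increment a per-post counter in the object cache (Redis-backed).
    if (false === wp_cache_incr('buffered_views_' . $post_id, 1, 'my_wpp')) {
        wp_cache_set('buffered_views_' . $post_id, 1, 'my_wpp');
    }
    wp_send_json_success(); // sends a JSON response and exits the request
});
```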

cabrerahector commented 3 years ago

I'm a bit confused now. How would all that help? How do you plan to use it?

unnamedfeeling commented 3 years ago

My plan is to override the default method, write all data to the cache only, and then update the plugin's data from cron. WP-Cron is disabled for the user-facing part of my project, so I use plain cron to run WP-Cron and recurring tasks via wp-cli commands. I will test and come back with results. Hopefully this will resolve my problem.
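For reference, a minimal crontab entry along those lines (paths and schedule are examples; assumes wp-cli is available in the container and DISABLE_WP_CRON is set to true in wp-config.php):

```shell
# Run any due WP-Cron events every 5 minutes via wp-cli.
*/5 * * * * cd /var/www/html && wp cron event run --due-now >/dev/null 2>&1
```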

cabrerahector commented 3 years ago

Hmm, alright. I'll take these suggestions into consideration.

unnamedfeeling commented 3 years ago

Hello there! I hope we have fixed the problem. There were actually two fixes:

  1. I installed APCu in my Dockerfile and thought it was working, but it was not. Installing OPcache instead of APCu did the trick - we jumped from 5 to ~35 requests per second with 12 containers running simultaneously
  2. Then enlightenment came to me - I wrote some simple PHP code with direct access to the Redis cache over a persistent connection and configured nginx to route all POST requests for your endpoint to this file. Now just 2 containers can handle ~80 requests per second, and running 6-8 containers to handle 250 requests per second is acceptable for us.

Wrapping up: WordPress with plugins consumes too many resources for this. On my local machine, TTFB dropped from 300-500 ms to 20-60 ms. In the end, I propose supporting such a custom override officially - this approach gave me the best results.
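A minimal sketch of such a standalone handler, assuming the phpredis extension and that nginx routes the endpoint's POST requests to this file instead of WordPress (the parameter and key names are guesses, not the plugin's actual contract):

```php
<?php
// views-endpoint.php - logs a view without booting WordPress.
$redis = new Redis();
$redis->pconnect('127.0.0.1', 6379); // persistent connection, reused by php-fpm

$postId = (int) ($_POST['wpp_id'] ?? 0); // parameter name is an assumption
if ($postId > 0) {
    // Buffer the view in a Redis hash; a cron job later folds these
    // counters into the plugin's database tables.
    $redis->hIncrBy('wpp:buffered_views', (string) $postId, 1);
}

header('Content-Type: application/json');
echo json_encode(['success' => $postId > 0]);
```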

In our case the scenario is as follows:

  1. write view update to persistent cache
  2. use cron to collect data from cache and write it to database
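Step 2 could then be a cron-driven collector along these lines (again a sketch; the Redis key and the database writes are placeholders):

```php
<?php
// cron-collector.php - drain buffered counters and batch-write them.
$redis = new Redis();
$redis->pconnect('127.0.0.1', 6379);

// Read and clear the buffer atomically so no view is counted twice.
$results = $redis->multi()
                 ->hGetAll('wpp:buffered_views')
                 ->del('wpp:buffered_views')
                 ->exec();
$views = $results[0] ?: [];

foreach ($views as $postId => $count) {
    // Placeholder: in practice, update wp_popularpostsdata and
    // wp_popularpostssummary here (e.g. via $wpdb or a direct PDO query).
}
```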