Locks up WordPress when receiving Replies, Re-post, and/or Favorite

RobertWaterloo commented 11 months ago

Quick summary

When a post from WordPress is re-tooted, favorited, or replied to, WordPress locks up. cPanel indicates all memory is consumed (1 GB of 1 GB) and all I/O is consumed (10 MB/s of 10 MB/s).

Steps to reproduce

Activate ActivityPub plugin
Create new post
Reply, Re-post, or Favorite from Mastodon

What you expected to happen

WordPress should not lock up, and it should display the Reply in the post's comments if a Reply was sent.

What actually happened

WordPress locks up when a post is Replied to, Re-posted, or Favorited. cPanel indicates all memory is in use (1 GB of 1 GB) and all I/O is consumed (10 MB/s of 10 MB/s).

Remediation requires all plugins to be removed from the /wp-content/plugins folder, and the /index.php to be removed (or renamed). After a few seconds the index.php and plugins can be restored, and plugins re-activated from WordPress.

Impact

All

Available workarounds?

Yes, difficult to implement

Logs or notes

Site is https://radiowaterloo.ca hosted on a shared webhost.

Currently the ActivityPub plugin is deactivated, contact bob@radiowaterloo.ca if you want to troubleshoot on this server.

PHP version is 7.4

Other plugins activated:

add-login-text/
akismet/
amazon-s3-and-cloudfront/
authors-list/
classic-editor/
classic-widgets/
contact-form-7/
display-posts-shortcode/
insert-php-code-snippet/
lh-private-content-login/
login-logo/
maxbuttons/
one-user-avatar/
peters-login-redirect/
redirection/
simple-ajax-chat/
upload-max-file-size/

janboddez commented 10 months ago

Had something similar happen on a small VPS. Boosting one of my WP posts to my couple hundred Mastodon followers would almost immediately lead to 100% CPU usage and PHP-FPM hangups. Problem got less when I moved to 2 vCPUs. Also, configured PHP to (I think) restart itself when unresponsive for too long.

Ever since, I've been looking for a decent page/response caching plugin that will cache both HTML and JSON representations of the same post. I think Surge may be it, in combination with a custom cache config.

What will likely also help is optimizing the number of PHP workers and so on.

Before, but on a shared webhost, sometimes ActivityPub posts wouldn't even get sent out reliably, and I'd regularly see "max no. of database connections" type of warnings.

Long story short: you're likely hitting your server's limits. It could be resetting itself, too. Like, could be just waiting's enough to "fix" the "hangup."

mediaformat commented 10 months ago

One possibility would be to implement LD-signatures. When messages are signed this way, other servers accept them without looking up the post, which would help with the server load.

RobertWaterloo commented 10 months ago

> Long story short: you're likely hitting your server's limits. It could be resetting itself, too. Like, could be just waiting's enough to "fix" the "hangup."

Yup, I'm reaching that conclusion too. The ActivityPub plugin is currently deactivated, but we're still getting the occasional "Database unavailable" message. So I think we need to upgrade our hosting package.

--Bob.

janboddez commented 10 months ago

I don't think replies were ever an issue for me; I'd say the JSON API should be able to deal with the occasional incoming request. A beefier server might be the only solution there. Or running fewer/higher-quality plugins, etc. I don't think this is correct anymore. :-) A reply (or like) will probably also lead to all of [the comment author's] followers' servers fetching your post. Same effect as a boost. (Edit: Not sure about the like.)

Only thing, however, that helped my tiny server survive boosts (where multiple servers will query the typically uncached JSON representation of a post), was to ... add some form of caching.

One important aspect is to not serve JSON to regular site visitors, and not serve HTML to Masto instances. I've found the Surge plugin can be set up this way.

Other plugins, like WP Super Cache, may be caching HTML but not JSON. They help a bit---like, your server will likely still be able to serve "regular" pages while it's being "hammered" with application/activity+json requests---but not nearly as much.

But I was able to get the best performance by using NGINX's FastCGI or reverse proxy caching feature. Even shortish (1 minute or so) caching durations help survive boosts; a longer-lasting cache (effectively turning bits and pieces of a WordPress site into a static website) will be slightly more difficult to invalidate but improve your site's performance by ... a lot.
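For reference, a minimal sketch of what FastCGI "microcaching" in front of WordPress can look like. The zone name `WPCACHE`, the cache path and sizes, the server name, the PHP-FPM socket path, and the 1-minute TTL are all assumptions for illustration, not values taken from this thread.

```nginx
# http {} context: where cached responses are stored (path and sizes are assumptions)
fastcgi_cache_path /var/cache/nginx/wp levels=1:2 keys_zone=WPCACHE:10m
                   max_size=256m inactive=10m;

server {
    listen 80;
    server_name example.org;                     # hypothetical site
    root /var/www/html;
    index index.php;

    location / {
        try_files $uri $uri/ /index.php?$args;   # usual WordPress front controller
    }

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php/php-fpm.sock; # assumed socket path

        fastcgi_cache WPCACHE;
        # The Accept header is part of the key so HTML and JSON are stored separately
        fastcgi_cache_key "$scheme$request_method$host$request_uri $http_accept";
        fastcgi_cache_valid 200 1m;              # short TTL ("microcaching")
        fastcgi_cache_use_stale error timeout updating;
        add_header X-Cache $upstream_cache_status;  # HIT, MISS, EXPIRED, ... for debugging
    }
}
```

As written, this caches every successful PHP response, so a real configuration would also need to keep logged-in users and the admin area out of the cache.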

Configuring PHP to restart itself after a crash is probably still a good idea.

(Disclaimer: not a sysadmin/devops engineer or anything; just mentioning what seems to have worked for me.)

YourAutisticLife commented 8 months ago

I think I'm experiencing this bug. I'm using the Docker image called wordpress:latest for my sites. This one uses apache internally. I've noticed a TON of apache2 processes launched when the problem happens, and I've been able to trace them to the WordPress instance which had its post boosted.

pfefferle commented 8 months ago

The problem is that it is not really a "bug" but how ActivityPub works. So to "fix" this we have to report that back to improve the spec, I think.

YourAutisticLife commented 8 months ago

I don't get it at all. I'm running two WordPress instances, and one Mastodon instance, on the same Linode virtual machine. They all use ActivityPub. If it is ActivityPub that demands this result, then Mastodon is clearly doing something to avoid killing my CPU.

pfefferle commented 8 months ago

Oh, that's new to me! So your main public Mastodon server is on the same machine, and that is where your posts were boosted and liked?

pfefferle commented 8 months ago

Do you see any errors in the logs? Maybe it's a web server/PHP thing?

YourAutisticLife commented 8 months ago

Yes, it is all on the same machine. I've looked at the logs and so far this is the only thing that seems related to it, but I'm not even sure:

wordpress-wordpress-1  | [Fri Jan 05 20:06:08.045291 2024] [php:error] [pid 428] [client 2604:a880:400:d0::1f2f:3001:0] PHP Fatal error:  Uncaught Error: Call to undefined method WP_Error::get_url() in /var/www/html/wp-content/plugins/wordpress-activitypub-master/includes/transformer/class-post.php:124\nStack trace:\n#0 [internal function]: Activitypub\\Transformer\\Post->get_attributed_to()\n#1 /var/www/html/wp-content/plugins/wordpress-activitypub-master/includes/transformer/class-base.php(57): call_user_func(Array)\n#2 /var/www/html/wp-content/plugins/wordpress-activitypub-master/includes/transformer/class-post.php(56): Activitypub\\Transformer\\Base->to_object()\n#3 /var/www/html/wp-content/plugins/wordpress-activitypub-master/templates/post-json.php(5): Activitypub\\Transformer\\Post->to_object()\n#4 /var/www/html/wp-includes/template-loader.php(106): include('/var/www/html/w...')\n#5 /var/www/html/wp-blog-header.php(19): require_once('/var/www/html/w...')\n#6 /var/www/html/index.php(17): require('/var/www/html/w...')\n#7 {main}\n  thrown in /var/www/html/wp-content/plugins/wordpress-activitypub-master/includes/transformer/class-post.php on line 124

YourAutisticLife commented 8 months ago

The error I reported above might have to be fixed, but I think it is a red herring when it comes to figuring out what is happening here.

I've reproduced the problem on a test server. What I'm seeing is a bunch of requests coming from elsewhere. I don't think this can be avoided. I think the CPU is more likely to be hammered the more interest there is towards the article. If I'm boosting an article, I guess this will be a function of the hashtags used in the article, and the number of followers I have.

(For the record, I've posted smaller articles that did not have that many hashtags, and boosted them with another account, and these did not hammer the CPU.)

I've been looking at Mastodon's nginx configuration and what I'm seeing there is some serious caching of the requests, which may explain why Mastodon does not eat up the CPU.

janboddez commented 8 months ago

> I've been looking at Mastodon's nginx configuration and what I'm seeing there is some serious caching of the requests, which may explain why Mastodon does not eat up the CPU.

I saw those lines too, but when I browse around my Masto account, it seems to use Cache-Control headers (or similar) for pretty much everything. I don't think, given Mastodon's near real-time nature, that logged-in users see a lot of cached responses. And I don't think remote servers do either, given how fast boosts etc. propagate. (I mean, the config seems to suggest that responses get cached for 7 days. And the application doesn't really seem to contain logic to invalidate these, yet I'm not seeing outdated content anywhere.)

That said, I've been playing with both cache plugins and NGINX's FastCGI cache (which uses directives very similar to Mastodon's proxy_cache directives). (I have Surge, a page caching plugin, set up to cache responses for 1 hour, and that includes JSON responses, and then NGINX does the same but for like 1 minute.)

I have been experimenting with crawlers to warm the cache (bypassing NGINX's cache so as to warm up only the longer-lived "PHP" cache) but I'm not too sure what to think.

Either way, the idea was for requests to hit PHP as little as possible, at least during traffic spikes, and if they do, to mostly bypass WordPress. Surge is pretty good at cache invalidation, so that's why I trust it with a longer TTL. (NGINX cache invalidation is somewhat harder to get right, at least for the open-source version, and I mostly don't want to bother with it, hence the shorter TTLs.)

I should probably add that I'm in no way an expert on the matter!

I will, however, gladly share my server config file once it's cleaned up a bit. It would probably help to have others review it!

Maybe WordPress is just way slower than Mastodon? (I really wouldn't know.)

RobertWaterloo commented 8 months ago

@pfefferle wrote:

> The problem is that it is not really a "bug" but how ActivityPub works. So to "fix" this we have to report that back to improve the spec, I think.

Having analyzed my problem a bit further, I agree that it's not a bug in the WordPress ActivityPub plugin that needs fixing. And I suspect improving the ActivityPub spec is out of scope for this GitHub project.

So, this issue can be closed as "Not a bug"

--Bob.

RobertWaterloo commented 8 months ago

Not a bug in the WordPress ActivityPub plugin; a fix for the issue I experienced is out-of-scope.

YourAutisticLife commented 8 months ago

ActivityPub does not need fixing, as evidenced by the fact that a Mastodon instance of mine running on the same VM as WordPress + ActivityPub does not completely eat my CPU whenever something is boosted.

My proxy cache settings are now these:

        proxy_cache_key "$request_uri $http_accept";
        proxy_cache_valid 200 7d;
        proxy_cache_valid 410 24h;
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
        add_header X-Cache $upstream_cache_status;

The proxy_cache_key is critical, as nginx does not by default care about content negotiation when it comes to caching. The fact that I was using the default key explains this issue:

https://github.com/Automattic/wordpress-activitypub/issues/641
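One possible refinement, sketched under assumptions: keying on the raw `$http_accept` stores a separate copy for every distinct Accept string a client sends. A `map` can collapse the header into two buckets first. The variable name `$ap_format` is made up for this example, and the list of media types is an assumption about what Fediverse servers request.

```nginx
# http {} context: reduce the Accept header to "json" or "html"
map $http_accept $ap_format {
    default                        html;
    ~*application/activity\+json   json;
    ~*application/ld\+json         json;
}

# inside the server/location that proxies to WordPress
proxy_cache_key "$scheme$host$request_uri $ap_format";
```

If the plugin also answers other media types with JSON, those would need their own lines in the `map`, otherwise such requests would share the HTML bucket.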

I'm no longer getting the CPU spikes; however, I'm not sure that I haven't broken something else.

pfefferle commented 8 months ago

@YourAutisticLife awesome!

Would you mind doing a PR that adds your findings as an FAQ?

(If you're sure you haven't broken anything of course ☺️)

janboddez commented 8 months ago

FWIW, here's my NGINX config: https://gist.github.com/janboddez/9220f7e558c3f38c42cd3d535575e98b#file-nginx

Note that I use PHP-FPM rather than mod_php.

You'd basically need another plugin to purge the right URLs when you update anything, which is tricky, which is why I don't bother and instead have my NGINX cache expire quite fast. 7 days---unless you have cache invalidation figured out---seems like a very long time. Plus, if someone boosted a post older than that, there's still a chance that hundreds of requests ... hit your back end (I think; I have the equivalent of proxy_cache_use_stale updating; in my settings and it's still not 100%).

So, yes, my site (non-cached pages) still occasionally feels slowish during "stampedes," despite the cache lock and so on. (But not always, so who knows what else is going on. E.g., I'm also running Caddy in front of NGINX, for example ...)

Also, got to prevent showing cached pages to logged-in users, and prevent these pages from being cached. And so on.
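A minimal sketch of one common way to do this with the FastCGI cache, assuming WordPress's standard cookie name prefixes. The variable name `$skip_cache` is an arbitrary choice for this example.

```nginx
# http {} context: flag requests that carry a login, commenter, or post-password cookie
map $http_cookie $skip_cache {
    default                                                     0;
    "~*(wordpress_logged_in_|comment_author_|wp-postpass_)"     1;
}

# inside the PHP location that has fastcgi_cache enabled
fastcgi_cache_bypass $skip_cache;   # do not answer these requests from the cache
fastcgi_no_cache     $skip_cache;   # do not store their responses either
```

Both directives are needed: the first keeps cached pages away from logged-in users, the second keeps their personalized pages out of the cache. Paths such as /wp-admin/ and wp-login.php would likely deserve the same treatment.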

I also read you may want to ignore upstream Cache-Control and Expires headers, to not keep serving deleted pages (you'll want to return a 404 for those, or Delete activities may [?] stop working).
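A sketch of what that could look like for the FastCGI cache. The TTL values are assumptions chosen for illustration.

```nginx
# inside the PHP location that has fastcgi_cache enabled
fastcgi_ignore_headers Cache-Control Expires;  # cache lifetime is decided by nginx alone
fastcgi_cache_valid 200 1m;                    # normal responses: short TTL
fastcgi_cache_valid 404 410 30s;               # "gone" responses are cached briefly too
```

One trade-off to be aware of: with Cache-Control ignored, nginx may also store responses that WordPress marked as private or no-cache, so this is safest when requests from logged-in users are excluded from caching by other means.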

Anyway, I'd love to see others' configs.

YourAutisticLife commented 8 months ago

@pfefferle As I suspected, my solution is too eager to cache. I modified a page of mine and the modification did not show up in the actual page. I had to restart nginx with the old configuration. I tested modifying the page again and this time it worked.

@janboddez 7 days is a default. I suspect Mastodon uses caching headers to avoid having data stay in the cache that long.

janboddez commented 8 months ago

Yes, currently testing with TTLs of ~1 minute ("microcaching"), which should be enough for surviving a "stampede" while not serving stale content for too long. Even 30 seconds might suffice. Thinking it's OK for a request to hit PHP once in a while, as long as the large majority of requests get served from the cache.

(Note that I also try to preload Surge's page cache, and have set a much longer TTL for it. So that for certain pages, if NGINX's cache is due for an update, Surge will return the version it has cached. [Surge comes with a pretty good invalidation mechanism.])

I also updated my config to (micro)cache all of /wp-json/activitypub. Depending on the concurrency level, that seems to bring response times for, e.g., outboxes more or less in line. (Although I'm not sure when outboxes get polled at all.)

Should probably try to better capture what requests are made when a post gets boosted. And then play around with ApacheBench. Right now a lot of it is based off guesswork.

What I do see is that, if I unleash, e.g., ab -c 40 -n 10000 at an uncached (by NGINX) outbox, it takes ~seconds to respond, on average. (Not sure why Surge isn't doing a better job caching them!?)

But if I do have NGINX cache those URLs, 99% of requests only take 200 ms anymore, which is about as good as it gets. (The longest request still takes ~1.5 s, so I'm guessing that's one of the first ones, if not the first one. As mentioned, I don't understand this cache lock too well. ;-P)
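For context, a short annotated sketch of the cache lock directives as nginx documents them. The two 5s values shown are nginx's defaults, and whether this matches the config discussed here is an assumption.

```nginx
# inside the location that has fastcgi_cache enabled
fastcgi_cache_lock on;               # per cache key, only one request goes to PHP; the rest wait
fastcgi_cache_lock_timeout 5s;       # waiters that exceed this go to PHP, and their response is not cached
fastcgi_cache_lock_age 5s;           # if the filling request is still running after this, one more may be sent
fastcgi_cache_use_stale updating;    # while an expired entry is being refreshed, serve the old copy
fastcgi_cache_background_update on;  # refresh in a background subrequest (nginx 1.11.10 or later)
```

Under this model, a slow outlier is expected: the request that fills an empty cache entry still pays the full PHP render time, and requests queued behind the lock wait roughly as long before they are served from the fresh entry.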

YourAutisticLife commented 8 months ago

This morning, I switched my Docker image from wordpress:latest to wordpress:fpm. So I went from using the implementation that uses Apache as the HTTP handler inside the Docker image, to one that uses nginx together with php-fpm. Oh, and this switch was not particularly difficult.

Just doing this made a significant difference. I know I need to do more. I'll check out what @janboddez uses for the nginx configuration.

janboddez commented 7 months ago

> I also updated my config to (micro)cache all of /wp-json/activitypub.

This seems to have had an impact; I'm not really facing any issues right now (fingers crossed). One thing I've done as well is mess around with keepalive. I'm not sure if it was worth it, though!

Might start looking into rate limiting potentially malicious (non-Fediverse) actors and blocking some (non-Fediverse) bots I don't care about to further free up (some) resources. If you're on managed/shared hosting, your provider may already be doing all of that and more. (Problem I had there was that I was hitting their resource limits and upgrading wouldn't help.) I expect this to have a minimal impact, though, as browsing my site's backend is normally sufficiently fast.
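A rough sketch of how per-IP rate limiting and a user-agent block list can be expressed in NGINX. The zone name `perip`, the rate and burst numbers, the variable `$blocked_bot`, and the `examplebot` pattern are all placeholders invented for this example.

```nginx
# http {} context: track clients by IP, allow 5 requests/second on average
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

# http {} context: hypothetical list of user agents to refuse outright
map $http_user_agent $blocked_bot {
    default          0;
    ~*examplebot     1;   # placeholder pattern, not a recommendation
}

# inside server {}
if ($blocked_bot) { return 403; }
limit_req zone=perip burst=20 nodelay;   # absorb short bursts, reject the excess
limit_req_status 429;                    # answer rejected requests with "Too Many Requests"
```

Limits like these need care on a federated site: a busy Fediverse server can legitimately send many requests from a single IP after a boost, and a page with many static assets can trip a server-wide limit, so the numbers above would have to be tuned against real traffic.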

Either way, getting close!

YourAutisticLife commented 7 months ago

A few comments:

Switching to php-fpm made a big difference. With the Apache implementation, my site became unresponsive for easily one minute. I'd hit reload and the browser would give up. I also saw kswapd working like a madman and eating up my CPU in addition to all the apache2 processes. After the move, my site is a bit slow, but I can reload pages, and kswapd is there but not at the top in terms of CPU.
I'm uneasy with using nginx as a bandaid. Ultimately, the logic that produces the JSON should set the caching headers as needed. nginx cannot guess what the original intent was. I've written code that produced appropriate caching headers, but that was several years ago, and prior to getting chemo brain. (I'm fine. 3 years in remission. My autism is a bigger annoyance right now.) I guess Mastodon itself could be used as a reference, but I know that's a significant undertaking. (I think I have four PRs submitted to Mastodon's code base dealing with much simpler things than caching.)

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days.
