Automattic / wordpress-activitypub

ActivityPub for WordPress
https://wordpress.org/plugins/activitypub/
MIT License

Content negotiation broken with major caching plugins #783

Open avdi opened 2 weeks ago

avdi commented 2 weeks ago

Quick summary

Hello,

I'm looking for some insights. This may not be a bug with the plugin per se, but I have not found any workarounds so far, and I'm curious what others have done.

I'm hosting a site on CloudWays, which provides Varnish caching. I have tried three different recommended caching plugins, using the CloudWays-recommended settings:

  1. Breeze (CloudWays in-house caching plugin)
  2. W3 Total Cache
  3. WP Rocket

In every case, as long as the caching plugin is enabled, either the HTML version or the JSON version of a post gets locked into the cache and "wins", depending on which one was requested first. NONE of these caches seem to respect content negotiation. In fact, having skimmed through the code of these plugins, they seem to go out of their way to disable it by overriding the Vary header!

Unfortunately, with the fediverse being architected the way it is, effective caching is a must-have. With a thousand or so followers and without caching enabled, I see my 32 GB Vultr instance grind to a halt every time I post something, as I get inundated with feed requests.

I feel like surely someone else must have encountered this and come up with a solution.

Steps to reproduce

  1. Flush cache, with any of the above-listed cache plugins
  2. Request a post with Accept: application/activity+json
  3. See a cache miss and get a JSON response
  4. Request the same post with Accept: text/html
  5. See a cache hit and get JSON instead of HTML (but with Content-Type: text/html)
  6. Flush cache again
  7. Request a post with Accept: text/html
  8. See a cache miss and get HTML
  9. Request the same post with Accept: application/activity+json
  10. See a cache hit and get HTML instead of JSON
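
To make the failure mode concrete, here is a minimal Python sketch of a page cache that keys entries on the URL alone. It is a toy model, not the code of any of the plugins above: `render`, `UrlOnlyCache`, and the `/hello-world/` path are hypothetical names chosen for illustration, and the assumption that the affected caches ignore the Accept header is inferred from the steps above.

```python
# Toy model only: assumes the cache ignores the Accept header entirely.

def render(url, accept):
    """Stand-in for WordPress doing content negotiation on the origin."""
    if "application/activity+json" in accept:
        return "JSON for " + url
    return "HTML for " + url


class UrlOnlyCache:
    """Hypothetical page cache whose key is the URL and nothing else."""

    def __init__(self):
        self.store = {}

    def fetch(self, url, accept):
        # The lookup never consults the Accept header.
        if url in self.store:
            return "HIT", self.store[url]
        body = render(url, accept)
        self.store[url] = body
        return "MISS", body


cache = UrlOnlyCache()
first = cache.fetch("/hello-world/", "application/activity+json")
second = cache.fetch("/hello-world/", "text/html")
print(first)
print(second)
```

The second request is a hit and receives the JSON body even though it asked for HTML, which matches steps 2 through 5. Reversing the order of the two requests reproduces steps 7 through 10.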

What you expected to happen

Get separately cached versions for HTML and ActivityPub JSON

What actually happened

Get whatever content-type version of the resource happened to get cached first

Impact

All

Available workarounds?

No but the platform is still usable

Logs or notes

No response

avdi commented 2 weeks ago

It seems like this plugin adds ?activitypub and /activitypub rewrites as alternates for the Accept header. Is there any way to have it use those links everywhere in the generated feeds, instead of using the original post/author/comment permalinks? I think this would effectively route around the problem, since cache layers would then see all activitypub requests as being for distinct resources.

avdi commented 2 weeks ago

I'm also wondering if I can accomplish something similar in .htaccess. Has anyone else encountered this issue?

avdi commented 2 weeks ago

So, I guess this is essentially #580. I'm a little surprised that sending the Vary header isn't the default; you can't have content negotiation and caching without Vary: Accept. And you can't do fediverse without caching.

But as was discussed in #580, all the common WordPress internal page-caching plugins [bafflingly] don't honor Vary anyway. I'd really like to know who with a large follower count is using this in production ... and how??
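
For illustration, honoring Vary: Accept amounts to making the negotiated variant part of the cache key. The sketch below is an assumption about how a cache could do that, not how any existing plugin works; `variant` and `cache_key` are hypothetical helpers. It collapses the Accept header into two buckets rather than using the raw header string, since browsers send many different Accept values and keying on each one verbatim would fragment the cache.

```python
AP_TYPE = "application/activity+json"


def variant(accept):
    """Reduce an Accept header to one of two buckets (a simplification)."""
    return "activitypub" if AP_TYPE in accept.lower() else "html"


def cache_key(url, accept):
    """Cache key that separates the HTML and ActivityPub representations."""
    return (url, variant(accept))


html_key = cache_key("/hello-world/", "text/html,application/xhtml+xml")
ap_key = cache_key("/hello-world/", "application/activity+json")
print(html_key)
print(ap_key)
```

With a key like this, the two representations of the same post occupy separate cache entries, so neither can shadow the other.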

pfefferle commented 2 weeks ago

I am currently traveling... I will answer all your questions when I arrive at the WCEU ☺️

You can try WP Super Cache or Cachify in the meantime. They should both support content-negotiation.

avdi commented 2 weeks ago

Hey, thanks a lot for the reply! I've used Super Cache before, but of course it's the one I didn't try yesterday 😅 (Also, when did it become an official Automattic product?? Somehow I didn't realize that...)

I tried it (successfully), and I also took a look at Cachify, and it looks like they both work to the degree of simply not caching requests for non-HTML content. Which is definitely a step up, but still leaves me scratching my head over what to do when the flock of fediverse seagulls descends with all their feed requests.

Given that none of the caching plugins will actually cache AP content separately, I'm strongly considering going back to my original caching solution and putting in a mod_rewrite rule to redirect anything with an Accept header containing application/activity+json to the .../activitypub variant path. And then excluding that pattern from caching. Curious if anyone has had any success with this approach.

Anyway @pfefferle safe travels, and I'll look forward to your elaboration!
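
The redirect idea described above can be sketched as a small decision function. This is written in Python purely to show the logic and is not a mod_rewrite rule. It assumes the `?activitypub` query alternate mentioned earlier in the thread behaves as described there, and `redirect_target` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit

AP_TYPE = "application/activity+json"


def redirect_target(url, accept):
    """Return the URL to redirect to, or None to serve the request as-is."""
    if AP_TYPE not in accept.lower():
        return None  # ordinary HTML request, leave it alone
    parts = urlsplit(url)
    params = parts.query.split("&") if parts.query else []
    if "activitypub" in params:
        return None  # already the ActivityPub variant, avoid a redirect loop
    params.append("activitypub")
    return urlunsplit(parts._replace(query="&".join(params)))


print(redirect_target("https://example.com/hello-world/", AP_TYPE))
print(redirect_target("https://example.com/hello-world/", "text/html"))
```

Because the ActivityPub representation then lives at its own URL, a cache layer that ignores the Accept header sees it as a distinct resource and can either cache it separately or exclude it by pattern.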

mediaformat commented 2 weeks ago

@avdi for non-html requests you can use WP REST Cache plugin.

avdi commented 2 weeks ago

> @avdi for non-html requests you can use WP REST Cache plugin.

I did not know about this plugin, thank you!

janboddez commented 2 weeks ago

Just adding to the list of "supported" caching plugins: Surge can also be set up to separately deal with different Accept headers.

I don't think it'll cache REST API responses, though. Guess you might be able to use WP REST Cache for those. (But you'll still want to also cache ActivityPub [well, and HTML] responses for /author/<name> or whatever you use and individual post URLs!)

Saying this as someone who messed around with this a little while back, although I must admit that I have personally switched to "microcaching" using NGINX's fastcgi_cache, both for HTML and AP and (certain) REST API responses, and that it seems to work well enough even without any (page) caching plugins.
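
As a rough illustration of why "microcaching" helps with bursts, here is a minimal in-process sketch in Python. It is not NGINX's fastcgi_cache and makes no claim about how that module is implemented; `MicroCache`, its one-second default TTL, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class MicroCache:
    """Cache each response for a very short time to absorb request bursts."""

    def __init__(self, ttl_seconds=1.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, body)

    def get_or_render(self, key, render):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the backend
        body = render()
        self.store[key] = (now + self.ttl, body)
        return body


backend_calls = []


def render_post():
    backend_calls.append(1)  # count how often the backend is actually hit
    return '{"type": "Note"}'


micro = MicroCache(ttl_seconds=1.0)
for _ in range(1000):
    micro.get_or_render(("/hello-world/", "activitypub"), render_post)
print(len(backend_calls))
```

Even a TTL of about a second means that a burst of identical requests reaches the backend once per TTL window instead of once per request, while content is never stale for longer than the TTL.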

avdi commented 5 days ago

Just circling back here because I got 18 boosts on a post and now my server is once again 100% pegged as it gets thousands and thousands of fediverse requests for the same post all at once. (P.S. how are there THIS many Mastodon servers?!?!)

How are most people handling this? Are you using customized Nginx or Varnish configs downstream to cache AP content? I do have Nginx and Varnish, but my host controls the configuration, and given the headers AP responses are delivered with out of the box, the cache layers seem to ignore them.

I'd really love insight into how to scale AP with WordPress!

pfefferle commented 5 days ago

@avdi I added some more resources to: https://github.com/Automattic/wordpress-activitypub/wiki/Caching

The "I Stopped Mastodon DDoSing Me (I Think)" Article from @kevquirk is worth having a look!

avdi commented 5 days ago

> @avdi I added some more resources to: https://github.com/Automattic/wordpress-activitypub/wiki/Caching
>
> The "I Stopped Mastodon DDoSing Me (I Think)" Article from @kevquirk is worth having a look!

Thank you for the links! The links to the relevant PRs are really nice.

Two notes:

1) I got my hopes up, but the article from @kevquirk isn't relevant, unfortunately. It's about a non-Fediverse blog getting pummeled by requests for (HTML) pages when it was linked on a Mastodon account with many followers. So, it's a useful study in making regular HTML pages cacheable, but it's not applicable to WordPress serving AP with content-negotiation.
2) The Cache plugins listed are only patched so far as to not break ActivityPub content negotiation, but only by ignoring AP requests. So they won't address the thundering-herd-of-Mastodons problem.

I'm intrigued, however, by the last link, customizing Surge to serve different variants. This is the only in-WordPress solution I've seen so far.

pfefferle commented 5 days ago

> The Cache plugins listed are only patched so far as to not break ActivityPub content negotiation, but only by ignoring AP requests. So they won't address the thundering-herd-of-Mastodons problem.

Good point! I am currently experimenting a bit with Surge, let's see how that works out.