blogtutor / blog-tutor-support

Custom Support Plugin for NerdPress Clients
6 stars 3 forks source link

Improve Cloudflare Cache Clear Rate Limiting Logic #268

Open blogtutor opened 1 year ago

blogtutor commented 1 year ago

On extremely high traffic sites, particularly ones with multiple authors, rapid Cloudflare cache clears can cause issues. I'd like to find a graceful way to limit cache clearing so that servers don't get overwhelmed, but content still gets refreshed on the front end in a timely manner.

Right now we have a rudimentary procedure that will prevent a "Full" cache clear from happening more often than every 10 seconds.

https://github.com/blogtutor/blog-tutor-support/blob/755a3067f2b16f2d3dd94bde128da5905a5341ff/includes/class-support-cloudflare.php#L356-L365

Over the weekend, I had to increase that to 60 seconds on one site (during the NFL draft). I did this by editing the file directly on the server (slack thread here).

Here's what I'm thinking we could do instead, though I'm open to any smarter ideas!:

  1. Content is updated or published.
  2. If a full cache clear hasn't been triggered in the last 60 seconds, clear the cache immediately.
  3. If the full cache has been cleared in the last 60 seconds, schedule a cron to clear the cache 60 seconds from the last clear (or 60 seconds from the update).
  4. If we do this, we basically guarantee that the cache gets cleared right away if possible, and if not possible, it still gets cleared. And it won't ever clear more frequently than every 60 seconds.

We may also want to add a setting in the plugin to be able to override the default. (Also open to a different default than 60 seconds!)

Thoughts?

blogtutor commented 1 year ago

I should also add a little background:

Any time content is published or updated, we trigger a full CF cache clear. We could be more granular, clearing only specific pages where that content might appear (homepage, related category pages, the next/prev page, author archives) -- but that gets a little complicated and may not be worthwhile. (We also use WP Rocket on most sites, which clears in this manner... so other content should still be cached with Rocket, mitigating the impact of the full CF flush.)

Any time a comment is published/unpublished, we clear the CF cache for just that related post.

jchristopher commented 1 year ago

I should also add a little background:

Any time content is published or updated, we trigger a full CF cache clear. We could be more granular, clearing only specific pages where that content might appear (homepage, related category pages, the next/prev page, author archives) -- but that gets a little complicated and may not be worthwhile. (We also use WP Rocket on most sites, which clears in this manner... so other content should still be cached with Rocket, mitigating the impact of the full CF flush.)

Any time a comment is published/unpublished, we clear the CF cache for just that related post.

Yes this definitely gets tricky the more we think about it e.g. if there is a widget of sorts on N pages that includes links to other pages (i.e. related posts) we wouldn't be able to know where all of that was happening and would likely end up with a (near?) full cache purge anyway. Full cache purges kinda feels like a sledgehammer solution, but I don't see one any better at the moment.

That said, a debounce as described above by @blogtutor sounds like the best next step:

  1. Content is updated or published.
  2. If a full cache clear hasn't been triggered in the last 60 seconds, clear the cache immediately.
  3. If the full cache has been cleared in the last 60 seconds, schedule a cron to clear the cache 60 seconds from the last clear (or 60 seconds from the update).
  4. If we do this, we basically guarantee that the cache gets cleared right away if possible, and if not possible, it still gets cleared. And it won't ever clear more frequently than every 60 seconds.

I can work on a branch to explore this if you'd like?

blogtutor commented 1 year ago

Yes, please. 🙏

Just FYI, right now we're purging the cache by "Prefix"... so it's basically https://domain.com/(.*) for a full clear, or https://domain.com/post-name/(.*) for a single post clear (for addressing comments). For the latter, the ensures that any paginated comment pages are also flushed.

https://developers.cloudflare.com/api/operations/zone-purge

As you dig in, please take a look at the logic for comments. Right now it will clear the cache for that individual URL, but there's no rate limiting. We've seen some issues when people do lots of bulk comment changes (like marking 10+ comments as spam at once), so maybe we should address that at the same time?

blogtutor commented 1 year ago

Also please think about what "overrides" we might want to be able enable quickly on sites with extremely challenging circumstances -- in cases where it's more important to cache aggressively than to refresh content quite as quickly.

This is going to be useful on sites where there's insane traffic and they're rapidly publishing or editing content (and/or lots of comments are coming in).

Idea: Have a default of 60 seconds (or whatever) for the full cache clear rate limit, but then have an option in our plugin's settings to override that.

Idea: Have a toggle so that post/page updates don't triggering a full purge but instead purge just that URL (and maybe the homepage)?

(Sidenote, if we want to clear only the homepage we can't use Prefix clearing.)

jchristopher commented 1 year ago

For your review: https://github.com/blogtutor/blog-tutor-support/tree/268-cloudflare-cache-purge-debounce

In an effort to begin with baby steps, this branch implements a simple debouncing pattern to ensure that individual cache clearing calls happen at a minimum of 60s apart. Here's the heavy lifting for that:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L340-L378

Note that calls to purge_cloudflare_cache have been updated to call this method instead, routing those calls through this debouncing method.

Also note that the $prefixes act as identifiers for each call purge_cloudflare_cache i.e. we are treating calls to purge_cloudflare_cache that have different $prefixes as separate calls entirely meaning that we will be implementing n debouncing routines, each identified by the $prefixes.

Let's step through how that works:

First we receive the call to clear the cache, accepting $prefixes just as we do in purge_cloudflare_cache

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L398

We then set up our variables:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L347-L350

By serializing and then hashing the $prefixes we can generate a unique ID for this call, allowing us to have separate debounce handling for each call to clear the cache. This feels like a healthy balance between achieving the goal of throttling how many calls there are to purge the cache, but also allowing a degree of aggressiveness to it.

We check for a transient that has been stored (and only lives for $debounce_threshold seconds) and if there is no transient it means that we do not need to debounce and we proceed to pass along the $prefixes to purge_cloudflare_cache directly.

We then store a transient which contains a timestamp for the next time we want to allow purge_cloudflare_cache to run with these $prefixes, along with the $prefixes themselves:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L358-L363

If, on the other hand, the transient does exist, we need to retrieve our execution timestamp and the $prefixes needed for the call. We then schedule a WP_Cron event to execute at our desired timestamp:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L365-L376

Lastly, we implement the hook for that WP_Cron event when it fires:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L177

Which in turn has its own callback that relays $prefixes to purge_cloudflare_cache:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L380-L388

Note: the argument retrieval/passing is built into WP_Cron and we're taking advantage of it here which is why $prefixes exists, we passed it in here:

https://github.com/blogtutor/blog-tutor-support/blob/a4791ed73f14d3fa4c7a77cadb8df3886c91d0f2/includes/class-support-cloudflare.php#L371-L376

and WP_Cron considers passed arguments when handling its events. A bit of magic that we're taking full advantage of here. 👍


This is the most simple approach I could think of to begin exploring this idea of debouncing the cache clears. What do you guys think @ecotechie and @blogtutor? I haven't actually tested this code because I don't have the infrastructure set up to do so, but as we wade into these waters I figured that the smallest steps to take at first would be best!

blogtutor commented 1 year ago

All this logic sounds great. I just tried installing it on dev.blogtutor.com and hit a snag, though. I tried doing a full cache clear and that worked fine. The second one, a few seconds later, got a 500 response from admin-ajax.php.

image

Found this in debug.log:

[15-May-2023 22:42:50 UTC] PHP Fatal error:  Uncaught Error: Using $this when not in object context in /redacted/public_html/wp-content/plugins/blog-tutor-support-268-cloudflare-cache-purge-debounce/includes/class-support-cloudflare.php:373
Stack trace:
#0 /redacted/public_html/wp-content/plugins/blog-tutor-support-268-cloudflare-cache-purge-debounce/includes/class-support-cloudflare.php(497): NerdPress_Cloudflare_Client::debounce_purge_cloudflare_cache()
#1 /redacted/public_html/wp-includes/class-wp-hook.php(308): NerdPress_Cloudflare_Client::purge_cloudflare_full()
#2 /redacted/public_html/wp-includes/class-wp-hook.php(332): WP_Hook->apply_filters()
#3 /redacted/public_html/wp-includes/plugin.php(517): WP_Hook->do_action()
#4 /redacted/public_html/wp-admin/admin-ajax.php(188): do_action()
#5 {main}
  thrown in /redacted/public_html/wp-content/plugins/blog-tutor-support-268-cloudflare-cache-purge-debounce/includes/class-support-cloudflare.php on line 373

I also activated the "Debug Bar" plugin and it showed me this notice:

NOTICE: wp-content/plugins/blog-tutor-support-268-cloudflare-cache-purge-debounce/includes/admin-menu.php:204 - Undefined index: page
require_once('wp-admin/admin-header.php'), do_action('in_admin_header'), WP_Hook->do_action, WP_Hook->apply_filters, wp_admin_bar_render, do_action_ref_array('admin_bar_menu'), WP_Hook->do_action, WP_Hook->apply_filters, bt_custom_toolbar_links
blogtutor commented 1 year ago

One other thing to consider: We've had instances where there are two cache clear calls back-to-back (effectively at the same time). Most notably it was happening in two situations:

1) When hitting "Update" in the block editor; sometimes it would just trigger multiple cache clears, still don't know why (never happened in Classic)

2) When someone is approving (or unapproving) multiple comments on the same post using the bulk edit tool.

In those cases, it actually has been better to just hit the API with the request several times in rapid succession; it would be less performant (and unnecessary) to hit it once, and then hit it again 60 seconds later... So we should see if we can account for that somehow?

jchristopher commented 1 year ago

Whoops, sorry about that, fixed the context usage in https://github.com/blogtutor/blog-tutor-support/commit/79acd346ec08e3d0e11833c2467b7a593d08596a

I'll noodle on https://github.com/blogtutor/blog-tutor-support/issues/268#issuecomment-1548716968 a bit to see how we can elaborate on the conditions for the debouncing.