WoltLab / WCF

WoltLab Suite Core (previously WoltLab Community Framework)
https://www.woltlab.com
GNU Lesser General Public License v2.1

RFC: Overhaul Caching: Eager and Async Rebuilds #5911

Open dtdesign opened 4 months ago

dtdesign commented 4 months ago

The current caching strategy has three major issues at runtime:

  1. Resetting a cache causes the next request that needs this cache to trigger a synchronous rebuild.
  2. Non-critical caches can sometimes be expensive to generate, causing dips in response times.
  3. Concurrent requests can perform the same rebuild simultaneously. This is especially bad for expensive caches (see 2.).

The cache resets described in (1.) are always the result of a change made by the current request. Such requests already have a higher response time than typical read-heavy requests because of the work they perform, for example creating a new message.

A prime example of the second issue (2.) is the statistics cache on the forum list that, among other things, counts the number of posts. This is an expensive query because InnoDB does not maintain an exact row count, so the number has to be computed on demand, causing significant query runtimes for forums with a large number of posts.

Issue (3.) is mostly about efficiency: every duplicate rebuild repeats work whose result is discarded.

Eager Cache Rebuilds

We can solve (1.) and (3.) by immediately triggering a rebuild of a cache whenever it is reset. As explained above, those requests are already comparatively slow due to the amount of work they have already done.

Generally speaking, users don’t expect those actions to be lightning fast, and adding 10, 20, 50 or even 100 ms of latency is barely noticeable. By contrast, adding 100 ms of latency to a page that usually loads in 50 ms makes a huge difference.

If the cache does not exist, we must fall back to a synchronous rebuild of the cache just like before. This should only happen for legacy cache implementations or hard cache resets.
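
The eager rebuild with its synchronous fallback could be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not WoltLab's API: `EagerCache`, `reset()`, `hard_reset()` and the `rebuilds` counter are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict


class EagerCache:
    """Hypothetical in-memory cache; all names are illustrative only."""

    def __init__(self, builder: Callable[[], Any]) -> None:
        self._builder = builder            # expected to read from the source of truth
        self._store: Dict[str, Any] = {}
        self.rebuilds = 0                  # demonstration counter, not part of the idea

    def _rebuild(self) -> Any:
        self.rebuilds += 1
        value = self._builder()
        self._store["data"] = value
        return value

    def reset(self) -> None:
        # Eager: the request that invalidates the cache also pays for the rebuild.
        self._rebuild()

    def hard_reset(self) -> None:
        # Hard reset: the data is dropped without a rebuild.
        self._store.clear()

    def get(self) -> Any:
        if "data" not in self._store:
            # Fallback: synchronous rebuild on access, just like before.
            return self._rebuild()
        return self._store["data"]
```

With this shape, a read that follows `reset()` never triggers a rebuild of its own. Only a first access or a `hard_reset()` takes the fallback path.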

“Pure” Cache Builder

An eager cache rebuild requires the implementation of the cache builder to be “pure”. This means that it MUST NOT rely on any (runtime) cache at any point because those could be stale. Any data must be fetched live from the source of truth, usually the database.
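
To illustrate the difference, here is a small sketch contrasting a “pure” builder with one that reads from a runtime cache. The `post` table, the `_runtime_cache` dictionary and both function names are assumptions made up for this example; SQLite stands in for the real database.

```python
import sqlite3
from typing import Dict

# Hypothetical runtime cache that may lag behind the database.
_runtime_cache: Dict[str, int] = {"posts": 2}


def build_stats_pure(db: sqlite3.Connection) -> Dict[str, int]:
    # Pure: every value comes live from the source of truth.
    (post_count,) = db.execute("SELECT COUNT(*) FROM post").fetchone()
    return {"posts": post_count}


def build_stats_impure() -> Dict[str, int]:
    # Not pure: relies on another cache, which may already be stale.
    return {"posts": _runtime_cache["posts"]}


def demo() -> Dict[str, Dict[str, int]]:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE post (id INTEGER PRIMARY KEY)")
    db.executemany("INSERT INTO post (id) VALUES (?)", [(1,), (2,)])
    # A new post is created; the runtime cache has not been updated yet.
    db.execute("INSERT INTO post (id) VALUES (3)")
    return {"pure": build_stats_pure(db), "impure": build_stats_impure()}
```

In `demo()`, the pure builder sees the newly created post, while the impure one repeats the outdated count from the runtime cache.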

Async Rebuilds

This is a special type of cache that differs in two ways from a regular cache:

  1. Loading a stale cache is always acceptable.
  2. A stale cache should trigger a unique background job to rebuild that cache.
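
The two properties above might look like this in a single-process sketch. `AsyncRebuildCache`, `mark_stale()` and `run_next_job()` are hypothetical names, and the in-memory job queue is an assumption. A real implementation would need an atomic way to keep the job unique across processes.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Set


class AsyncRebuildCache:
    """Hypothetical sketch: serve stale data, queue at most one rebuild job."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[], Any]] = {}
        self._values: Dict[str, Any] = {}
        self._stale: Set[str] = set()
        self._queued: Set[str] = set()
        self.jobs: Deque[str] = deque()    # stands in for a background queue

    def register(self, name: str, builder: Callable[[], Any]) -> None:
        self._builders[name] = builder
        self._values[name] = builder()     # initial synchronous build

    def mark_stale(self, name: str) -> None:
        self._stale.add(name)

    def get(self, name: str) -> Any:
        # A stale cache queues a rebuild, but only if none is pending.
        if name in self._stale and name not in self._queued:
            self._queued.add(name)
            self.jobs.append(name)
        # Loading a stale value is always acceptable.
        return self._values[name]

    def run_next_job(self) -> bool:
        # Called by the (hypothetical) background worker.
        if not self.jobs:
            return False
        name = self.jobs.popleft()
        self._values[name] = self._builders[name]()
        self._stale.discard(name)
        self._queued.discard(name)
        return True
```

Repeated reads of a stale cache return the old value immediately and leave exactly one job in the queue until the worker has run.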

We should consider a “refresh ahead” pattern for this type of cache that preemptively queues a cache rebuild when the cache is nearing the end of its lifetime. For example, a cache with a TTL of 5 minutes could be rebuilt 4 minutes after its creation, reducing the likelihood of it becoming stale.
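
A possible way to express the refresh-ahead decision, using the numbers from the example (TTL of 5 minutes, refresh after 4 minutes). The `CacheState` enum, the `classify()` function and its parameter names are hypothetical and only meant to make the thresholds concrete.

```python
from enum import Enum


class CacheState(Enum):
    FRESH = "fresh"                  # serve as-is
    REFRESH_AHEAD = "refresh_ahead"  # serve, and queue a background rebuild
    STALE = "stale"                  # TTL exceeded; still servable for this cache type


def classify(age_seconds: float,
             ttl_seconds: float = 300.0,
             refresh_after_seconds: float = 240.0) -> CacheState:
    """Decide what to do with a cache entry of the given age."""
    if age_seconds >= ttl_seconds:
        return CacheState.STALE
    if age_seconds >= refresh_after_seconds:
        return CacheState.REFRESH_AHEAD
    return CacheState.FRESH
```

An entry in the `REFRESH_AHEAD` window would be served normally while a unique rebuild job is queued, so in the common case the entry is replaced before it ever reaches `STALE`.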