coderedcorp / wagtail-cache

A simple page cache for Wagtail based on the Django cache middleware.
Other
87 stars 29 forks source link

Cache does not work if timeout is set to None #69

Open gwenya opened 4 months ago

gwenya commented 4 months ago

According to the Django documentation, a value of None for the cache timeout should mean that entries in the cache never expire (see https://docs.djangoproject.com/en/5.0/ref/settings/#std-setting-CACHES-TIMEOUT).

Using this appears desirable for wagtail cache, if we use cache invalidation, since there is no reason for a page to be re-rendered every hour/day/month/year/whatever.

However, wagtail cache will just not store anything into the cache if the cache timeout is None: https://github.com/coderedcorp/wagtail-cache/blob/79638e9a1c707f2d88154cd345200966e6a458e0/wagtailcache/cache.py#L299

There is also an exception in the admin on the cache settings page because the code for formatting the timeout in a human-readable way cannot deal with None-values:

https://github.com/coderedcorp/wagtail-cache/blob/79638e9a1c707f2d88154cd345200966e6a458e0/wagtailcache/templates/wagtailcache/index.html#L28

https://github.com/coderedcorp/wagtail-cache/blob/79638e9a1c707f2d88154cd345200966e6a458e0/wagtailcache/templatetags/wagtailcache_tags.py#L13

It is of course possible to just set the timeout to an arbitrarily high value, but supporting None seems nicer.

vsalvino commented 4 months ago

Thanks for pointing this out. Our implementation does deviate from the Django behavior.

Since wagtail-cache sets HTTP Expires and Cache-Control max-age headers, and those must be a date in the future, it is not possible to support indefinite expiry. For example, the server will tell the browser that this web page expires on a certain date and time in the future. Usually this is a few hours, or a few days. Of course, if you change anything on your website, the user will not see these changes until the Expires date/time. So it is usually not advisable to cache it too long. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires.

The Django cache is more of an internal cache, it is not doing a front-end HTTP cache based on that. So that's why they are able to support the None value.

I think we should update our documentation to indicate that this cannot be None, since that deviates from the Django behavior.

gwenya commented 4 months ago

Hmh on the one hand that makes sense, on the other hand it appears that wagtail-cache is doing two completely unrelated things and mixing them up a bit then. Ideally there would then be two separate timeout values, for example one might want to cache responses on the server indefinitely so that they don't need to be rerendered unless the data changes, but have much shorter caching on the client (or none at all) since the browser cache cannot be easily invalidated.

The only sensible way for pages to be cached on the client for my use case (and I think many others) is with etags. I will investigate whether it is feasible to do this with wagtail-cache.

vsalvino commented 4 months ago

Good idea about the separate settings. Looks like the Django cache middleware uses a separate setting for this: https://docs.djangoproject.com/en/5.0/ref/settings/#cache-middleware-seconds

Etags are also a great idea! On one hand, we could probably use the wagtail revision history for this. However, things like images, static files, etc. might make this more complicated (i.e. what if a page is not changed, but one of the images on the page is changed; what if page content is not changed, but the code/template is changed?).

gwenya commented 4 months ago

I don't think the revision history works for that, as you say there's the images but also page templates can include data from other pages so then you'd have to take those into account as well. Code/template changes on the other hand seem fairly easy to deal with, if the etags include some piece of information that changes on every deployment (e.g. some random number that gets generated via a manage.py command).

I feel like this is definitely a thing that cannot work out of the box and requires additional work for integrating, like user-provided functions for calculating the etag.

I am thinking the easiest way perhaps is for the cache to store the etag as a hash of the response next to the response, then when a If-None-Match request comes in it could check if there's a cache entry and if so compare the etag. The same could also be done with a timestamp for If-Modified-Since requests (using the timestamp of when the response is stored in the cache, so that when the cache is invalidated the browser will get a new version).

vsalvino commented 4 months ago

All good ideas. Yea it will end up being complicated to calculate etags. My philosophy is to set the cache TTL for like 24 hours. Media TTL for 7 days (usually served separately from wagtail). Even if you have 10,000 pages, they're only getting rendered once per day, and it's still a super-low server load.

If you come up with any good ideas for etags, feel free to post them here, or work up a concept pull request. I'd be happy to take a look.

In the mean time, we should either update the documentation about the timeout, or use a separate setting.

gwenya commented 4 months ago

Yeah I'm not really worried about server load, I might just be over-optimising a bit :D

A separate setting would really be ideal.

But I will look into the etags, because I'm really curious about how to make that work in a sensible way.

gwenya commented 4 months ago

I am noticing that my browser does not actually seem to cache the page responses, only resources. Presumably this is mainly relevant for caching proxies like varnish?

gwenya commented 4 months ago

Working on #71 has let me to realise that there is an issue with the Cache-Control and Expires headers.

Since these headers are set before adding the response to the cache, and then not set again after retrieving it, a response that is served from the cache will have an Expires header that does not match the max-age in the Cache-Control header.

Without #71, this means that when the cache timeout is set to e.g. 24 hours, then a request that comes in 23 hours after a matching response was cached will get that cached response with an Expires header that indicates that the response expires already one hour later, but with a Cache-Control header that indicates that the response is valid for another 24 hours. When a max-age in Cache-Control is present then clients should ignore the Expires header, so this will lead to the client caching the response for 24 hours. Therefore this response in the client is still considered valid 47 hours after it was initially cached, despite the cache timeout being set to 24 hours.

With the cache timeout and max-age being decoupled, this might be less of an issue since a lower max-age can be used, but the discrepancy between Expires and max-age remains.

Perhaps the FetchFromCacheMiddleware should adjust the Expires header on a retrieved response?

gwenya commented 4 months ago

Oh it seems that is what the Age header is for, so that is probably what should be set.