Create a reliable notification & response system for 5xx errors

We use stale-while-revalidate only to avoid making the end user wait when a cache entry expires. For this purpose, all we need is for the delta between stale-while-revalidate and max-age to be large enough that at least one visitor hits any given page (on any given cache node) in between the two. Currently the delta is 1 minute (300 vs 360 seconds), which should be sufficient for pages with any significant traffic.
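To make the numbers above concrete, here is a minimal sketch of how such a header value could be assembled. It assumes the 300 is max-age and the 360 is stale-while-revalidate; the function and constant names are hypothetical and not taken from flask-base.

```python
# Sketch only: names are illustrative, not the flask-base implementation.
# Assumption: max-age is the 300 and stale-while-revalidate is the 360.
MAX_AGE_SECONDS = 300
STALE_WHILE_REVALIDATE_SECONDS = 360


def build_cache_control(max_age: int, stale_while_revalidate: int) -> str:
    """Return a Cache-Control value carrying both directives (in seconds)."""
    return f"max-age={max_age}, stale-while-revalidate={stale_while_revalidate}"


header_value = build_cache_control(MAX_AGE_SECONDS, STALE_WHILE_REVALIDATE_SECONDS)

# The "delta" discussed above: the gap between the two values.
delta_seconds = STALE_WHILE_REVALIDATE_SECONDS - MAX_AGE_SECONDS
```

With these assumed values, `delta_seconds` is 60, matching the 1 minute mentioned above.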

If you want to mask errors, the Cache-Control directive to use would be stale-if-error:

stale-if-error={seconds} Indicates the client will accept a stale response if the check for a fresh one fails. The seconds value indicates how long the client will accept the stale response after the initial expiration.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
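For illustration, this is roughly what adding the directive to an existing Cache-Control value might look like. It is a sketch under assumptions: the helper name is hypothetical and the 300-second value is an arbitrary example, not a proposed setting.

```python
# Sketch only: a hypothetical helper, not part of flask-base or any library.
def with_stale_if_error(cache_control: str, seconds: int) -> str:
    """Return the Cache-Control value with stale-if-error set to `seconds`.

    Any stale-if-error directive already present is replaced.
    """
    directives = [d.strip() for d in cache_control.split(",") if d.strip()]
    kept = [d for d in directives if not d.lower().startswith("stale-if-error")]
    kept.append(f"stale-if-error={seconds}")
    return ", ".join(kept)


# Example value only; the right duration would need to be decided separately.
masked = with_stale_if_error("max-age=300, stale-while-revalidate=360", 300)
```

The helper replaces rather than duplicates an existing stale-if-error directive, so applying it twice leaves a single directive in the header.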

I'm wary of doing this because I'm worried about errors being hidden from us, such that we don't notice them until some potentially more significant downstream effect emerges instead (e.g. a page returns a 500 because its operations are too slow or expensive, then those expensive calls compound and bring down the pods altogether).

If we had another robust mechanism for reliably bringing these errors to our attention, then I'd be more than happy to mask them for the end user. However, at the moment we don't have such a system. We have Sentry, but we get so many errors in there that we can't currently treat new errors coming in as an immediate priority, so it doesn't serve this purpose.

I think we'd need to investigate the best way to achieve a reliable notification system for important errors before we can add stale-if-error.

canonical / canonicalwebteam.flask-base #37