bird-house / birdhouse-deploy

Scripts and configurations to deploy the various birds and servers required for a full-fledged production platform
https://birdhouse-deploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Sync cache invalidation between Magpie/Twitcher #189

Open fmigneault opened 3 years ago

fmigneault commented 3 years ago

Description

Because permissions are applied on Magpie but are resolved for request access by Twitcher, any already-active caching of older requests will not be synchronized on the Twitcher side immediately after a permission update on the Magpie side (the caches are not shared, and are therefore not invalidated on permission update).

Subsequent requests to the same resource (within the cache expiration delay) will hit the cached response produced by the earlier access. The new permission resolution will not take effect until the cache expires. For example:

GET /twitcher/ows/proxy/thredds/<file-ref>   
    => denied, response is now cached
PUT /magpie/users|groups/{id}/resources/<file-ref-id>/permissions  
    => <file-ref> made 'allowed' for previous request user
GET /twitcher/ows/proxy/thredds/<file-ref>  (cached)  
    =>  denied instead of allowed
... wait ~20s (caching delay) ...
GET /twitcher/ows/proxy/thredds/<file-ref>  
    =>  allowed

Note that the effect goes both ways, i.e.: removing access to a resource will not actually block it until the delay is reached.
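The stale window above can be reproduced with a toy TTL cache. This is a hypothetical stand-in for beaker's in-memory backend, not the actual Twitcher code; the resource name, 20s TTL, and fake clock are illustrative:

```python
import time

class TTLCache:
    """Minimal per-worker TTL cache (hypothetical, for illustration only)."""
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (value, expiry)

    def get_or_compute(self, key, compute):
        hit = self.store.get(key)
        now = self.clock()
        if hit and hit[1] > now:
            return hit[0]                       # still cached, possibly stale
        value = compute()                       # resolve ACL and cache it
        self.store[key] = (value, now + self.ttl)
        return value

# Simulated Magpie permission store and a fake clock for the demo.
permissions = {"thredds/file.nc": "denied"}
now = [0.0]
cache = TTLCache(ttl=20, clock=lambda: now[0])
acl = lambda: permissions["thredds/file.nc"]

assert cache.get_or_compute("file.nc", acl) == "denied"   # 1st GET: denied, cached
permissions["thredds/file.nc"] = "allowed"                # PUT on Magpie side
assert cache.get_or_compute("file.nc", acl) == "denied"   # cached GET: still denied
now[0] = 21                                               # wait past the 20s TTL
assert cache.get_or_compute("file.nc", acl) == "allowed"  # cache expired: allowed
```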

Edit:

For the invalidation to take effect on Twitcher side, there are ~2 methods~ 3 methods:

  1. Explicitly set the cache-control: no-cache header on the next file access request to force a cache reset. This works, but should be done only on the first call after the permission update; otherwise all caching performance advantages are lost over many repeated accesses to the same resource.
  2. Share the Magpie/Twitcher caches via file references to allow them to invalidate each other.
    • To do this, we need to have a volume mounted by both images, and have both of them use cache.type = file + corresponding paths for cache.lock_dir and cache.data_dir in their INI configs.
    • More updates to Magpie/Twitcher will be needed to correctly invalidate caches of type file (only memory is tested, and they are not hashed the same way for selective invalidation - e.g.: invalidate only ACL for resource X / user Y, etc.).
  3. (best) Employ the redis or mongodb extension with beaker to synchronize caches. https://beaker.readthedocs.io/en/latest/modules/redis.html https://beaker.readthedocs.io/en/latest/modules/mongodb.html Not only would this allow syncing or invalidating caches across Magpie/Twitcher, but also between the individual workers of Magpie and Twitcher. At the moment, each worker holds its own in-memory cache depending on which requests it received, meaning cached/non-cached responses won't be the same (and won't expire at the same time) depending on which worker processes the request and when it received the last one.
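For option 3, the beaker side would roughly amount to pointing both Magpie's and Twitcher's INI configs at the same redis instance. A sketch only: the `redis://redis:6379/0` URL and 20s expiry are placeholder values, and the exact option names should be checked against the beaker docs linked above:

```ini
; Same values in both Magpie's and Twitcher's INI files, so that all
; workers of both services read and invalidate one shared cache.
; (Illustrative values - verify option names against the beaker docs.)
cache.type = ext:redis
cache.url = redis://redis:6379/0
cache.expire = 20
```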

References

dbyrns commented 3 years ago

For me, the cache does not need to be shared or synchronized. It is pretty common on the web to have to wait a couple of seconds or minutes before a new authorization takes effect. The only problematic scenario is running tests that don't take that into account. So for me solution #1 is the right one, with one caveat: if a service (A) itself calls another service (B), even when the no-cache header is set on the call to (A), there is no guarantee that (A) will set this header when calling (B). So the (B) request could still hit the stale cached value. Tests that update permissions therefore need to be carefully crafted so that no waterfall calls occur when setting the no-cache header, or they must account for the time required before the cache expires.

fmigneault commented 3 years ago

If service (A) calls a protected service (B), it should technically already forward at least the Set-Cookie header (or other cookie/auth containers, depending on the implementation) in order to act as the same user/identity; otherwise it assumes (B) is always public. It should also forward Cache-Control to be safe.
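That forwarding could be as simple as whitelisting those headers when (A) builds its request to (B). A minimal sketch: the header names come from the comment above, but the helper itself is hypothetical, not actual Twitcher/Magpie code:

```python
# Headers that (A) should propagate when calling (B) on behalf of a user:
# the auth cookie to keep the same identity, and Cache-Control so an
# explicit no-cache reaches (B) as well. (Hypothetical helper.)
FORWARDED = ("Cookie", "Set-Cookie", "Authorization", "Cache-Control")

def forward_headers(incoming: dict) -> dict:
    """Pick the forwardable subset of an incoming request's headers."""
    return {k: v for k, v in incoming.items() if k in FORWARDED}

headers = forward_headers({
    "Cookie": "auth_tkt=abc123",
    "Cache-Control": "no-cache",
    "Host": "twitcher",          # host-specific headers are not forwarded
})
assert headers == {"Cookie": "auth_tkt=abc123", "Cache-Control": "no-cache"}
```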

I don't think this is a very common or problematic case, though. Most probably, either the endpoint called on (B) by (A) is a static path, so its permission shouldn't change, or it is a "same resource" path (e.g. a matching WMS endpoint for a Thredds file), so permissions should be changed similarly at about the same time to reflect the desired access. The caches of both resources should also reset around the same time if the last cached request successively called (A)->(B).

I also find that changing a permission on (A) between secure and public only has an impact in that kind of "same resource" case, considering the "initial" permissions below:

| A | B | Effect |
|---|---|---|
| secure | secure | Both need a permission update to become public, with about the same delay until the cache reset takes effect if permissions are applied on both. Otherwise, bad permission setup. |
| public | secure | (B) was always blocked to begin with; it doesn't matter that (A) becomes secured, because a token was always required anyway. |
| secure | public | (B) is assumed public and remains public. It will be callable publicly via (A) after the cache reset. Making (A) public doesn't change access to (B), which could be called directly. |
| public | public | (B) is assumed public and remains public; only (A) becomes blocked after the cache reset. Technically, nothing wrong: (A) can still access (B) publicly. To do things correctly, the permission on (B) should also be changed if it should match the blocked access. |
dbyrns commented 3 years ago

It is indeed an edge case. I was just pointing out that one should keep it in mind when writing tests. The exact use case I had in mind, which you didn't mention, was more a WPS (as service A) trying to access data (Thredds as service B) or a shape (GeoServer as service B), or Weaver (as service A) calling another WPS (as service B). I would be surprised if our WPS forwarded the Cache-Control header.

But again, I assume that WPS tests will not include permission changes and will rather use public data/shapes. It is really just a reminder of how deeply the cache can trick us.

tlvu commented 3 years ago

> It is pretty common on the web to have to wait a couple of seconds or minutes before a new authorization takes effect.

I am with DavidB on this one. I think this is a very nice-to-have, but we do not have to resolve it right now.