Extend the token cache lifetime

bbockelm commented 2 years ago

The public key cache lifetime controls how quickly we can revoke a signer key entirely; it also controls how long of an outage we can sustain.

Based on the discussions in the WLCG authorization working group, we believe 4 days is a better tradeoff than the original 1 day. This is still a finite time -- but allows token issuers to survive longer (e.g., weekend) outages and have the infrastructure continue to operate.

paulmillar commented 2 years ago

I'm fully in support of this pull request as-is.

I think the pull-request could be improved be updating the document to explain the rational of 4 days -- about surviving a server outage.

msalle commented 2 years ago

I'm not sure I understand how this is going to help survive an outage: the access tokens (hopefully) will not have a lifetime of 4 days (we currently set a 6 hour max.), so within those 4 days, tokens will need to be renewed several times in any case, requiring the issuer to be available.

msalle commented 2 years ago

To phrase it differently: I think the change under discussion includes a suggested 4d max for access tokens (rereading the notes from last Thursday). However, with JWT key cache time 4 days and access token lifetime max 4 days there is absolutely no way to revoke a credential within that period of time. For long-lived proxy certificates, we could at least revoke the EEC within 6 hours and with Argus/GUMS etc. even within a much shorter time. That would not be the case here, since everything is offline. I.e. the only way to stop stolen credentials is by all sites would manually blocking those sub claims. I don't think that is acceptable. And if we only raise the caching time, we're back to my previous point.

paulmillar commented 2 years ago

Just to mention that, by itself, extending the duration during which the public keys are cacheable carries some risk (longer is worse if the private key of the OP is compromised), however the specific change in this pull-request is (IMHO) relatively inoffensive.

However, you're right, the stated goal (allowing "the infrastructure continue to operate") only works if the AT validity is also increased.

As background, I understand we're (somewhat) caught between a rock and a hard place.

@bbockelm can, perhaps, provide a more authoritative statement; however, I believe the problem here is that OSG will drop support for X.509-based software on a timescale in which 24/7 support for IAM will not be ready. Current support level for IAM is something like "office hours".

Reliable authentication is a prerequisite during data taking.

So AuthZ-WG are looking how to have OAuth2 survive a possible weekend outage: a problem on Friday late afternoon that isn't fixed until Monday. Hence extending various lifetime durations.

During the meeting, I tried to argue against this ... pretty much for the reasons you described @msalle. Also, the risk is greater because the AT is used as a bearer token, unlike a proxy credential. Therefore our experiences with X.509 may not be the best guide.

Unfortunately, I felt like a lone voice, with others commending this as a "pragmatic solution".

Incidentally, the part about increasing the AT lifetime was also discussed. It was proposed as a "gentlemen's agreement" that we should just ignore the max duration of the AT in the profile.

I also objected to this. I felt that any changes to the document really should be documented, along with the rational. If we plan to use ATs with longer validity that 6 hours (already a tad on the long side, IMHO) then we write this down. I really don't like have unwritten verbal agreements that "we" all know: "yeah, that bit of the profile you just ignore".

We can always be faced with situations where short-term work-arounds are needed. However, we should take steps to make sure they really are short-term. The history of the grid is filled with short-term fixes that nobody can get rid off. i also fear that this change will be introduced and nobody will be able to get rid of it.

msalle commented 2 years ago

Hi @paulmillar I fully agree with you and should have been more active on the calls but simply don't have the time )-: I think the AT/caching proposal goes much further than what we ever had for proxies (for the reasons above: no CRL nor central suspension). I think we should certainly not put this in the WLCG document. At most it could be condoned with a very clear end-time to be agreed upon. I personally don't even like it for that, and think we should not, but I don't see why it should ever become part of this document.

paulmillar commented 2 years ago

I, too, have been remiss and have missed many of the AuthZ-WG meetings. It was by chance that I managed to make this particular meeting.

I'm perfectly happy with this not going in the profile document, but I think it should be written down somewhere, otherwise people have different recollections and therefore different expectations. Perhaps as an "implementation notes" document?

I absolutely agree with having a very clear end-time (as part of our "exit strategy"). This, too, should be documented so people actually enforce it.

msalle commented 2 years ago

I agree if approved it certainly needs to be documented, but not in this standards and best practice document. I'd say not even as an implementation notes. It would be basically a policy decision by WLCG, actually the GDB probably, to temporarily not follow the agreed best practices put forward in the WLCG AuthZ WG common JWT profile document.

DrDaveD commented 2 years ago

I much prefer @bbockelm's alternate suggestion in the meeting, which was to register the public key of a separate non-IAM issuer (which issues tokens directly on the pilot job submission node) in the web service that IAM will use for its public key. If that web service is not HA and 24/7 support, that could be a reason for increasing the lifetime of key caches, while still using short-lived access tokens.

DrDaveD commented 2 years ago

On the other hand, getting HA 24/7 support for a web service is a lot easier to do than it is for IAM, so maybe that could instead be provided in the short term.

jbasney commented 2 years ago

Elsewhere in the profile it says:

The token issuer SHOULD advertise the public key lifetime by setting the appropriate HTTP caching headers. The Client SHOULD use HTTP headers to avoid unnecessary downloads. The recommended lifetime of the public key cache is one day, but SHOULD be kept to less than 7 days. Client implementations SHOULD cache the public key for an authorization server for at least 1 hour, regardless of the server-provided value. Reducing the lifetime of a key will likely impact network traffic.

I think this pull request could be improved as follows:

Update the table and profile text to match: Is the recommended lifetime 6 hours or one day? Is the maximum 4 days or 7 days?
Clarify that the HTTP caching headers take precedence, so long as the lifetime is within the minimum and maximum values. Respecting the HTTP caching headers allows providers to adjust caching as needed (e.g., during key rollover periods).
Clarify cache update times (polling frequency) versus cache eviction times. I see that implementations are polling the JWKS endpoints every 10 minutes (which is too frequent according to the 1 hour minimum in the profile) but caching keys for longer periods (hours/days) in case the JWKS endpoints are offline. This distinction between the two time periods (polling for updates versus evicting stale items from the cache) needs to be clearly specified in the profile.
Clarify that we are caching key sets, not individual keys, so if the issuer removes a key from its key set (e.g., in case that individual key is compromised), that individual key is removed from the cache at the next update interval and does not stay in the cache waiting for the longer eviction period.

msalle commented 2 years ago

I agree with this, in particular point 3, which means we need two separate table rows. It should be made clear that the clients MUST be configured to refresh the keys with short update frequencies, and only use the caching time of 4 days in case the server cannot be reached. My worry is that clients will just refresh the keys very infrequently since it says it is the maximum.

maarten-litmaath commented 6 months ago

Hi all, at least as far as the LHC experiments are concerned, we will need to have further discussions in the AuthZ WG and DOMA BDT meetings, to decide what lifetimes are acceptable for which workflows: there is no one-size-fits-all here. For example, IMO it is perfectly acceptable to let central services use long-lived tokens that are not delegated anywhere. It also is OK for a token to be long-lived (typically up to a few days) if the potential damage from an abuse of that token is sufficiently small, which in turn depends on the exact scopes and audiences of the token. A new issue should be opened to capture the concerns about the current formulations in the profile as well as outcomes from those discussions.

For now, this PR provides a simple change that better reflects the current reality and I will thus merge it, thanks!

DrDaveD commented 6 months ago

Hi all, at least as far as the LHC experiments are concerned, we will need to have further discussions in the AuthZ WG and DOMA BDT meetings, to decide what lifetimes are acceptable for which workflows: there is no one-size-fits-all here. For example, IMO it is perfectly acceptable to let central services use long-lived tokens that are not delegated anywhere. It also is OK for a token to be long-lived (typically up to a few days) if the potential damage from an abuse of that token is sufficiently small, which in turn depends on the exact scopes and audiences of the token. A new issue should be opened to capture the concerns about the current formulations in the profile as well as outcomes from those discussions.

Martin, I don't understand why you're talking about token lifetimes here. This PR was about caching token issuer key info. The title of the PR may have misled you. I would change the title from "token cache lifetime" to "token issuer public key cache lifetime" but I am not given the option so I likely don't have the permisisons.

maarten-litmaath commented 6 months ago

Hi all, at least as far as the LHC experiments are concerned, we will need to have further discussions in the AuthZ WG and DOMA BDT meetings, to decide what lifetimes are acceptable for which workflows: there is no one-size-fits-all here. For example, IMO it is perfectly acceptable to let central services use long-lived tokens that are not delegated anywhere. It also is OK for a token to be long-lived (typically up to a few days) if the potential damage from an abuse of that token is sufficiently small, which in turn depends on the exact scopes and audiences of the token. A new issue should be opened to capture the concerns about the current formulations in the profile as well as outcomes from those discussions.

Martin, I don't understand why you're talking about token lifetimes here. This PR was about caching token issuer key info. The title of the PR may have misled you. I would change the title from "token cache lifetime" to "token issuer public key cache lifetime" but I am not given the option so I likely don't have the permisisons.

Hi Dave, there are several comments earlier in this PR that link the change proposed by the PR to the maximum lifetime of access tokens.

WLCG-AuthZ-WG / common-jwt-profile

Extend the token cache lifetime #17