apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

[Object_store] min_ttl is too high for GKE tokens #6625

Open mwylde opened 3 hours ago

mwylde commented 3 hours ago

Describe the bug

When using object_store on a GKE pod with workload credentials, we see a huge volume of requests to the metadata endpoint to refresh the token (this appears in the log as a stream of "fetching token from metadata server" lines, within 1-2 ms of each other). This can overload the metadata service, preventing future work within the service.

This is caused by the implementation of the TokenCache:

https://github.com/apache/arrow-rs/blob/a9294d7b06ce230f738c8bef25a1fd9a3b3e095c/object_store/src/client/token.rs#L64-L76

The token cache is supposed to prevent multiple requests to fetch the token by reusing a cached token. However, if the token is close to expiry (within the min_ttl time) it will attempt to refresh it, then set the new token in the cache.

In this case, what's happening is that the GKE metadata service returns the same token for every call up until ~5 minutes before expiry, at which point it generates a new token with an expiry of 1 hour. But min_ttl is hard-coded to 5 minutes (300 seconds).

This creates the potential for a race condition: if a high volume of calls comes into the object_store at ~5 minutes until expiry, each of them may:

  1. Lock the mutex
  2. Observe the cached token is near expiry
  3. Get a new token (which is the same as the old token, with the same <5min expiry time)
  4. Save that in the cache
  5. Release the mutex lock

which is what we observe in our logs. If enough requests come in, one of them will overload the service, leaving the mutex locked and preventing any further use of the object_store. For reasons I don't quite understand, the requests never seem to time out, leaving the store stuck until we restart the service.

To Reproduce

Run a service (for example https://github.com/ArroyoSystems/arroyo) on GKE with a workload identity writing to GCS. Make a high volume of parallel requests to the object store, wait an hour, see that many requests are made to the metadata service.

Expected behavior

Only one request should be made to the metadata service.

Proposed solutions

A simple fix is to reduce the min_ttl for GCS to <= 4 minutes. However, I think it's dangerous to rely on the exact behavior of the token generation in a generic subsystem like the token cache. A better solution might be an asynchronous refresh process that's kicked off when the min_ttl is hit and runs (with appropriate backoff) until it successfully gets a token with expiry > min_ttl. This would also avoid the latency impact of fetching the token within the request itself.

tustvold commented 1 hour ago

Hmm... This is kind of unfortunate. We could make the min ttl configurable, but as you say that's not an ideal solution. Another option might be to throttle concurrent metadata requests, but again that's just moving the problem.

Can this behaviour of workload identity be configured, perhaps? It does seem pretty bizarre to me: it effectively means you can't reliably get a fresh credential, nor one that is valid for more than 5 minutes, which seems fairly limiting. Even if we did the fetching as an asynchronous job, we'd run into similar issues.

Perhaps there is some query parameter we could add to force it to generate a new token?

I'm going to change this to an enhancement, as it isn't really a bug, but an enhancement to workaround a limitation of some other component.

mwylde commented 1 hour ago

I might not have been clear. When you first request the token, it has a TTL of 1 hour. However, the token is cached locally by the metadata service until it has 5 minutes of time left at which point it will generate a new one:

Access tokens expire after a short period of time. The metadata server caches access tokens until they have 5 minutes of remaining time before they expire. If tokens are unable to be cached, requests that exceed 50 queries per second might be rate limited. Your applications must have a valid access token for their API calls to succeed.

So if you just reduce the min_ttl to under five minutes, it will probably be ok. There doesn't seem to be a way to force the metadata service to give you a new token (although I'm far from a GCP expert so there might be something I've missed in the docs).
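The arithmetic behind "under five minutes will probably be ok": the client refreshes once remaining time <= min_ttl, while the server only mints a new token once remaining time drops below its own ~5-minute window, so the margin between the two must be positive. A back-of-the-envelope check (the helper is hypothetical, not part of object_store):

```rust
// GKE metadata server serves a cached token until < 5 min remain.
const SERVER_REFRESH_SECS: i64 = 300;

// How long a near-expiry token the server will still hand out can
// satisfy the client's freshness check. <= 0 means the client can
// loop fetching the same token, i.e. the stampede described above.
fn refresh_margin_secs(client_min_ttl: i64) -> i64 {
    SERVER_REFRESH_SECS - client_min_ttl
}

fn main() {
    // The current hard-coded min_ttl of 300 s leaves no margin.
    assert_eq!(refresh_margin_secs(300), 0);
    // A <= 4-minute min_ttl leaves a 60 s safety margin.
    assert_eq!(refresh_margin_secs(240), 60);
    println!("margin at 240s: {}s", refresh_margin_secs(240));
}
```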

I do think this is a bug, because currently on GCP you can end up in a situation where object_store is stuck on a lock and unable to make progress, apparently indefinitely. And because the min_ttl isn't configurable from outside the library, there is no workaround except modifying the code.

I think the key thing that needs to be fixed is that the cache will happily overwhelm the metadata service with requests when the token being returned expires in <min_ttl.