buildbarn / bb-storage

Storage daemon, capable of storing data for the Remote Execution protocol
Apache License 2.0

Scheduler unavailability should not impact cache operations #220

Open CaerusKaru opened 2 weeks ago

CaerusKaru commented 2 weeks ago

In the scenario where bb-storage frontend is pointing to both a remote cache and a bb-scheduler instance, if the scheduler suddenly goes down, the entire frontend instance essentially becomes crippled. However, cache actions should be totally unaffected by the scheduler's availability (as in the case where a customer passes --remote_cache but not --remote_executor).

Can we make unavailability of the scheduler a log in the console for cache API calls, while still returning an error for remote execution API calls?

EdSchouten commented 2 weeks ago

I suspect that what you’re seeing is that GetCapabilities() calls fail. Those need to merge properties returned by both the storage nodes and scheduler process. It’s also hard to cache/memoize these, as they depend on the credentials of the user.
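
To illustrate the merging behaviour described above, here is a minimal sketch in Python. It is not bb-storage's actual implementation: the `Capabilities` class is a simplified stand-in for the REv2 ServerCapabilities message, and `merge_capabilities`, `storage_provider`, `healthy_scheduler` and `down_scheduler` are hypothetical names. It only shows why a single unreachable backend makes the whole merged call fail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class BackendUnavailable(Exception):
    """Raised by a (hypothetical) backend that cannot be reached."""


@dataclass
class Capabilities:
    # Simplified stand-in for the REv2 ServerCapabilities message.
    cache: Optional[dict] = None
    execution: Optional[dict] = None


Provider = Callable[[str], Capabilities]


def merge_capabilities(providers: List[Provider], credentials: str) -> Capabilities:
    """Ask every backend for its capabilities and merge the answers.

    An exception from any provider propagates, so the merged call
    fails as soon as one backend is unreachable.
    """
    merged = Capabilities()
    for provider in providers:
        part = provider(credentials)
        if part.cache is not None:
            merged.cache = part.cache
        if part.execution is not None:
            merged.execution = part.execution
    return merged


def storage_provider(credentials: str) -> Capabilities:
    # The answer depends on the caller's credentials, which is why
    # the result is hard to cache or memoize.
    return Capabilities(cache={"update_enabled": credentials == "writer"})


def healthy_scheduler(credentials: str) -> Capabilities:
    return Capabilities(execution={"exec_enabled": True})


def down_scheduler(credentials: str) -> Capabilities:
    raise BackendUnavailable("scheduler unreachable")
```

With `[storage_provider, healthy_scheduler]` the merge returns both cache and execution capabilities. With `[storage_provider, down_scheduler]` it raises `BackendUnavailable`, even though the storage side answered successfully.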

CaerusKaru commented 2 weeks ago

Sure, but can we have the call return the equivalent of false (or an aborted merge) for everyone if the scheduler is down, instead of crashing?

EdSchouten commented 2 weeks ago

As in, announce that the cluster supports remote caching? No, because that would cause flakiness if people try to do builds that only use remote execution without local fallback.

CaerusKaru commented 2 weeks ago

If they don’t have local fallback enabled but they do have remote executor specified, wouldn’t the CLI simply error that the endpoint doesn’t support RBE and then fail the build?

EdSchouten commented 2 weeks ago

Exactly. And that’s bad, because under the current model it’s possible to set --remote_retries sufficiently high, causing Bazel to simply wait for the scheduler to come online and run the build to completion.

CaerusKaru commented 2 weeks ago

True, but we have to weigh that against the remote cache being completely inaccessible to everyone for that duration as a penalty. Maybe this should be a configuration option, then? Fail open with scheduler unavailability vs not?
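
A minimal sketch of what such an option could look like, in Python. Everything here is hypothetical: `fail_open`, `get_capabilities` and `SchedulerUnavailable` are illustrative names, not existing bb-storage configuration or API, and credentials are left out for brevity.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("frontend")


class SchedulerUnavailable(Exception):
    """Raised by the (hypothetical) scheduler client when it cannot connect."""


def get_capabilities(
    storage: Callable[[], dict],
    scheduler: Callable[[], dict],
    fail_open: bool,
) -> Dict[str, Optional[dict]]:
    """Merge storage and scheduler capabilities.

    With fail_open=False a scheduler outage fails the whole call.
    With fail_open=True the outage is only logged and the response
    announces caching without execution.
    """
    caps: Dict[str, Optional[dict]] = {"cache": storage(), "execution": None}
    try:
        caps["execution"] = scheduler()
    except SchedulerUnavailable as err:
        if not fail_open:
            raise
        logger.warning("scheduler unavailable, announcing cache only: %s", err)
    return caps


def storage_ok() -> dict:
    return {"update_enabled": True}


def scheduler_down() -> dict:
    raise SchedulerUnavailable("connection refused")
```

Calling `get_capabilities(storage_ok, scheduler_down, fail_open=True)` returns cache capabilities with `execution` set to `None`, while `fail_open=False` re-raises the error. The cost of failing open is the one raised earlier in this thread: a client that requires remote execution would see an endpoint without execution support and fail, rather than retrying until the scheduler returns.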

moroten commented 2 weeks ago

If we know the configuration of the scheduler, it should be possible to implement configuration of its capabilities straight in the frontend.

EdSchouten commented 2 weeks ago

The scheduler is such a simple process to operate, I don’t see the value in that to be honest. Just run health checking against it and make sure it gets launched elsewhere if your server fails.