Make --remote_local_fallback honor --spawn_strategy

ulrfa commented 2 years ago

Description of the feature request:

In case of remote execution issues, I expect --remote_local_fallback with --spawn_strategy=remote,worker,linux-sandbox to fallback to worker if available, and otherwise to linux-sandbox.

But instead I see fallback based on --remote_local_fallback_strategy, which is supposed to be a no-op according to documentation

I see two alternatives, preferably:

make --remote_local_fallback honor --spawn_strategy

or otherwise:

Update documentation for --remote_local_fallback_strategy about it being a no-op (which is incorrect) and remove the deprecation of it.

What underlying problem are you trying to solve with this feature?

I would like reliable builds with linux-sandbox, when remote execution cluster is temporarily unavailable.

I could use --remote_local_fallback_strategy despite it claiming to be a no-op, but that results in deprecation warnings.

Which operating system are you running Bazel on?

No response

What is the output of `bazel info release`?

No response

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

I use a custom build with local patches on top of 5.0.0-pre.20210929.1, but it seems this code has not changed on main branch.

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Have you found anything relevant by searching the web?

Based on description/comments from @agoulti and @philwo in https://github.com/bazelbuild/bazel/issues/7480, it seems the intention was that --remote_local_fallback_strategy should not be needed any more.

Then @ishikhman wrote a https://github.com/bazelbuild/bazel/issues/7480#issuecomment-528352938 that it is still needed for this use case, but ticket was closed without addressing that.

Any other information, logs, or outputs that you want to share?

No response

larsrc-google commented 2 years ago

I've been looking at the fallback situation for a bit internally, where there are some extra legacy flags. --remote_local_fallback has the very reasonable use case of handling certain remote failures. Getting rid of --remote_local_fallback_strategy would be good. As for what should be the fallback strategy, I think it should just go on to the next strategy in its list of possible strategies - just as if remote execution had not been allowed for the action.

Example (with made-up error messages):

bazel build --spawn_strategy=remote,worker,local --strategy=Foo=remote,local //foo:bar
...
INFO: Foo action failed remotely, falling back to `local` strategy.
INFO: Bar action failed remotely, falling back to `worker` strategy.
...

ulrfa commented 2 years ago

I've been looking at the fallback situation for a bit internally, where there are some extra legacy flags. --remote_local_fallback has the very reasonable use case of handling certain remote failures. Getting rid of --remote_local_fallback_strategy would be good. As for what should be the fallback strategy, I think it should just go on to the next strategy in its list of possible strategies - just as if remote execution had not been allowed for the action.

Example (with made-up error messages):
bazel build --spawn_strategy=remote,worker,local --strategy=Foo=remote,local //foo:bar
...
INFO: Foo action failed remotely, falling back to `local` strategy.
INFO: Bar action failed remotely, falling back to `worker` strategy.
...

Thanks for working on this @larsrc-google! I think your solution is good!

My only concern is the number of messages printed in terminal. Are you planning reporting each individual action? I'm afraid that is too verbose when remote system is down and all actions fallback. Or is it just me misunderstanding the example?

larsrc-google commented 2 years ago

No, those were just for illustration. In practice, they should go in the log only.

liyuyun-lyy commented 2 years ago

Agree. The description of the document is quite confusing. Actually, we don't need to specify the spawn_strategy because the default strategy is already --spawn_strategy=remote,worker,linux-sandbox. However, it has nothing to do with --remote_local_fallback, and it cannot replace --remote_local_fallback_strategy. I'm surprised to find out that default local_fallback_strategy is local, because most of the time, the default strategy for Bazel is sandboxed.

larsrc-google commented 1 year ago

@liyuyun-lyy Why don't you think the list of strategies (--spawn_strategy and what might be set by mnemonic or regexp) replace --remote_local_fallback_strategy? I could imagine --remote_local_fallback just meaning "If remote fails, try the next one in line" (for some definition of "remote").

liyuyun-lyy commented 1 year ago

@liyuyun-lyy Why don't you think the list of strategies (--spawn_strategy and what might be set by mnemonic or regexp) replace --remote_local_fallback_strategy? I could imagine --remote_local_fallback just meaning "If remote fails, try the next one in line" (for some definition of "remote").

I'll give three examples.

we only set --spawn_strategy=remote,worker,linux-sandbox. If remote fails, it will not fallback and end with failture.
we set both --spawn_strategy=remote,worker,linux-sandbox and --remote_local_fallback. If remote fails, the bazel runner with fallback to use local strategy.
we set both --spawn_strategy=remote,worker,linux-sandbox and --remote_local_fallback and --remote_local_fallback_strategy=sandboxed. If remote fails, the bazel runner with fallback to use sandboxed strategy.

So case2 might not work as you expected. To reproduce the examples, you can have a look at issue #16132.

larsrc-google commented 1 year ago

Ah, so you're saying that the current implementation doesn't help. But what if --remote_local_fallback was changed to mean "try the next strategy in the regular list of strategies"?

liyuyun-lyy commented 1 year ago

Ah, so you're saying that the current implementation doesn't help. But what if --remote_local_fallback was changed to mean "try the next strategy in the regular list of strategies"?

Yes, I think that will be better.

larsrc-google commented 1 year ago

Hmmm... that would take some rework, Right now exec() (src/main/java/com/google/devtools/build/lib/exec/SpawnStrategyResolver.java:43) just picks the first possible strategy and execs that, so we don't really know where we are in the list of possible strategies. The --remote_local_fallback handling is entirely separate (src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java:485).

larsrc-google commented 1 year ago

One thing we're missing is a way to signal "let someone else try" in the failure messages. This has also come up in dynamic execution and for workers.

jkeljo commented 9 months ago

Since --remote_local_fallback_strategy is not a no-op (setting it to linux-sandbox does in fact cause Bazel to fall back to that strategy instead of local), what do you all think about updating the docs and removing the @Deprecated tag while you're working on figuring out the next steps for spawn strategies as a whole?

bazelbuild / bazel