bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.2k stars 4.06k forks source link

Action conflicts due to differing `affected by starlark transition` value #14239

Closed fmeum closed 2 years ago

fmeum commented 2 years ago

Description of the problem / feature request:

Since 711c44ea79db14406972cc2204a43c589a2debce, transitions that lead to configs with differing affected by starlark transition value but otherwise identical configs lead to action conflicts.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Check out https://github.com/bazelbuild/rules_go/tree/v0.29.0.
  2. Run bazel build //... with Bazel at 2255ce4165f936f695111020fa664b259a875c4a, the parent of the breaking commit.
  3. Observe that the build passes.
  4. Run bazel build //... with Bazel at 711c44ea79db14406972cc2204a43c589a2debce.
  5. Observe that there are action conflicts such as:
ERROR: file 'external/com_google_protobuf/_objs/protoc_lib/java_primitive_field.o' is generated by these conflicting actions:
Label: @com_google_protobuf//:protoc_lib
RuleClass: cc_library rule
Configuration: 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c, 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
Mnemonic: CppCompile
Action key: 5d046af47458a54c0933bf1911899f17c727e1eec656b0b9728c7f3fbc5b0a22
Progress message: Compiling src/google/protobuf/compiler/java/java_primitive_field.cc
PrimaryInput: File:[/home/fhenneke/.cache/bazel/_bazel_fhenneke/da2ee2360ba14ac857bdb47dfbba9b7d/external/com_google_protobuf[source]]src/google/protobuf/compiler/java/java_primitive_field.cc
PrimaryOutput: File:[[<execution_root>]bazel-out/k8-opt-exec-2B5CBBC6-ST-10e8eac0bed8/bin]external/com_google_protobuf/_objs/protoc_lib/java_primitive_field.o
Owner information: ConfiguredTargetKey{label=@com_google_protobuf//:protoc_lib, config=BuildConfigurationKey[00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c]}, ConfiguredTargetKey{label=@com_google_protobuf//:protoc_lib, config=BuildConfigurationKey[4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372]}
MandatoryInputs: are equal
Outputs: are equal

The config diff:

$ bazel config 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
INFO: Invocation ID: eb06f0c4-ddc7-4bc6-8cc6-e8fd2e7e0e4e
INFO: Displaying diff between configs 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c and 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
Displaying diff between configs 00a60040d64dca849a5217221376411c93467de0d0a93ae55d94f0b96a6d631c and 4fe5f6ed614cebd8a30e459cb1e450bd49548bab2576c90c2155c97782853372
FragmentOptions com.google.devtools.build.lib.analysis.config.CoreOptions {
  affected by starlark transition: [//go/private:bootstrap_nogo], [//go/config:pure, //go/private:bootstrap_nogo]
}

What operating system are you running Bazel on?

Linux

What's the output of bazel info release?

development version

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

From 2255ce4165f936f695111020fa664b259a875c4a resp. 711c44ea79db14406972cc2204a43c589a2debce

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

https://github.com/bazelbuild/bazel
fad4933ea3c7cd757f32b0760ac268788b3b1d0a
711c44ea79db14406972cc2204a43c589a2debce

Have you found anything relevant by searching the web?

Related to the discussion at https://github.com/bazelbuild/bazel/issues/14023.

Any other information, logs, or outputs that you want to share?

@sdtwigg @gregestren

gregestren commented 2 years ago

Huh. I hope this is something small. I was thinking https://github.com/bazelbuild/bazel/commit/711c44ea79db14406972cc2204a43c589a2debce would be a pure improvement.

sdtwigg commented 2 years ago

This is strange, the ST-hashes should be different if affected-by-Starlark-transition is different

sdtwigg commented 2 years ago

I am working on this but out of curiosity, does anyone who encountered this have a smallest possible test case? The other example I have been debugging with involve cc_proto_library computation, which is also somewhat large.

sdtwigg commented 2 years ago

New theory: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L443 this null check is bad. (Short story: just because the Starlark value is at its default does not mean it should be excluded from ST hash addition; especially since the commandline may have explicitly set the Starlark value.)

I have a change that adjusts the hashing logic to do something more correct and this is seemingly fixing the internal test case I have. Will try to export something for you to test against your cases as well.

I do not have an ironclad explanation as to why this bug only surfaced now. My best guess is that the previous bug with transitionDirectoryNameFragment not being updated properly was actually masking this.

fmeum commented 2 years ago

Good catch! Removing the check makes all action conflicts go away.

I do think that the check is bad for a different reason though: Omitting a variable foo from the hash if it is null still leads to a hash value distinct from those for all other possible values of foo, so the command-line value of foo shouldn't matter here. But the hash also serves the purpose of encoding the set affectedByStarlarkTransition, which it fails to do with the check as keys might be missing, which exactly leads to the action conflicts we were seeing.

sdtwigg commented 2 years ago

Yes, in practice when the value is null I am going to have it print key@null (just in case a user has a String StarlarkOption with non-null value "null" resulting in key=null).

ulrfa commented 2 years ago

The intention with the check for null was to:

Example:

First building a library, with dependency graph A-B-C, and populating remote caches:

  A
 / \
B   C

Then building an application using the library A-B-C, but also E-F with a local transition for a starlark option unrelated to A-B-C.

              D   
             / \
            E   A
Transition-/   / \
          F   B   C

Without the check for null, I suspect there would be no remote cache hits for A-B-C when building D, due to different ST hash. Right?

Does that make sense? I have only experience of transitions near the top of the build, not deep in the dependency graph.

gregestren commented 2 years ago

FYI https://bazel-review.googlesource.com/c/bazel/+/180710.

gregestren commented 2 years ago

Thanks for the explanation. I'd expect remote caching to continue to work in that case.

The logic is subtle. Look at:

https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L425

https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L442-L445

When evaluating A, buildConfigOptions.affectedByStarlarkTransition contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A. This is documented at: https://github.com/bazelbuild/bazel/blob/21dfe4cdc35ed0b3536accdc91be042aa5c550aa/src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java#L265-L273

The flag is updated at https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L399

which only triggers on a transition. There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.

In the case of F it's not empty. So when evaluating F, the loop will trigger and read value from toOptions.getStarlarkOptions(). toOptions.getStarlarkOptions() contains the cumulative <name, value> pairs of Starlark options changed anywhere fro D to F. This is set at https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L383-L386.

Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the toOptions.getStarlarkOptions() map. But shouldn't it always be in the map, given how both toOptions.getStarlarkOptions() and buildConfigOptions.affectedByStarlarkTransition() are computed as described above? The other possibility is that it is in the map but legitimately has a null value - I don't know how it's possible to set a Starlark option to a null value, so I'm fuzzy on that.

So that's a mystery to me.

But to your point, whatever's going on it should all be a pure function of the dependency path from the top-level to the target in question. So since the path from D to A doesn't have a transition, that should have no bearing on the path from D to F. They should each be computed completely independently of each other.

Does that all make sense? Anyone see holes in my thinking?

gregestren commented 2 years ago

Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the toOptions.getStarlarkOptions() map. But shouldn't it always be in the map, given how both toOptions.getStarlarkOptions() and buildConfigOptions.affectedByStarlarkTransition() are computed as described above?

So that's a mystery to me.

@sdtwigg helpfully pointed out that transitions include a post-validation step that removes flags from toOptions.getStarlarkOptions() if their new values are the same as their flag default values: https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkTransition.java#L353

As far as I can see no such removal happens for affectedByStarlarkTransition(). Which would indeed make it possible that the loop over buildConfigOptions.affectedByStarlarkTransition() includes flags that don't appear in toOptions.getStarlarkOptions().

gregestren commented 2 years ago

One more comment:

If the configuration at the top-level (command line) has all flags at their defaults, then transition A sets //foo to <not its default>, then transition B sets //foo back to its default, this means the configuration after B is the same as the top-level.

In that case you'd want both configurations to not have an ST-<hash> at all. I can see that the null check prevents adding these settings to the hash. But even if the hash is empty (which the null check guarantees), you'd still get an ST-<hash> after B due to https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L482

That's an inefficiency. But that's an existing inefficiency. @sdtwigg 's ongoing work should remove that too.

ulrfa commented 2 years ago

Thanks for the explanation!

I currently don't have transitions setting options back to their defaults, but I'm looking forward to @sdtwigg's optimizations in that area.

When evaluating A, buildConfigOptions.affectedByStarlarkTransition contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A.

I got the impression that buildConfigOptions.affectedByStarlarkTransition is global for the whole build and can contain option names from transitions in E-F also when processing unrelated transitions in A-B-C. And that the null check prevented those names from E-F to affect ST hash in A-B-C. That is my concern with removing the null check. But perhaps I misunderstood or I'm missing something?

There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.

True, but let’s adapt the example and imagine there are also other unrelated transitions in A-B-C so that they will have hashes.

gregestren commented 2 years ago

buildConfigOptions.affectedByStarlarkTransition isn't global, so no worries there. The general principle is anything you set on node B propagates down B's subgraph, but has no effect on subgraphs B isn't part of.

ulrfa commented 2 years ago

I see, thanks @gregestren!