Closed fmeum closed 2 years ago
Huh. I hope this is something small. I was thinking https://github.com/bazelbuild/bazel/commit/711c44ea79db14406972cc2204a43c589a2debce would be a pure improvement.
This is strange, the ST-hashes should be different if affected-by-Starlark-transition is different
I am working on this but out of curiosity, does anyone who encountered this have a smallest possible test case? The other example I have been debugging with involve cc_proto_library computation, which is also somewhat large.
New theory: https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L443 this null check is bad. (Short story: just because the Starlark value is at its default does not mean it should be excluded from ST hash addition; especially since the commandline may have explicitly set the Starlark value.)
I have a change that adjusts the hashing logic to do something more correct and this is seemingly fixing the internal test case I have. Will try to export something for you to test against your cases as well.
I do not have an ironclad explanation as to why this bug only surfaced now. My best guess is that the previous bug with transitionDirectoryNameFragment not being updated properly was actually masking this.
Good catch! Removing the check makes all action conflicts go away.
I do think that the check is bad for a different reason though: Omitting a variable foo
from the hash if it is null
still leads to a hash value distinct from those for all other possible values of foo
, so the command-line value of foo
shouldn't matter here. But the hash also serves the purpose of encoding the set affectedByStarlarkTransition
, which it fails to do with the check as keys might be missing, which exactly leads to the action conflicts we were seeing.
Yes, in practice when the value is null I am going to have it print key@null (just in case a user has a String StarlarkOption with non-null value "null" resulting in key=null).
The intention with the check for null was to:
toOptions.getStarlarkOptions().forEach((opt, value) -> toHash.put(opt.toString(), value));
First building a library, with dependency graph A-B-C, and populating remote caches:
A
/ \
B C
Then building an application using the library A-B-C, but also E-F with a local transition for a starlark option unrelated to A-B-C.
D
/ \
E A
Transition-/ / \
F B C
Without the check for null, I suspect there would be no remote cache hits for A-B-C when building D, due to different ST hash. Right?
Does that make sense? I have only experience of transitions near the top of the build, not deep in the dependency graph.
Thanks for the explanation. I'd expect remote caching to continue to work in that case.
The logic is subtle. Look at:
When evaluating A
, buildConfigOptions.affectedByStarlarkTransition
contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A. This is documented at: https://github.com/bazelbuild/bazel/blob/21dfe4cdc35ed0b3536accdc91be042aa5c550aa/src/main/java/com/google/devtools/build/lib/analysis/config/CoreOptions.java#L265-L273
The flag is updated at https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L399
which only triggers on a transition. There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.
In the case of F it's not empty. So when evaluating F, the loop will trigger and read value
from toOptions.getStarlarkOptions()
. toOptions.getStarlarkOptions()
contains the cumulative <name, value> pairs of Starlark options changed anywhere fro D to F. This is set at https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L383-L386.
Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the toOptions.getStarlarkOptions()
map. But shouldn't it always be in the map, given how both toOptions.getStarlarkOptions()
and buildConfigOptions.affectedByStarlarkTransition()
are computed as described above? The other possibility is that it is in the map but legitimately has a null value - I don't know how it's possible to set a Starlark option to a null value, so I'm fuzzy on that.
So that's a mystery to me.
But to your point, whatever's going on it should all be a pure function of the dependency path from the top-level to the target in question. So since the path from D to A doesn't have a transition, that should have no bearing on the path from D to F. They should each be computed completely independently of each other.
Does that all make sense? Anyone see holes in my thinking?
Here's the part where I confused myself. :/ There are two reasons value could be null. One is that it's not in the
toOptions.getStarlarkOptions()
map. But shouldn't it always be in the map, given how bothtoOptions.getStarlarkOptions()
andbuildConfigOptions.affectedByStarlarkTransition()
are computed as described above?So that's a mystery to me.
@sdtwigg helpfully pointed out that transitions include a post-validation step that removes flags from toOptions.getStarlarkOptions()
if their new values are the same as their flag default values: https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/StarlarkTransition.java#L353
As far as I can see no such removal happens for affectedByStarlarkTransition()
. Which would indeed make it possible that the loop over buildConfigOptions.affectedByStarlarkTransition()
includes flags that don't appear in toOptions.getStarlarkOptions()
.
One more comment:
If the configuration at the top-level (command line) has all flags at their defaults, then transition A sets //foo
to <not its default>
, then transition B sets //foo
back to its default, this means the configuration after B is the same as the top-level.
In that case you'd want both configurations to not have an ST-<hash>
at all. I can see that the null check prevents adding these settings to the hash. But even if the hash is empty (which the null check guarantees), you'd still get an ST-<hash>
after B due to https://github.com/bazelbuild/bazel/blob/a9206ebe275ce3de22f5c68b03ad58f6ae0b0cb2/src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java#L482
That's an inefficiency. But that's an existing inefficiency. @sdtwigg 's ongoing work should remove that too.
Thanks for the explanation!
I currently don't have transitions setting options back to their defaults, but I'm looking forward to @sdtwigg's optimizations in that area.
When evaluating
A
,buildConfigOptions.affectedByStarlarkTransition
contains the cumulative list of Starlark and native flags that were changed by any transition from D down to A.
I got the impression that buildConfigOptions.affectedByStarlarkTransition
is global for the whole build and can contain option names from transitions in E-F also when processing unrelated transitions in A-B-C. And that the null check prevented those names from E-F to affect ST hash in A-B-C. That is my concern with removing the null check. But perhaps I misunderstood or I'm missing something?
There are no transitions from D to A, so in the case of A this would be empty and we'd have no hash at all.
True, but let’s adapt the example and imagine there are also other unrelated transitions in A-B-C so that they will have hashes.
buildConfigOptions.affectedByStarlarkTransition
isn't global, so no worries there. The general principle is anything you set on node B propagates down B's subgraph, but has no effect on subgraphs B isn't part of.
I see, thanks @gregestren!
Description of the problem / feature request:
Since 711c44ea79db14406972cc2204a43c589a2debce, transitions that lead to configs with differing
affected by starlark transition
value but otherwise identical configs lead to action conflicts.Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
bazel build //...
with Bazel at 2255ce4165f936f695111020fa664b259a875c4a, the parent of the breaking commit.bazel build //...
with Bazel at 711c44ea79db14406972cc2204a43c589a2debce.The config diff:
What operating system are you running Bazel on?
Linux
What's the output of
bazel info release
?development version
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.From 2255ce4165f936f695111020fa664b259a875c4a resp. 711c44ea79db14406972cc2204a43c589a2debce
What's the output of
git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
?Have you found anything relevant by searching the web?
Related to the discussion at https://github.com/bazelbuild/bazel/issues/14023.
Any other information, logs, or outputs that you want to share?
@sdtwigg @gregestren