Closed torgil closed 5 months ago
I'm a bit lost in the conversation history here and at https://github.com/bazelbuild/bazel/pull/13587.
I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?
And what is the exact goal? All of @sdtwigg 's ongoing work should eliminate action conflicts more fundamentally. Is the primary concern of this issue action conflicts or duplicated action performance issues or both?
I'm a bit lost in the conversation history here and at #13587.
Note also that there were also other namespace aspects discussed in #13587, like the IDE/debug-issue, not covered in either of these issues. This means solving both of these issues is required but not necessarily sufficient to deprecate the "output directory name" transition. Edit: Different actions within a single rule may also have different namespace requirements.
I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?
Sure. Since D (or edge to D) needs to reset build-setting values to avoid above conflict it cannot depend on any target dependent on these build-settings (E above). The build-setting dependency can arise both through use of native functions like cc_common.compile and/or the use of custom build-settings. This means that the graph has to be rewritten such that E is not a dependency of D (for instance moved to B and C in the example above).
And what is the exact goal? All of @sdtwigg 's ongoing work should eliminate action conflicts more fundamentally. Is the primary concern of this issue action conflicts or duplicated action performance issues or both?
The goal with this is issue is to remove the need for current and upcoming workarounds (technical debt). Workarounds is chosen case by case depending on situation but is typically:
I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?
Sure. Since D (or edge to D) needs to reset build-setting values to avoid above conflict it cannot depend on any target dependent on these build-settings (E above). The build-setting dependency can arise both through use of native functions like cc_common.compile and/or the use of custom build-settings. This means that the graph has to be rewritten such that E is not a dependency of D (for instance moved to B and C in the example above).
That makes more sense, thank you.
So in the above example, the sole purpose of the build setting is for B/C to set it and for E to consume it? And you desire both of
d.out
produced by any action in D has the path bazel-out/<same prefix>/foo/d.out
. This is safe because that action doesn't depend on the build settinge.out
produced by E has path bazel-out/<prefix1>/foo/e.out
. For C->D->E, it's bazel-out/<prefix2>/foo/e.out
. This is because each version of e.out
has different content since it depends on the build setting.The goal with this is issue is to remove the need for current and upcoming workarounds (technical debt). Workarounds is chosen case by case depending on situation but is typically:
- duplicated actions due to different output paths, similar to case1 above
- need for edge transitions resetting build-settings to various dependencies
- non-optimal graphs like case3 discussed above. D might for instance be a subsystem with many actions and high duplication cost
I support this goal. But there's an intrinsic tension between efficiency (non-duplication, efficient graphs) and correctness (no action conflicts). We can't confidently know if D's actions depend on the build setting without having fine-grained insight into whether it's actions consume anything coming from E. The current Starlark APIs don't provide enough isolation to make that easy: D's rule implementation runs a bunch of Starlark that presumably includes E somehow (otherwise why would E be a dep?) and also creates a bunch of actions. If any of these are intertwined we get implicit dependencies that require D's outputs to be different.
If you really know for a fact this isn't a concern for your rules, we can consider some kind of special tagging. But that won't apply generally, and that puts more responsibility on you as rule designer to ensure these tags are correct and don't fall out of date.
There's also the stripped output path approach, which I've long been interested in: don't worry about these issues, but before sending the work to the executor strip all config paths completely. The net effect is actions won't be executed redundantly. But the build graph itself will still retain redundancies, so it's only a partial solution. Whether that's enough by itself of course depends on specific use cases.
The lens that I have found most clear for thinking about this is thinking about how to handle a single rule with multiple actions, some of which depend on the build configuration while others don't. For example, a (custom) C++ rule that wanted to compile only once but link in several different configurations for different targets. In this view, an ideal API would ctx.declare_output()
to specify configuration keys that should or should not be included in the output directory root for this specific file, lowering the concept of "output root" to the action level instead of the rule level. The same issue applies when a rule's analysis-time behavior depends on configuration but execution-time behavior does not.
@gregestren correctly points out that a naive implementation of this breaks correctness if the rule author misses some subtle way that the configuration value can affect the action parameters. At minimum, the rule user can always pass srcs = select(...)
where select is operating on build configuration that's not supposed to affect this action. As previously established in great depth, automatically de-conflicting when and only when this changes the output is very hard.
For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.
This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent). It still leaves the onus on rule authors to determine what files are "supposed" to depend on what configuration values, but that's acceptable for us.
So in the above example, the sole purpose of the build setting is for B/C to set it and for E to consume it? And you desire both of
- For both B->D and C->D any artifact
d.out
produced by any action in D has the pathbazel-out/<same prefix>/foo/d.out
. This is safe because that action doesn't depend on the build setting- For B->D->E, any artifact
e.out
produced by E has pathbazel-out/<prefix1>/foo/e.out
. For C->D->E, it'sbazel-out/<prefix2>/foo/e.out
. This is because each version ofe.out
has different content since it depends on the build setting.
Yes, but it can also be the case that E sends build-setting based information up to B/C to consume. It can (depending on use-case) also make sense for E to avoid name-collisions in the local file path, eg bazel-out/<prefix>/foo/e.out-suffix1
(see example above).
I support this goal. But there's an intrinsic tension between efficiency (non-duplication, efficient graphs) and correctness (no action conflicts). We can't confidently know if D's actions depend on the build setting without having fine-grained insight into whether it's actions consume anything coming from E. The current Starlark APIs don't provide enough isolation to make that easy: D's rule implementation runs a bunch of Starlark that presumably includes E somehow (otherwise why would E be a dep?) and also creates a bunch of actions. If any of these are intertwined we get implicit dependencies that require D's outputs to be different.
A common situation leading here is that variation tend to appear in lower level nodes (E). D may be library without need for variation itself but it calls functions that is in turn has configuration specific functionality depending on os, platform, hardware or other. B and C may want to control this environment/configuration through a build-setting. The object files and static library D produces are independent of the build-setting but when B or C link binaries they need to link against the correct (configuration dependent) libraries given by E.
@gregestren correctly points out that a naive implementation of this breaks correctness if the rule author misses some subtle way that the configuration value can affect the action parameters.
With "output directory name" transition (no hashes), this fails with action conflicts today.
For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.
Yes. This will produce an action conflict the rule author needs to take care of. Can you elaborate on how a user could break correctness with tricky select statements?
This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent).
In case3 above, D have two identical actions generating the same file. It worked on master as of Friday. The "output directory name" patch didn't add anything enforce this.
This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent). It still leaves the onus on rule authors to determine what files are "supposed" to depend on what configuration values, but that's acceptable for us.
Which change from 4.x to master do you mean? Removal of the output directory name
flag?
This is similar to action sharing, although my reading is you're saying something slightly different.
In case3 above, D have two identical actions generating the same file. It worked on master as of Friday. The "output directory name" patch didn't add anything enforce this.
I may be wrong about how this works, and apologies for taking so man words to explain this, but what I think you're referring to with case 3 is where all configuration settings on the rule are identical, in which case the analysis-time rule graph only has a single node, and I've been referring to that as a "single rule" and "single action". I'm instead trying to refer to a situation where multiple differently-configured rules produce actions with identical parameters, because this particular action in the rule doesn't depend on the configuration values that differ. This was possible to accomplish in 4.x using output directory name
, but I don't know of a way to accomplish it on master today.
My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis and 2. enforce that when it happens, the two actions involved must have identical parameters.
This is similar to action sharing, although my reading is you're saying something slightly different.
I hadn't heard of this before, but it seems like it might be exactly what I'm looking for here. Multiple differently-configured versions of a rule should be able to declare that a specific output file does not depend on the configuration, in which case the output root should be configuration-blind and any actions producing that file should be "shared" between the differently-configured rule instances and only executed once (with an error if the action key is different). This is different than Case 3 because the same rules that "share" this action might also be doing other things like declaring other actions that do depend on the configuration or passing data through providers from other rules that do depend on the configuration.
For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.
Yes. This will produce an action conflict the rule author needs to take care of. Can you elaborate on how a user could break correctness with tricky select statements?
In Bazel 4.x, we were using output directory name
to create code-generation rules that always put their outputs into bazel-out/unconfigured
no matter what configuration they were built in. This would result in two instances of the same rule producing the same output filepath, which worked fine as long as those outputs were in fact identical. I was using "tricky select statements" to refer to one way that (unexpectedly to the rule author) this could fail to be the case.
# This rule is written assuming that its output files will always be
# bit-for-bit identical regardless of the active configuration
generate_code(
srcs = select({"@platforms//os:linux": ["a.in"], ...})
)
I could have sworn that I saw this happen once without Bazel raising an error, and instead just picking one configuration of the rule to "win", although it's possible that even in 4.x there were safety guards against action conflicts like this and I'm misremembering. If those guards are in place, I think they should still be sufficient going forward to make sure that no two different-key actions in the same build produce the same file, which is the only way correctness issues arise.
I'm not aware of any way to produce this sort of conflict in 5.x prereleases today.
I'm instead trying to refer to a situation where multiple differently-configured rules produce actions with identical parameters, because this particular action in the rule doesn't depend on the configuration values that differ. This was possible to accomplish in 4.x using
output directory name
, but I don't know of a way to accomplish it on master today.
Alright. This is exactly what I'm trying to solve with output directory name
in combination with the solution to this issue which if I recall right also was an issue in 4.x with maybe the difference that the configuration hash also was included in the action hash.
My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis and 2. enforce that when it happens, the two actions involved must have identical parameters.
This should also be hit by above config hash issue. It would simplify deduplication of actions in cases where a rule creates both platform dependent and platform independent actions. The alternates I've considered are less appealing:
Is there another way to deal with this today?
I don't see this replace output directory name
transition as the naming do not propagate to dependencies or third party rules if you for instance are using different build-settings for different parts of the system and the different parts depend on the same third party codebase (like grpc or llvm) or creates artifacts using third-party rules (like rust or python).
I looked little further in the case 2 action conflict and noticed it failed because the compile action(s) for D2.c weren't shareable. The following two patches seems to fix case 2:
With these patchs, aquery says there is one compile action for D2.c and one link action for libD2.a. Building A2 seems to have a glitch that it shows multiple versions of the same file in the command line output:
$ bazel build //:A2
INFO: Analyzed target //:A2 (35 packages loaded, 303 targets configured).
INFO: Found 1 target...
Target //:A2 up-to-date:
bazel-out/any/bin/libE2_bs1_B.a
bazel-out/any/bin/libD2.a
bazel-out/any/bin/libE2_bs1_C.a
bazel-out/any/bin/libD2.a
INFO: Elapsed time: 0.513s, Critical Path: 0.09s
INFO: 13 processes: 7 internal, 6 linux-sandbox.
INFO: Build completed successfully, 13 total actions
Actions seems shared (building A1 has 13 processes, 8 linux-sandbox).
Patches are here (based on master today):
https://github.com/torgil/bazel/commits/cpp-actions-shareable
git fetch https://github.com/torgil/bazel.git refs/heads/cpp-actions-shareable
@gregestren Do we know why Cpp actions are not shareable?
Except for the glitch, are there any other problems with the above solution for case 2?
C++ actions aren't shareable because of include scanning: finding dependencies by reading #include <foo.h>
in source files. That creates a mutable dependency graph: some dependencies are added (or even removed) after action execution gets a better sense of these dependencies.
It's all annoyingly complicated but the main takeaway is it's harder to know at analysis time if actions are really the same when the above means an action may actually change later in the build.
We've talked about ways of loosening this restriction but it's a challenging issue.
@torgil
I don't see this replace output directory name transition as the naming do not propagate to dependencies or third party rules if you for instance are using different build-settings for different parts of the system and the different parts depend on the same third party codebase (like grpc or llvm) or creates artifacts using third-party rules (like rust or python).
For cases where the new directory name should propogate all the way through the dependency tree (third party code), what are your use cases that aren't solved by #14023 combined with using a transition so you can always depend on the k8-fastbuild
version of the third party dependency? We use that approach, and it also avoids running the analysis phase multiple times for those rules. In particular, for third party code we often don't want to unify some pieces of configuration (cpu/os) but do want to unify our starlark configuration.
I need to do more thinking about how to handle third-party rules like rust, maybe rules should have a default attribute to specify config keys to ignore in all actions/files they generate?
For cases where the new directory name should propogate all the way through the dependency tree (third party code), what are your use cases that aren't solved by #14023 combined with using a transition so you can always depend on the
k8-fastbuild
version of the third party dependency?
Yes. This is similar to what we do today but it's non-ideal:
We use that approach, and it also avoids running the analysis phase multiple times for those rules. In particular, for third party code we often don't want to unify some pieces of configuration (cpu/os) but do want to unify our starlark configuration.
Good point. In this case Bazel should have the knowledge to drop those build-settings in the third party target configuration without the need to explicitly reset them in a transition since they're not included in the third party code dependency tree.
C++ actions aren't shareable because of include scanning: finding dependencies by reading
#include <foo.h>
in source files. That creates a mutable dependency graph: some dependencies are added (or even removed) after action execution gets a better sense of these dependencies.
Could two actions that has identical action hash after the analyze phase end up as different actions in the execution phase?
It's all annoyingly complicated but the main takeaway is it's harder to know at analysis time if actions are really the same when the above means an action may actually change later in the build.
Is it a problem that two actions that are different in the analyze phase ends up the same after include scanning?
We've talked about ways of loosening this restriction but it's a challenging issue.
I've updated my branch above with a new core option --experimental_shareable_cpp_compile_actions
. Would something like this (along with output directory name transition) be a feasible solution?
@sdtwigg is working on the overall action conflict issue: let's wait for that to land and then re-address this.
Rebased output directory name revert can be found here: https://github.com/torgil/bazel/commits/output-directory-name-option Both directory name, cpp shareable and a few other (coverage, dep-files) patches here: https://github.com/torgil/bazel/commits/master
Edit: The patch from #13587 is still needed (included in master above) as the fix for #12731 somehow triggered changes in "affected by starlark transition" for different paths in some graphs. These changes ended up in the action-hash through ST-hashes appended to "internal/_middlemen" path-names which resulted in action conflicts.
@sdtwigg is working on the overall action conflict issue: let's wait for that to land and then re-address this.
I've now created #16844 on this issue since
Is it possible to solve this within the 6.0 cycle?
Edit: question/suggestion about initial configuration removed, need to rethink that a couple of iterations
Made a pull request on the --experimental_shareable_cpp_compile_actions flag (see above): #16863
My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis
@eric-skydio @gregestren What about an option to omit the "output directory name" completely and let rules resolve path conflicts at package level (eg bazel-out/external/... or bazel-out/allconfig/external/... if too many scripts break) ?
@torgil this is complicated and long-standing because the issue is fundamentally complicated. And it's not a single issue: it's a related collection of issues with varying levels of support and safety challenges. We don't want to make choices that compromise correctness. It's extremely easy to do that with some of these ideas. We're obligated to find a balance that doesn't make it easy for users to run builds with bad results (by "bad", I don't mean action conflicts, I mean builds that say they're successful but produce wrong output).
bazel-out/foo-bar-ST-<hash>/
is annoying. Many people here are discovering all the intricacies of that. But there's a reason, for all its imperfections, that foo-bar-ST-<hash>
exists. Changing output directory name
or making C++ actions shareable crudely removes basic safety checks for build correctness. I respect your own technical expertise, and that that's an effective solution for your builds for lack of better options. But these aren't safe patterns to promote to less experienced builders.
Maybe we could make shared C++ actions work. But we should really understand why they're not shareable today and how such a flag affects that. We haven't discussed in these threads, for example, how Skyframe (Bazel's core execution engine), might get confused and build the wrong targets because of that. That's the kind of problem we need some answer for to promote a path like that.
I realize this is directly pertinent to you. All theory aside, you want builds that work and are fast. And that's compromised right now. I'm totally on board with trying to help with that. And I'm open to more hacky solutions if they're practically helfpul for lack of better solutions now. But we really have to be cautious, clear, and methodical about exactly what problems we're solving with any solution and what side effects it has.
Examples of related but distinct problems:
A -> D -> E
and B -> D > E
. A
and B
set different transitions that only matter to D
. Yet E
still builds twiceA -> D -> E
and B -> D > E
. A
and B
set different transitions that only matter to E
. D
's actions don't care about E'
s output but D
s actions still build twice$ bazel build //:foo --define a=1; bazel build /:foo --define=2
: second build drops and reconstructs the entire build graph from scratchWe're looking at all of these and making at least some progress. To be fair, I don't think we've communicated that well: it's mostly been ad hoc conversations with whoever happens to be part of those conversations. And there's been lots of behind-the-scenes work that hasn't yet manifested into user-accessible Bazel features.
Here are some examples of progress:
--experimental_output_directory_naming_scheme=diff_against_baseline
should mostly solve https://github.com/bazelbuild/bazel/issues/14023. We're negotiating now about whether we can flip that for Bazel 6.0. Which is just a matter of release policy. The flip itself is technically trivial. This solves 1. for the vast majority of cases. --trim_test_configuration
from Bazel 5.0.It's also possible we can think of creative new ways to define transitions that might avoid some of the propagation issues you and others experience.
As stated by @eric-skydio and others upthread, non-C++ actions should be perfectly shareable as long as they're the same actions (I think there's one other class of non-shareable actions, but most are shareable). And for problems like non-functional actions, a better Starlark API for writing actions and containing their inputs could address that. That's likely the only practical approach toward that problem.
We had some great chats at this last BazelCon. As one followup I'd like to promote a clearer common view of the problems and how we imagine addressing them. I think we need a better interface than GitHub issues because they quickly get unfocused and out-of-date. And it's hard to reason precisely about all the challenges without really precise common understandings of the problems.
Oh, I forgot: https://github.com/bazelbuild/bazel/commit/7f51c8b374f6f6c5b2185e50baf8bb9aa9131fd3 could help with rules that really have no configuration needs. That's limited to solving these problems with more transitions, and would need a Starlark hook to be generally usable.
@gregestren Great job with the diff_against_baseline
work, and sorry for checking out of this conversation. Let me know if you have specific questions or want to brainstorm.
In testing on our repo with the 6.0 release candidates, it initially seems to work really well but exposed one other tiny bug that needs to get fixed for us to see all the benefits (#16911). Without that fixed, I can't find a way to unify the configuration of a generated code file that's depended on by both a executable tool and an output target.
Thanks for quickly tackling #16911 @fmeum.
@gregestren Thank you for your detailed response. I totally respect the complexity regarding these issues (and I have seen most of them), especially for the general case and I appreciate your patience towards a sound solution. I've wrapped my head into these issues numerous times and the problem domain seems to get more complex for each round (probably ~80% of the times I need to fire up jdb on bazel is due to these issues).
Of your enumerated issues I prefer to get action conflicts over silent code duplication (to later detect you have built a random tool multiple times with flags intended for target code testing). It forces you to learn about your build and deal with it accordingly. I understand that this cannot be forced on the "casual user" as a default behavior.
The "allconfig" suggestion above was meant to address the problem with different naming needs for different actions within a single target, not as a configuration transition (which you still need for other reasons, like third-party dependencies).
5 unnecessary (untrimmed) configuration: A -> D -> E and B -> D > E. A and B set different transitions that only matter to D. Yet E still builds twice
If the transitions are affecting a starlark rule build setting both A, B and D (but not E) will have a path to it in the dependency graph. Is it possible to use that information?
too fine path names: --experimental_output_directory_naming_scheme=diff_against_baseline
This is certainly a big step forward and it also removes all "affected by starlark transition" issues. Good job! I've yet to understand all the changes of the 7.0 branch as I'm going slow, bisecting behavior changes in the build one at a time to understand them in detail.
Oh, I forgot: https://github.com/bazelbuild/bazel/commit/7f51c8b374f6f6c5b2185e50baf8bb9aa9131fd3 could help with rules that really have no configuration needs.
Would this run into the same issues as the "host" config where transitions on dependencies silently gets dropped?
@gregestren Thank you for your detailed response. I totally respect the complexity regarding these issues (and I have seen most of them), especially for the general case and I appreciate your patience towards a sound solution. I've wrapped my head into these issues numerous times and the problem domain seems to get more complex for each round (probably ~80% of the times I need to fire up jdb on bazel is due to these issues).
Thanks for your patience!
Could something like ctexplain help?
The idea behind that is a tool you could run over any build that reports precise information on where transitions happen and where they're wasteful. Such a tool could help us quickly target a build's exact waste profile which could help us more precisely evaluate and propose fixes. Part of what makes these issues hard, IMO, is the impact of these issues really varies by build. So it's hard to know which approaches are best without concrete data.
I wrote a functional v1 of ctexplain a while back. But it only partially landed in @bazelbuild/bazel (more PRs need merging). I'd love to integrate it more deeply but haven't had the personal bandwidth to push this forward.
5 unnecessary (untrimmed) configuration: A -> D -> E and B -> D > E. A and B set different transitions that only matter to D. Yet E still builds twice
If the transitions are affecting a starlark rule build setting both A, B and D (but not E) will have a path to it in the dependency graph. Is it possible to use that information?
If I'm reading you correctly, that's extremely valuable information. Yes, we can work with that. The challenge is how big that subpath may be. If it's really short (one node to its direct dep), I can imagine some pretty effective solutions. If it's arbitrarily long, we run into the tension between performance (extra computation needed to analyze a big subgraph) vs. maintainability (users can manually tag these edges to avoid that computation, but that can be a big user burden across dependencies, edges, different projects, etc.).
My auto-trimming prototype I mentioned above is an attempt to automatically solve that when the computation cost is acceptable.
Oh, I forgot: 7f51c8b could help with rules that really have no configuration needs.
Would this run into the same issues as the "host" config where transitions on dependencies silently gets dropped?
Yes. The intention of this config is that once you're in it, that's it. It's terminal. No more transitions.
Could something like ctexplain help?
Yes. Interesting. I ran the analyze function below on the examples in the description using a Bazel from Dec 8 master branch.
$ action_keys () { bazel aquery --skyframe_state --output=jsonproto | jq '.actions[].actionKey'; }
$ action_count () { echo "Action count: $(action_keys | wc -l)"; }
$ unique_action_count () { echo "Unique action count: $(action_keys | sort -u | wc -l)"; }
$ analyze () { bazel clean 2>/dev/null; ctexplain -b "$1"; action_count; unique_action_count; }
ctexplain is a wrapper function for ctexplain.py to find it deps. It also needed a few source changes for my python version, eg the usual "make random python script work" stuff.
Output for case 1 (no transition):
$ analyze //:A1
Collecting configured targets for //:A1... done in 0.44 s.
Configurations: 6
Targets: 30
Configured targets: 67 (+123.3% vs. targets)
Targets with multiple configs: 20
Action count: 28
Unique action count: 27
The shared action above is a file write on "Dl.c".
Output for case 2 (output directory name transition):
$ analyze //:A2
Collecting configured targets for //:A2... done in 0.49 s.
Configurations: 8
Targets: 30
Configured targets: 77 (+156.7% vs. targets)
Targets with multiple configs: 20
Action count: 28
Unique action count: 22
If I'm reading you correctly, that's extremely valuable information. Yes, we can work with that.
cquery --show_config_fragments
show that //:bs_1
is a configuration for //:E1
and //:E2
while CppConfiguration
does not trickle down to it's dependencies (seen if changing rule for E1/E2 to "no_transition").
users can manually tag these edges to avoid that computation
Do you mean adding an edge transition to set the "baseline" value here?
more PRs need merging
Are these PRs available somewhere?
@gregestren I pushed some small updates I needed to run ctexplain here: #17109
I think the ctexplain followups are
Yes. The intention of this config is that once you're in it, that's it. It's terminal. No more transitions.
Just chiming in quickly to point out that this limitation prevents us from using this to solve our issues with duplicating code generation work. In particular, we have code generation actions that don't depend on build configuration and we'd like to avoid running many copies of. Our current workaround is to manually starlark-transition every config value that gets set anywhere in our build tree to a sane default value for the codegen targets, creating the nocpu-fastbuild
configuration that is unified. Obviously it would be nicer to not need to track all that manually.
Unfortunately, since the codegen targets depend on tools that need to be built in exec
configuration to work, we do need transitions to keep working at least somewhat underneath that layer.
@eric-skydio naively I'd say the codegen tools are configuration-dependent, since some of their inputs are configuration-dependent. So what you're saying is that the differences don't matter? i.e. if toolA
has two different content hashes, you have a contract guarantee that those differences don't mean toolA
will produce different results?
Sorry, just noticed this. None of the tools nor their outputs depend on the target configuration, but they do depend on their own exec configuration being set correctly. This isn't the exact use case, but as an example, imagine that we have a C++ executable emit_readme
that can only build with opt
, and a genrule which invokes that executable to create a README, where the contents of the readme are entirely configuration-independent.
To make this work, we put an incoming edge transition on emit_readme
that sets the build mode to opt
. In a normal build, this would make it impossible for emit_readme to compile in a non-opt configuration. However, if the genrule itself were set to a terminal configuration, that would be infectious to the emit_readme executable, and it would end up getting compiled for a configuration it was never designed to work in.
In general, there's lots of ways that a tool can require transitions to work correctly between it and its dependencies, at minimum any cfg="exec"
dependency won't work correctly if all transitions are silent no-ops.
Or to put it differently
graph LR
multi[":multi_target"] -- transition --> x86
multi[":multi_target"] -- transition --> arm
x86[":target (x86)"] -- srcs --> gen[":generated_code (arch-agnostic)"]
arm[":target (arm)"] -- srcs --> gen
gen -- tool --> tool[":tool (x86)"]
tool -- dep --> lib[":lib (x86)"]
It's important that :tool
and :lib
are built in an x86 configuration so that we can actually execute them. In general, they might depend on other scripts that need to be built in their own configurations as well.
Our current solution of explicitly transitioning to "nocpu"
for :generated_code
handles this fine, and I'm not seeing a compelling reason to switch to something else.
The thing we can't do currently (and this might require a separate issue) is handle cases where some actions in a rule depend on (a piece of) the configuration but others don't. This is particularly common for C++, where you might want to build an object file once then link it against several different versions of its dependencies depending on the build configuration, and there's potentially a lot of speed improvement to be had from deduplicating that compile action. Solving this is a hard problem, unfortunately.
@gregestren Should we consider this completed?
Sure. This has become a Frankenstein bug: covers lots of interesting themes but with lost focus.
Let's continue open-ended discussion on paths & efficiency on a https://github.com/bazelbuild/bazel/discussions. And keep specific actionables in focused new issues.
Apologies if this comment misses someone's ongoing actionable concern: please remind us.
As an aside, I'd still like to flip --experimental_platform_in_output_dir
. That helps with the opposite problem of legitimately shareable actions not sharing because of forked paths. I almost got it flipped in Google a few weeks ago. But we found unexplained extra remote executor load that we need to diagnose before properly landing.
This "Frankenstein bug" have come back to haunt me a another lap. Could you point me to relevant discussions?
@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?
I'm not too excited about the new path-mapping track as it adds complexity, is limited in scope and doesn't address the design issue.
@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?
Unless I'm misunderstanding this will increase complexity and risk of action conflicts because it shifts the burden to devs to do it right. How do we ensure two transitions that set distinct configs don't set the same "output directory name" that gets consumed by some common library which then crashes?
@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?
Unless I'm misunderstanding this will increase complexity and risk of action conflicts because it shifts the burden to devs to do it right. How do we ensure two transitions that set distinct configs don't set the same "output directory name" that gets consumed by some common library which then crashes?
Causal users will not use the "output directory name" feature, once you use it is because you have analyzed your build and got into the details, possibly after transitions and system level multi-configuration builds. I see/use action conflicts as a feature, they force you to investigate the "why" and go from there as it has either detected a configuration issue or a namespace issue.
Your example with the "common library" is a perfect (and common) example. If it's not detected by build system tests the "crash" here is an action conflict and that will lead to a resolution according to above, a new build system test and maybe the need for a transition on the edge to the common library.
The opposite side of the coin is to not detect these issues, like configuration leaking down to dependencies it shouldn't or unnecessarily duplicating builds multiple times. This can even go undetected unless you hit a cache or disc space issue. It's non-trivial for a developer in a big project to even know about these consequences when adding a build-setting, transition or command line option.
I see/use action conflicts as a feature,
I sympathize with this idea: that action conflicts can be a meaningful signal and not an obtuse failure.
I also sympathize that there's a tradeoff: auto-avoiding conflicts has its own efficiency consequences that are hard to trace. Not just efficiency but even correctness if a build just fails because of an OOM.
My biggest concern about this approach is connecting these failures with the people who are best informed and qualified to address them. This can get really hard when building projects with external dependencies that cross team and org boundaries. We've seen this sort of thing a lot at Google: even qualified language maintainers struggled to understand where conflicts happen or what to do about them when random project XYZ fails. Reducing the possibility of action conflicts seems to have helped with that repo.
Of course not all repos are the same. I'm genuinely curious how your perspective differs. And we of course have to be careful setting precedents that would apply to other orgs & repos with their own distributions of expertise.
even qualified language maintainers struggled to understand where conflicts happen or what to do about them when random project XYZ fails
I sympathize with this. I've thought long and hard on these issues and put considerable time on tooling and integration support. My experience is that these issues have similar complexity in root cause analysis and resolution as the issues on the "other side of the coin" as described above. Going with- or without- transitions for various things was for me once even an unstable state during a build-system migration. There was no one solution fits all either.
Of course not all repos are the same. I'm genuinely curious how your perspective differs. And we of course have to be careful setting precedents that would apply to other orgs & repos with their own distributions of expertise.
For illustration, let's scale up issue in the problem description with subsystems instead of targets to "bazel build //car" level which I recently saw in a presentation at a fair. The build system details were not presented but lets take it all the way with one build tree for everything. A developer in this scenario is tuning a new ignition control algorithm and wants to change a few lines of code and do: $ bazel run @car//:test_in_car --@engine//:ignition_control=@ignition//ignition_algorithms:new
With this perspective it's easy to see the consequences if the developer adds a "-c dbg" or "--features=sanitize" on the command line. It will trickle down to everything, including dashboard software and if lucky the build fails on a flash size check somewhere.
If we consider all transitions needed for different hardware- and configuration- setups throughout this system we can also imagine what the ST-hashes in the output directory name is doing for code duplication on a common dependency with only a few configurations of it's own. I've seen tools and common libs being generated and built in multiple ST- and exec- output folders on a much smaller system. What if the same developer above makes a test with a transition that sweeps a build setting value affecting a scheduler parameter N times?
As it might be hard to resolve an action conflict for a linking step in athe ignition control software above it can be equally hard to detect the consequences of an inefficient namespace and even harder to resolve it as it may require patches in the end.
Description of the problem / feature request:
Bazel errors out if the same action (with identical action hash) is created with different configuration hash. Configuration hash should be irrelevant when considering if two actions are identical.
The graph in the minimal example below shows the issue where B and C sets different values of a build setting that controls the output of E: A -> B -> D -> E A -> C -> D -> E
Case 1: Using bazel with current ST-folders on above graph yields duplicated actions for target D:
Case 2: To avoid duplicated actions in D, we can control the "output directory name" through transitions and thus remove the ST-hash. E avoids action conflicts by consider the build setting value in it's output path. This produces the the error in the topic.
Note that his works for bazel 4.x but this functionality was dropped in af0e20f9d9808d8fba03b712491ad8851b3d5c26. Later versions need a revert of that commit to work.
Case3: Today we need, as a workaround, have the same value of the conflicting build setting in all paths to D. In this example we reset the value to default value at the input of D. This has the unwanted consequence that E can no longer be a dependency of D.
Both versions of E are gone and both B and C links to an unintended default version. To workaround 3, we have to rearrange the dependency graph in a non-optimal way, possibly with more targets than desired.
Feature requests: what underlying problem are you trying to solve with this feature?
To allow functionality as explained in the example without duplicated actions or action conflicts.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Start with an empty directory and run the following script inside it. For usage, see description above.
config_hash_action_conflict_example_setup.sh.txt
What operating system are you running Bazel on?
Ubuntu 20.04
What's the output of
bazel info release
?development version
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.Used current HEAD from master branch on github
$ git checkout e8a066e9e625a136363338d10f03ed14c26dedfa $ git revert af0e20f9d9808d8fba03b712491ad8851b3d5c26 $ # fix conflicts $ git revert --continue
What's the output of
git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
?$ git remote get-url bazelbuild; git rev-parse bazelbuild/master ; git rev-parse HEAD ; git rev-parse HEAD~1 https://github.com/bazelbuild/bazel.git e8a066e9e625a136363338d10f03ed14c26dedfa 4d2f7a74ead44aecf80f1b0b271f2b9fa2816d01 e8a066e9e625a136363338d10f03ed14c26dedfa
Have you found anything relevant by searching the web?
Any other information, logs, or outputs that you want to share?
Similar to #14023, solution to this issue should make #13587 obsolete.