Output directory hash differs if the same build settings are applied via transition vs. command line

bazelbuild / bazel

a fast, scalable, multi-language and extensible build system

https://bazel.build

Apache License 2.0

23.23k stars 4.07k forks source link

Output directory hash differs if the same build settings are applied via transition vs. command line #12171

Closed amberdixon closed 3 years ago

amberdixon commented 4 years ago

Description of the problem / feature request:

When the output directory hash for the bazel-out subdirectory for generated artifacts is computed on a transition, it is based upon a delta of the build settings before and after the transition. As a result, multiple output directories with different hashes are getting generated for targets that are built with the exact same build settings.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

==> BUILD.bazel <==
load(":stuff.bzl", "fr")

fr(
    name = "FW",
)

==> stuff.bzl <==
def _tr_impl(settings, attr):
    return {
        "//command_line_option:cpu": "ios_x86_64",
    }

tr = transition(
    implementation = _tr_impl,
    inputs = ["//command_line_option:cpu"],
    outputs = ["//command_line_option:cpu"],
)

def _fr_impl(ctx):
    return []

fr = rule(
    cfg = tr,
    implementation = _fr_impl,
    attrs = {
        "_whitelist_function_transition": attr.label(
            default = "@bazel_tools//tools/whitelists/function_transition_whitelist",
        ),
    },
)

Run these commands:
> bazelisk cquery FW
# output includes "//sample:FW (3079a19075a19231a074cb24e8a1e6d40cc62546a3a4a625aed22fcbdfa6c465)"
> bazelisk cquery FW --cpu=ios_x86_64
# output includes "//sample:FW (e4ca851ff6d7872e00a13f59323655db1a538255f7c9f581cbc45ab47b005464)"
> bazelisk  config 3079a19075a19231a074cb24e8a1e6d40cc62546a3a4a625aed22fcbdfa6c465 > /tmp/a
> bazelisk  config e4ca851ff6d7872e00a13f59323655db1a538255f7c9f581cbc45ab47b005464 > /tmp/b
> diff /tmp/a /tmp/b
1,2c1,2
< BuildConfiguration 3079a19075a19231a074cb24e8a1e6d40cc62546a3a4a625aed22fcbdfa6c465:
< Skyframe Key: BuildConfigurationValue.Key[3079a19075a19231a074cb24e8a1e6d40cc62546a3a4a625aed22fcbdfa6c465]
---
> BuildConfiguration e4ca851ff6d7872e00a13f59323655db1a538255f7c9f581cbc45ab47b005464:
> Skyframe Key: BuildConfigurationValue.Key[e4ca851ff6d7872e00a13f59323655db1a538255f7c9f581cbc45ab47b005464]
24c24
<   affected by starlark transition: [cpu]
---
>   affected by starlark transition: []
79c79
<   transition directory name fragment: ST-20f68df4034c
---
>   transition directory name fragment: null

Observe that the transition directory name fragments are different, even though the other build settings are the same.

What operating system are you running Bazel on?

Mac OS 10.15.6.

What's the output of `bazel info release`?

release 3.5.0

If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.

It does not include development version.

What's the output of `git remote get-url origin ; git rev-parse master ; git rev-parse HEAD` ?

I am not running bazel from a local git checkout of the bazel repo. I am using the release build.

Have you found anything relevant by searching the web?

The FunctionTransitionUtil.java file in bazel is intentionally computing the output directory hash only based upon the options that are affected by the starlark transition. It looks like this commit updated bazel to look at the delta of build settings: https://github.com/bazelbuild/bazel/commit/6360ff7ed15c2008c534d58022da4b0a2c739e6e cc @juliexxia

Any other information, logs, or outputs that you want to share?

In this branch, run repro.sh to reproduce the problem: https://github.com/segiddins/bazel-config-multiple-hashes.git My team has confirmed this is a regression since bazel version 3.2.0. In this PR, I also reproduced the problem: https://github.com/bazel-ios/rules_ios/pull/124

segiddins commented 4 years ago

cc @juliexxia, this is a regression we found in Bazel 3.5 in the configuration workstream, and I think you're the right person to triage this?

juliexxia commented 4 years ago

Yep! apologies, I'm a little confused by the issue: historically, starlark transitions always updated the output directory. What 6360ff7 did was make it so that this only happened in cases where options were actually given new values. The output directory of starlark transitioned targets being different isn't new so I'm confused about what the regression is here. Are you saying the output directory wasn't being updated in 3.2.0 and now is?

The FunctionTransitionUtil.java file in bazel is intentionally computing the output directory hash only based upon the options that are affected by the starlark transition. It looks like this commit updated bazel to look at the delta of build settings: 6360ff7 cc @juliexxia

Are you saying this is undesired behavior?

juliexxia commented 4 years ago

FWIW, you're right, the repro above should not have an updated transition directory fragment because as you say, it's the same configuration. But not having the updated transition directory fragment is new as of 6360ff7.

segiddins commented 4 years ago

So I think I can sum up the issue as: "bazel is using multiple output directories for configurations that are otherwise identical". This means targets are getting configured & built multiple times for identical configurations, and can mess up things like rules that try to do IDE configuration, because targets that should be unique in fact are showing up under multiple output directories.

This makes starlark configurations path dependent, in a sense, where the incoming configuration can have an effect on the output configuration's config hash. I don't think this is intentional? My expectation as a rule author is that "identical" configurations should always map to the config hash, and the set of affected by starlark transition flags should not, by itself, change the config hash.

juliexxia commented 4 years ago

So I think I can sum up the issue as: "bazel is using multiple output directories for configurations that are otherwise identical". This means targets are getting configured & built multiple times for identical configurations, and can mess up things like rules that try to do IDE configuration, because targets that should be unique in fact are showing up under multiple output directories.

Yep! This all makes sense and I agree. Before 6360ff7, this wasn't the case, we were hashing all the values that were set by transitions regardless of they were the same value as they came into the transition. After 6360ff7, what you're saying should hold true as we slimmed down the group of post-transition configurations that get an updated output directory to those where the transition actually changed the value of an option (by comparing the output value to the input value).

What I'm confused about is the timeline/regression. Starlark transitions have historically always updated the output directory and only recently, with 6360ff7, have started not always updating the output directory, only doing so when a flag is actually changed. So I'm confused about how you weren't seeing updated output directory paths before 6360ff7 and are seeing them now since it should be exactly the opposite. I wanted to clarify that's indeed what you're seeing so I can dig further.

segiddins commented 4 years ago

I think we maybe were seeing updating output directory paths, but they were being updated to the same value? Because all the outputs were being taken into consideration?

Again, pure speculation, but I feel like what needs to be done is look at the entire output configuration diffed against the "default" configuration, rather than only considering the diff of transition input/output flags against the incoming configuration? Not sure if that makes any sense.

So I'm confused about how you weren't seeing updated output directory paths before 6360ff7 and are seeing them now since it should be exactly the opposite

From looking at the diff, I have a hunch that the change from https://github.com/bazelbuild/bazel/commit/6360ff7ed15c2008c534d58022da4b0a2c739e6e#diff-01af620735f7a29048977650332a354cR415-R416 to only looking at inputs to looking at inputs and outputs might be the trigger?

juliexxia commented 4 years ago

I think we maybe were seeing updating output directory paths, but they were being updated to the same value? It would be really helpful to get confirmation/clarification about what you were seeing in old bazel that you're seeing differently in new bazel and why it's causing issues for you.

Again, pure speculation, but I feel like what needs to be done is look at the entire output configuration diffed against the "default" configuration, rather than only considering the diff of transition input/output flags against the incoming configuration? Not sure if that makes any sense.

That's the purpose of the "affected by starlark transition" bit. So the way the logic works:

# what happens when a transition is performed
maybe_new_values = get the map returned by the transition implementation function
this_time_affected_options = {}
for option, new_value in maybe_new_values:
  if new_value != input value:
    put option in affected_options

if affected_options != empty:
  String to_hash
  affected_options = affected_optins.add_all(this_time_affected_options)
  for option in affected_options
    to_hash+=option + value 
  transition directory name fragment = hash(to_hash)

The key here is that we maintain a list of affected options over an entire build and we only update the transition directory name fragment when a transition is performed. So given the following graph structure:

A I transition1 that sets --foo=bar v B I transition2 that sets --foo=blargh v C I transition3 that sets --zoom=zop v D

If you build A with --foo=bar, then *in theory** you get the following configured targets: A: transition directory name fragment = null, affected by starlark transition = {} B: transition directory name fragment = null, affected by starlark transition = {} C: transition directory name fragment = hash_of("foo=blargh"), affected by starlark transition = {foo} D: transition directory name fragment = hash_of("foo=blargh, zoom=zop"), affected by starlark transition = {foo,zop}

From looking at the diff, I have a hunch that the change from 6360ff7#diff-01af620735f7a29048977650332a354cR415-R416 to only looking at inputs to looking at inputs and outputs might be the trigger?

All this does is give the input dict to the transition access to the default values of the output settings. Build settings that are set to their default value are not stored in the configuration (so that configurations with non-set build settings and explicitly-set-to-default build settings look the same) so if we don't have a recorded value for the build setting in the configuration, we need to know what the default value of the setting is so we can either record the new value or recognize the new value as the default and continue to ignore it.

juliexxia commented 4 years ago

I said *in theory** in the previous post because it sounds like you're getting different results. But again, need some clarity on what you're seeing.

segiddins commented 4 years ago

What would help provide that clarity? Given I have the sort of single-file repro, I'm happy to do whatever debugging / logging you think would be helpful

juliexxia commented 4 years ago

Do you mind verifying for me that you're running Bazel 3.5.0? Is it possible to run it with Bazel 3.5.0 proper and see if you get the same results? I notice you're running bazelisk - are you setting the bazel version anywhere? See: https://github.com/bazelbuild/bazelisk#how-does-bazelisk-know-which-bazel-version-to-run

I'm running the following repro (adapted from yours) with regular bazel 3.5.1 and I'm not seeing the same results you are:

==> BUILD.bazel <==
load(":stuff.bzl", "fr", "other")

fr(
    name = "FW",
)

other(name = "other")

==> stuff.bzl <==
def _tr_impl(settings, attr):
    return {
        "//command_line_option:cpu": "ios_x86_64",
    }

tr = transition(
    implementation = _tr_impl,
    inputs = [],
    outputs = ["//command_line_option:cpu"],
)

def _fr_impl(ctx):
    return []

fr = rule(
    cfg = tr,
    implementation = _fr_impl,
    attrs = {
        "_whitelist_function_transition": attr.label(
            default = "@bazel_tools//tools/whitelists/function_transition_whitelist",
        ),
    },
)

other = rule(implementation = _fr_impl)

$ bazel cquery :all --cpu=ios_x86_64
INFO: Analyzed 3 targets (0 packages loaded, 38 targets configured).
INFO: Found 3 targets...
//:FR (9ea510b1ff7233388168cb46511ecff9b13a00b0bd26a252d2099dc6b8a22a89)
//:other (9ea510b1ff7233388168cb46511ecff9b13a00b0bd26a252d2099dc6b8a22a89)
INFO: Elapsed time: 0.088s
INFO: 0 processes.
INFO: Build completed successfully, 0 total actions

$ bazel config 9ea510b1ff7233388168cb46511ecff9b13a00b0bd26a252d2099dc6b8a22a89 | grep "transition"
INFO: Displaying config with id 9ea510b1ff7233388168cb46511ecff9b13a00b0bd26a252d2099dc6b8a22a89
  affected by starlark transition: []
  transition directory name fragment: null

juliexxia commented 4 years ago

Another helpful verification would be at what bazel version were you not seeing this behavior because I would expect this behavior to be in all bazel versions before 3.5.0

segiddins commented 4 years ago

https://github.com/segiddins/bazel-config-multiple-hashes.git shows this is a regression between 3.4.0 and 3.5.0

juliexxia commented 4 years ago

There seem to be some build failures In that repro. Where are the config hashes that you use at the end coming from? I don't see them show up anytime before the config command:

"bazel config 1908601e8157a3765e39ede2fc98d23c9c1be24080e5589fc71fa6a15d7d9ff1 28e83bd7c518774a6378a7ee5024c969ab10fdacb062577d0f528bb70b3e6917"

Can you run the repro I wrote above? It's smaller and should give a clear signal.

segiddins commented 4 years ago

There seem to be some build failures In that repro

The build failures are intentional -- that's how I grab the pair of configs to diff.

I or @amberdixon will try to run your repro.

amberdixon commented 4 years ago

Hi Julie! Sorry about the delayed response here. Here are my responses to your comments and my notes about the repro case.

Are you saying the output directory wasn't being updated in 3.2.0 and now is?

That's fine that the output directory gets updated. But it seems like the parameters used to compute the new directory hash should be a function of all the build settings in the new configuration, not just the changed settings. If you only look at the changed settings on the transition, then different hashes will end up getting used for targets that are actually configured with the same settings.

Are you saying this is undesired behavior? After 6360ff7, what you're saying should hold true as we slimmed down the group of post-transition configurations that get an updated output directory to those where the transition actually changed the value of an option (by comparing the output value to the input value).

The intent of 6360ff7 is to not change the output directory hash on a no-op. That is perfectly fine. However the bug (we think) is that this commit computes the hash from the changed build settings, not from all of them. This becomes a problem with one or more build settings changes. Therefore, this commit is not just affecting no-op transitions .. it's affecting all transitions.

You presented an example with these targets and transitions:

A: transition directory name fragment = null, affected by starlark transition = {} B: transition directory name fragment = null, affected by starlark transition = {} C: transition directory name fragment = hash_of("foo=blargh"), affected by starlark transition = {foo} D: transition directory name fragment = hash_of("foo=blargh, zoom=zop"), affected by starlark transition = {foo,zop}

For D, the transition directory name fragment will only be hash_of("zoom=zop"). This is beacuse foo=blargh has not changed on the transition between C -> D.

Let's consider the following scenario. In your dependency graph, let's say there is another way of getting to D from E, as follows:

E I transition3 that sets --zoom=zop, --foo=blargh v D

In either case, D is configured with --zoom=zop, --foo=blargh. But we see that transitioning from C->D vs. from E->D results in a different output directory has for D. C: transition directory name fragment = hash_of("foo=blargh"), affected by starlark transition = {foo} D: transition directory name fragment = hash_of("zoom=zop"), affected by starlark transition = {zop}

E: transition directory name fragment = null, affected by starlark transition = {} D: transition directory name fragment = hash_of("foo=blargh, zoom=zop"), affected by starlark transition = {foo,zop}

All this does is give the input dict to the transition access to the default values of the output settings. Build settings that are set to their default value are not stored in the configuration (so that configurations with non-set build settings and explicitly-set-to-default build settings look the same) so if we don't have a recorded value for the build setting in the configuration, we need to know what the default value of the setting is so we can either record the new value or recognize the new value as the default and continue to ignore it.

Perhaps the problem here is that the code doesn't actually know what the "default" settings are. Therefore it regards any unchanged setting as a "default."

Confirmed we are using Bazel 3.5.0 with bazelisk, as follows:

~/Development/rules_ios/sample master * cat ../.bazelversion 
3.5.0
~/Development/rules_ios/sample master * ls ../WORKSPACE # the workspace root is at the rules_ios directory
../WORKSPACE
~/Development/rules_ios/sample master * ls WORKSPACE
ls: WORKSPACE: No such file or directory
~/Development/rules_ios/sample master * ls
BUILD.bazel stuff.bzl

In your repro case, you ran bazel cquery :all --cpu=ios_x86_64. However, in my repro case, I first ran bazel cquery without the cpu=ios_x86_64 flag. That caused the transition to kick in and set the cpu flag. Since the cpu flag was being changed, its value affected the output directory hash.

Subsequently, I ran bazel cquery with the cpu=ios_x86_64 flag. In this case, the transition did not change anything. Therefore, no new output directory hash was computed.

But now unfortunately we find that there are different hashes being used for FW, even though in either case, it's being built with the same CPU flag.

I hope that helps illustrate the problem!

On my end, I will try to repro this problem on an older bazel version and also on 3.5.1.

juliexxia commented 4 years ago

For D, the transition directory name fragment will only be hash_of("zoom=zop"). This is because foo=blargh has not changed on the transition between C -> D.

This isn't true because the "affected by starlark transition" is a list that is maintained and grown over the entire build. This list is what's used to determine what options will be hashed to form the output directory. So what happens in the code is that D should have the same output directory regardless of it was built via E or C (assuming they set the same values) because in both cases, the list of "affected by starlark transition" contains both foo and zop

Here's the code that adds the newly updated options to the running list of options affected by starlark options https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java;l=373?q=FunctionTransitionUtil&ss=bazel

Here's the code that runs over the entire running list of changed options, not just the newly updated options: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/starlark/FunctionTransitionUtil.java;l=385?q=FunctionTransitionUtil&ss=bazel

In your repro case, you ran bazel cquery :all --cpu=ios_x86_64. However, in my repro case, I first ran bazel cquery without the cpu=ios_x86_64 flag. That caused the transition to kick in and set the cpu flag. Since the cpu flag was being changed, its value affected the output directory hash. Subsequently, I ran bazel cquery with the cpu=ios_x86_64 flag. In this case, the transition did not change anything. Therefore, no new output directory hash was computed.

What is the unexpected behavior here? The behavior of the output directory hash changing without --cpu=ios_x86_64 set on the command line and not changing with --cpu=ios_x86_64 set on the command line is what I would expect to happen.

amberdixon commented 4 years ago

This isn't true because the "affected by starlark transition" is a list that is maintained and grown over the entire build. This list is what's used to determine what options will be hashed to form the output directory. So what happens in the code is that D should have the same output directory regardless of it was built via E or C (assuming they set the same values) because in both cases, the list of "affected by starlark transition" contains both foo and zop

Ah yes, you are right, the "affected by starlark transition" does seem to aggregate settings that are affected by transitions. However, I still think there are some problems with the current implementation. In my company's repo, I have a scenario where all the build settings for a target are the same, regardless of where we traverse into it from:

AppTarget -> D TestTarget -> D D

In each case, D ends up with the same set of build settings. And yet 3 different directory hashes are generated and the "affected by starlark transition" list looks different in each case. I'll work on providing you with a repro case for this scenario.

What is the unexpected behavior here? The behavior of the output directory hash changing without --cpu=ios_x86_64 set on the command line and not changing with --cpu=ios_x86_64 set on the command line is what I would expect to happen.

In the original repro case I presented, we see the FW is being built with the same value for cpu. We see that there is no diff in /tmp/a and /tmp/b, other than which fields were affected by starlark transition. At the end of the day, the built artifact is being built with the same settings and therefore should map to the same output directory hash (I think).

juliexxia commented 4 years ago

In each case, D ends up with the same set of build settings. And yet 3 different directory hashes are generated and the "affected by starlark transition" list looks different in each case. I'll work on providing you with a repro case for this scenario.

Gotcha, that doesn't sound right. Thanks in advance for the repro. That would be very helpful. Is it possible the flags on the original command line/blazerc is different in the different cases? That would also lead to different directory hashes even if D ended up in the same configuration across the builds.

In the original repro case I presented, we see the FW is being built with the same value for cpu. We see that there is no diff in /tmp/a and /tmp/b, other than which fields were affected by starlark transition. At the end of the day, the built artifact is being built with the same settings and therefore should map to the same output directory hash (I think).

The last sentence is correct! Thanks for clarifying. Using your repro I was unable to get the same results from my end, I got the expected same output directory hash. Not sure what's going on but hopefully a further repro will help us get to the bottom of it.

It would be helpful if you could try setting up and running the series of command I posted above as well (in https://github.com/bazelbuild/bazel/issues/12171#issuecomment-702313974)

amberdixon commented 4 years ago

I have tried the following on bazel versions 3.5.0 and 3.5.1, but am seeing the same behavior.

I have a scenario where all the build settings for a target are the same, regardless of where we traverse into it from:

AppTarget -> D TestTarget -> D D

In each case, D ends up with the same set of build settings. And yet 3 different directory hashes are generated and the "affected by starlark transition" list looks different in each case. I'll work on providing you with a repro case for this scenario.

I think this particular issue has to do with the way the AppleCrosstoolTransition from the AppleBinary rule in the bazel repo gets applied. (There is different from the issue I reported in the initial description in the bug report, where the directory output hash differs, depending on whether or not the build setting was applied via the command line or via a transition.)

Here's how to repro the issue with AppleCrosstoolTransition problem:

==> stuff.bzl <==

def _tor_impl(settings, attr):
    return {
        "//command_line_option:cpu": "ios_x86_64", 
    }

tor = transition(
    implementation = _tor_impl,
    inputs = [],
    outputs = [
      "//command_line_option:cpu",
    ],

)

def _ftor_impl(ctx):
    return []

ftor = rule(
    implementation = _ftor_impl,
    attrs = {
      "deps": attr.label_list(cfg=tor),
        "_whitelist_function_transition": attr.label(
            default = "@bazel_tools//tools/whitelists/function_transition_whitelist",
        ),
    },
)

==> BUILD.bazel <==
load("@build_bazel_rules_apple//apple:ios.bzl", "ios_application")
load("@rules_cc//cc:defs.bzl", "objc_library")
load(":stuff.bzl", "ftor")
load(":stufforig.bzl", "fr")

fr(
    name = "FWorig",
)

exports_files([
    "Info.plist",
])

objc_library(
    name = "D",
    srcs = glob(["FW.h", "FW.m"]),
)

ftor(
  name="TestTarget",
  deps = [":D"],
)

ios_application(
  name = "AppTarget",
  deps = [":D"],
  bundle_id = "ios.app",
  infoplists = [":Info.plist"],
  families = [
      "iphone",
      "ipad",
  ],
  minimum_os_version = "12.0",
)

In this case, you see that D is hashing to 3 different directory hashes:

~/sample-bug-report-bazel bazelisk cquery :all
INFO: Analyzed 6 targets (1 packages loaded, 13 targets configured).
INFO: Found 6 targets...
//:AppTarget (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
//:AppTarget.__internal__.apple_binary (e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511)
//:AppTarget.swift_runtime_linkopts (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
//:AppTarget.swift_runtime_linkopts (fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
//:AppTarget_entitlements (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
//:AppTarget_entitlements (fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
//:D (bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece)
//:D (e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511)
//:D (fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
//:TestTarget (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
INFO: Elapsed time: 0.219s
INFO: 0 processes.
INFO: Build completed successfully, 0 total actions
~/sample-bug-report-bazel

Notice that D has 3 different configs. For the config bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece, D is getting built via an ongoing edge transition from TestTarget, which applies the build setting "cpu." For the config e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511 however, D is getting built via AppTarget, which applies AppleCrosstoolTransition. In either case, the build settings for D do not differ. But in the first case, the "affected by starlark transition" array includes cpu, whereas in the second case it does not.

~/sample-bug-report-bazel bazelisk config bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511
INFO: Displaying diff between configs bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece and e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511
Displaying diff between configs bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece and e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511
FragmentOptions com.google.devtools.build.lib.analysis.config.CoreOptions {
  affected by starlark transition: [cpu], []
  transition directory name fragment: ST-20f68df4034c, null
}

This is a problem, since in the second case, the AppleCrossTool transition is actually updating the cpu build setting:

~/sample-bug-report-bazel bazelisk cquery 'allpaths(AppTarget, D)' --transitions=full
INFO: Analyzed 2 targets (0 packages loaded, 0 targets configured).
INFO: Found 2 targets...
AppleCrosstoolTransition -> //:D (fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
ComposingTransition -> //:AppTarget.__internal__.apple_binary (e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511)
  deps#//:D#((AppleBinaryTransition + AppleCrosstoolTransition) + (TestTrimmingTransition + ConfigFeatureFlagTaggedTrimmingTransition)) ( -> 82a7b165ed9b0c3f60255615cb8e9f7976b06f58eff45ab48d0bcdd99e4e1bb1)
    platforms:[@local_config_platform//:host] -> [[]]
    cpu:darwin_x86_64 -> [ios_x86_64]
    apple configuration distinguisher:APPLE_CROSSTOOL -> [APPLEBIN_IOS]
    apple_platform_type:macos -> [ios]
    apple_split_cpu: -> [x86_64]
    ios_minimum_os:null -> [com.google.devtools.build.lib.rules.apple.DottedVersion$Option@1707a3]
  deps#//:AppTarget.swift_runtime_linkopts#(AppleBinaryTransition + (TestTrimmingTransition + ConfigFeatureFlagTaggedTrimmingTransition)) ( -> 82a7b165ed9b0c3f60255615cb8e9f7976b06f58eff45ab48d0bcdd99e4e1bb1)
    platforms:[@local_config_platform//:host] -> [[]]
    cpu:darwin_x86_64 -> [ios_x86_64]
    apple configuration distinguisher:APPLE_CROSSTOOL -> [APPLEBIN_IOS]
    apple_platform_type:macos -> [ios]
    apple_split_cpu: -> [x86_64]
    ios_minimum_os:null -> [com.google.devtools.build.lib.rules.apple.DottedVersion$Option@1707a3]
//:AppTarget.swift_runtime_linkopts (fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
  deps#//:D#(AppleCrosstoolTransition + (TestTrimmingTransition + ConfigFeatureFlagTaggedTrimmingTransition)) ( -> fcdd87f586a42f4db4b5f3495f5afb0aa75a975f985cbdb6fb009552b188c547)
//:AppTarget (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
  deps#//:AppTarget.__internal__.apple_binary#((AppleCrosstoolTransition + NoTransition) + (TestTrimmingTransition + ConfigFeatureFlagTaggedTrimmingTransition)) ( -> 114a3a25c1d8e83729ab0498cf22b9e0edbba990567fc771ebf62ff4ce51fa56)
    platforms:[@local_config_platform//:host] -> [[]]
    cpu:darwin -> [darwin_x86_64]
    apple configuration distinguisher:UNKNOWN -> [APPLE_CROSSTOOL]
INFO: Elapsed time: 0.103s
INFO: 0 processes.
INFO: Build completed successfully, 0 total actions

Is it possible the flags on the original command line/blazerc is different in the different cases? That would also lead to different directory hashes even if D ended up in the same configuration across the builds.

In the example above, we don't have different command line or bazelrc setting ... but isn't it a bug that would result in different directory hashes, even if D ended up in the same configuration across builds?

It would be helpful if you could try setting up and running the series of command I posted above as well (in #12171 (comment))

I ran your repro steps. I saw the same result as you. However, in order to repro this issue, I think you would need to run the following steps. First, run everything with the cpu command line flag, just like you did. Then try running it without the command line flag. You will notice that FW is now pointing to another output directory hash ... despite the fact that the build settings for FW have not changed (in the second case, the cpu build setting for FW gets applied from the transition, not the command line).

~/sample-bug-report-bazel bazelisk cquery --cpu=ios_x86_64 :all
INFO: Build option --cpu has changed, discarding analysis cache.
INFO: Analyzed 2 targets (0 packages loaded, 9 targets configured).
INFO: Found 2 targets...
//:FW (9a6098a784cca06503a5a50a9b561f408b1e05a1d2fa52a39f25e778b3fe29c8)
//:other (9a6098a784cca06503a5a50a9b561f408b1e05a1d2fa52a39f25e778b3fe29c8)
INFO: Elapsed time: 0.105s
INFO: 0 processes.
INFO: Build completed successfully, 0 total actions
~/sample-bug-report-bazel bazelisk cquery :all
INFO: Build option --cpu has changed, discarding analysis cache.
INFO: Analyzed 2 targets (0 packages loaded, 14 targets configured).
INFO: Found 2 targets...
//:FW (fc05326ef17847a4afeca3b58dfb49098eb8bb048b976e7bceb4e1309e4727e0)
//:other (df1be34ef9742c8a8fbf877248919bd886d53a7454913b50c6d6349e4be2a8b5)
INFO: Elapsed time: 0.086s
INFO: 0 processes.
INFO: Build completed successfully, 0 total actions
~/sample-bug-report-bazel bazelisk config 9a6098a784cca06503a5a50a9b561f408b1e05a1d2fa52a39f25e778b3fe29c8 fc05326ef17847a4afeca3b58dfb49098eb8bb048b976e7bceb4e1309e4727e0
INFO: Displaying diff between configs 9a6098a784cca06503a5a50a9b561f408b1e05a1d2fa52a39f25e778b3fe29c8 and fc05326ef17847a4afeca3b58dfb49098eb8bb048b976e7bceb4e1309e4727e0
Displaying diff between configs 9a6098a784cca06503a5a50a9b561f408b1e05a1d2fa52a39f25e778b3fe29c8 and fc05326ef17847a4afeca3b58dfb49098eb8bb048b976e7bceb4e1309e4727e0
FragmentOptions com.google.devtools.build.lib.analysis.config.CoreOptions {
  affected by starlark transition: [], [cpu]
  transition directory name fragment: null, ST-20f68df4034c
}

amberdixon commented 4 years ago

Hi @juliexxia ! I was wondering if my latest comment was helpful. I was hoping that the bazel team could address the issue where different directory hashes are used for artifacts built with the same set of build settings. We do see that different hashes are getting used when the build settings are applied from the command line vs. from a transition. It has necessitated a number of workarounds on our end. Thanks!

juliexxia commented 4 years ago

Hi! Apologies it's been a busy couple of weeks. I'll read through you last couple of responses and get back to you in the next couple of days.

juliexxia commented 4 years ago

Notice that D has 3 different configs. For the config bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece, D is getting built via an ongoing edge transition from TestTarget, which applies the build setting "cpu." For the config e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511 however, D is getting built via AppTarget, which applies AppleCrosstoolTransition. In either case, the build settings for D do not differ. But in the first case, the "affected by starlark transition" array includes cpu, whereas in the second case it does not.

AppleCrosstoolTransition is a native transition defined in java code: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/rules/objc/AppleCrosstoolTransition.java;l=39?q=AppleCrosstoolTransition&ss=bazel. The affected by starlark transition andtransition directory name fragment bits are only set by starlark transitions. Which leads to...

In the example above, we don't have different command line or bazelrc setting ... but isn't it a bug that would result in different directory hashes, even if D ended up in the same configuration across builds

It's not ideal. Ideally we would have perfectly deduplicated outputs, but it's not technically a bug. Bazel doesn't actually guarantee any output directory structure; how we do output directories has changed over time and will continue to change. When it's difficult to get the de-duping just right, we prefer to play it safe and create new output paths instead of risking clobbering existing artifacts. For example, we use the affected by starlark transition andtransition directory name fragment as an overly cautious way of making sure starlark transitions aren't overwriting anything.

You will notice that FW is now pointing to another output directory hash ... despite the fact that the build settings for FW have not changed

But the build settings for FW have changed in the second case because the default value of --cpu is not ios_x86_64, it's one of the values here depending on your OS: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/AutoCpuConverter.java;l=35;drc=a5a4f47295ccb6cd81cf8732772c849b1a8b9d58. Let's say the default value of --cpu is detected as "k8" for examples sake.

In the bazelisk cquery --cpu=ios_x86_64 :all case, FW and other have the same configuration hash because the transition sets --cpu to to same value as the command line - we register that as a no change and don't update anything. Both FW and other are built with --cpu=ios_x86_64.

In the ~/sample-bug-report-bazel bazelisk cquery :all case, FW and other have different configuration hashes because for :FW, --cpu=ios_x86_64 and for :other, --cpu=k8.

The key here is affected by starlark transition updates to all native flags that are changed in value from the command line configuration.

segiddins commented 4 years ago

I think the motivating problem we're facing is that, before this change, it was possible for a starlark transition to force a target into a canonical configuration, so when it appeared in the configured target graph multiple times, it would always be a single target. That meant, for example, you wouldn't try to link two (identical) static libraries -- which is super important when building native code! Now, because of the native options, there's no way to force a single configuration for a target -- either you get a consistent value for the native option (and an inconsistent affected by starlark transition), or vice-versa.

I think that gets at the crux of the issue -- with this change, there's no way for us to write custom starlark transitions that interplay safely with the native transitions.

juliexxia commented 4 years ago

A couple questions -

Does the bug title still accurately represent what we're talking about here? If not, can it be updated?

before this change, it was possible for a starlark transition to force a target into a canonical configuration, so when it appeared in the configured target graph multiple times, it would always be a single target.

Which change are you talking about specifically? Can you give a minimal example of the multiple configured targets that used to all have the same configuration and now don't?

Historically it's never been possible for starlark transitions to put a configured target into any configuration that a native transition could but it in. Historically, transition directory name fragment has always been updated by a starlark transition and never updated by a native transition. Recently I made a change so that starlark transitions that don't actually change anything (e.g. set --cpu=k8 when --cpu already =k8) don't update transition directory name fragment or affected by starlark transition. So in the new world we are strictly closer to the concept of starlark transitions being able to produce configurations that are the same as native transitions.

segiddins commented 4 years ago

Sorry, I didn't mean put in the same configuration as a native transition. I meant it was possible to force a target with a star lark transition into a single (given a reasonable universe of incoming configurations) output configuration. Eg something that looked like a constant transition (same output hash given some restricted set of inputs) would lead to the same output configuration. This is no longer true because now only changed settings get taken into consideration when computing the directory fragment, meaning a constant transition can still result in multiple fragments, depending on the input configurations. @amberdixon To correct me if I'm wrong

juliexxia commented 4 years ago

This is no longer true because now only changed settings get taken into consideration when computing the directory fragment, meaning a constant transition can still result in multiple fragments, depending on the input configurations. @amberdixon To correct me if I'm wrong

Thank you, that's clarifying. Is there a scenario in a single build where this is the case? For example:

either you get a consistent value for the native option (and an inconsistent affected by starlark transition), or vice-versa.

it would be helpful to have a repro and command line that shows this happening.

juliexxia commented 4 years ago

Hey all, I think I've finally wrangled a small repro of the situation you're discussing and a fix is on the way. Stay tuned.

jerrymarino commented 3 years ago

I spent some time looking at this issue in the context of rules_ios and it looks like there are a few cases that result in "different transitions on the CLI".

rules_ios VS rules_apple transition deltas

First different transitions were applied differently when swift_library was built with ios_application vs rules_ios's apple_framework from either the CLI or in Bazel. This was fixed in the following PRs by auditing transitions / configs and fixing the cases where they were different:

adding apple_common.multi_arch_split to apple_framework.deps - https://github.com/bazel-ios/rules_ios/pull/188
the rules_ios transition yields the same result when used rules_apple - https://github.com/bazel-ios/rules_ios/pull/196

There is a test case in this PR https://github.com/bazel-ios/rules_ios/pull/196

Redundant builds of ios_application deps due to no transitions

Building a dependency of an ios_applicaiton will net a different build when the dep is built directly via CLI or with another top level target type. This is because rules_apple and rules_ios apply a iOS specific transition. A workaround is to build the target in question along side a rules_apple.ios_unit_test, rules_apple.ios_applicaiton, and now a rules_ios.apple_framework. IMO it'd be cool if we could apply iOS transitions to the CLI build in the same way they were applied with the ios_application rule.

mobileink commented 3 years ago

Any news on this issue? I believe I have run into the same issue using my OCaml rules, and unfortunately it may be a show-stopper. The problem in a nutshell:

OCaml artifacts are not namespaced
OCaml compiles an interface file (.cmi) and implementation file (.cmx) separately
Critically, it records in each artifact a hash of the artifacts it depends on
So if e.g. the same interface file is compiled twice, in two different transition output dirs, then we get errors like:

Error: The files bazel-out/darwin-fastbuild/bin/parsing/__obazl/Tok.cmi
       and bazel-out/darwin-fastbuild-ST-7a1e0f003fa8/bin/gramlib/__obazl/Gramlib__Plexing.cmi
       make inconsistent assumptions over interface Util

This means that Tok.cmi and Gramlib__Plexing.cmi were compiled using different Util.cmi files, which we can verify by using a tool (ocamlobjinfo) to inspect the contents - each will have an entry for Util associated with a different hash (so the error message is not quite accurate.) Util gets compiled twice because transition functions create different output directories, even though Util is compiled in only one way.

The Bazel problem boils down to a diamond dependency. The primary build target depends (separately} on Tok and GramlibPlexing, which both depend on Util. But only GramlibPlexing involves transition functions, and I have been unable to find a way to make both Tok and Gramlib__Plexing depend on the same build of Util. I've tried making the transition function reset the configs to their initial state, matching the state in effect when Tok is built, to no avail.

Here's my cquery output:

digraph mygraph {
  node [shape=box];
  "//parsing:_CLexer (ab0e5c6)"
  "//parsing:_CLexer (ab0e5c6)" -> "//lib:_CWarnings (ab0e5c6)"
  "//parsing:_CLexer (ab0e5c6)" -> "//parsing:_CLexer.cmi (ab0e5c6)"
  "//parsing:_CLexer.cmi (ab0e5c6)"
  "//parsing:_CLexer.cmi (ab0e5c6)" -> "//gramlib:gramlib (ab0e5c6)"
  "//parsing:_CLexer.cmi (ab0e5c6)" -> "//parsing:_Tok (ab0e5c6)"
  "//parsing:_Tok (ab0e5c6)"
  "//parsing:_Tok (ab0e5c6)" -> "//parsing:_Tok.cmi (ab0e5c6)"
  "//parsing:_Tok.cmi (ab0e5c6)"
  "//parsing:_Tok.cmi (ab0e5c6)" -> "//interp:_NumTok (ab0e5c6)"
  "//lib:_CWarnings (ab0e5c6)"
  "//lib:_CWarnings (ab0e5c6)" -> "//lib:_Util (ab0e5c6)"
  "//interp:_NumTok (ab0e5c6)"
  "//interp:_NumTok (ab0e5c6)" -> "//interp:_NumTok.cmi (ab0e5c6)"
  "//interp:_NumTok.cmi (ab0e5c6)"
  "//interp:_NumTok.cmi (ab0e5c6)" -> "//lib:_Util (ab0e5c6)"
  "//lib:_Util (ab0e5c6)"
  "//lib:_Util (ab0e5c6)" -> "//lib:_Util.cmi (ab0e5c6)"
  "//lib:_Util.cmi (ab0e5c6)"
  "//gramlib:gramlib (ab0e5c6)"
  "//gramlib:gramlib (ab0e5c6)" -> "//gramlib:_Grammar (69e0337)"
  "//gramlib:gramlib (ab0e5c6)" -> "//gramlib:_Plexing (69e0337)"
  "//gramlib:_Grammar (69e0337)"
  "//gramlib:_Grammar (69e0337)" -> "//gramlib:_Grammar.cmi (69e0337)"
  "//gramlib:_Grammar.cmi (69e0337)"
  "//gramlib:_Grammar.cmi (69e0337)" -> "//gramlib:_Plexing (69e0337)"
  "//gramlib:_Plexing (69e0337)"
  "//gramlib:_Plexing (69e0337)" -> "//gramlib:_Plexing.cmi (69e0337)"
  "//gramlib:_Plexing.cmi (69e0337)"
  "//gramlib:_Plexing.cmi (69e0337)" -> "//lib:_Util (bcf0bde)"
  "//lib:_Util (bcf0bde)"
  "//lib:_Util (bcf0bde)" -> "//lib:_Util.cmi (bcf0bde)"
  "//lib:_Util.cmi (bcf0bde)"
}

Let me put this in the form of a question: what is the complete set of factors determining the output directory? The "config state" (build settings) obviously; anything else? Some of the OCaml rules involve actions to copy/rename/preprocess source files; could that have an effect?

It would be very, very helpful if I could find a way to make transitions work so I don't have to throw away months of work.

Thanks,

Gregg

gregestren commented 3 years ago

Sorry for the delay, @mobileink. Julie, who's the Bazel resident expert on this stuff, has shifted her focus off Bazel. We're trying to pick up the slack. Check out my comment at https://github.com/bazelbuild/bazel/issues/13317#issuecomment-819834807. Does that help?

Short story is there's lots we can, and should, and hopefully will (in a timely way) do to improve this. One of Julie's last contributions was to lay out a clear analysis of the status quo and possible improvements.

Sorry, I'm skimming through your comment and can't immediately tell if your scenario falls into

1) pure transition inefficiency that the above can improve 2) a dep structure that Bazel doesn't understand means the same target can be shared in both instances (the diamond scenario you describe). This would be harder.

Could you summarize your exact transition logic, and how/where it's consumed, a bit more?

mobileink commented 3 years ago

Thanks for the response. FYI I'm hoping to get back to this next week...

On Wed, Apr 14, 2021 at 4:25 PM Greg @.***> wrote:

Sorry for the delay, @mobileink https://github.com/mobileink. Julie, who's the Bazel resident expert on this stuff, has shifted her focus off Bazel. We're trying to pick up the slack. Check out my comment at #13317 (comment) https://github.com/bazelbuild/bazel/issues/13317#issuecomment-819834807. Does that help?

Short story is there's lots we can, and should, and hopefully will (in a timely way) do to improve this. One of Julie's last contributions was to lay out a clear analysis of the status quo and possible improvements.

Sorry, I'm skimming through your comment and can't immediately tell if your scenario falls into

pure transition inefficiency that the above can improve

a dep structure that Bazel doesn't understand means the same target can be shared in both instances (the diamond scenario you describe). This would be harder.

Could you summarize your exact transition logic, and how/where it's consumed, a bit more?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bazelbuild/bazel/issues/12171#issuecomment-819844399, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAES2NG2KXWKNMAHJSNKCMDTIYB5JANCNFSM4RY3VHDA .

torgil commented 3 years ago

We currently use transition on the "//command_line_option:output directory name" to work around similar issues where different transition sequences are giving different output directory names (thus duplicating actions in D): A -> B -> D A -> C -> D E -> D

A build-setting set by B but set back to default value before reaching D gave different hashes for D in the different branches depending on if this build-setting was included in the hash or not.

In this case, we use this patch to avoid action conflicts due to different hashes: https://github.com/bazelbuild/bazel/pull/13587

gregestren commented 3 years ago

Hi all.

@sdtwigg and I are actively looking into improvements for this problem.

I think we all agree that a better implementation would set hashes as a pure diff from a baseline configuration, vs. examining the exact sequence of transitions down the graph and trying to keep their results consistent. In other words, we don't really care how we get to a particular configuration. We ultimately just care how that configuration differs from a baseline.

There are two promising candidates for that baseline:

1) The top-level configuration (i.e. the config matching the flags you pass at the command line) 2) The configuration representing the default values for all flags.

For example, if you're running on a Mac the default value for --cpu may be ppc but $ bazel build //:foo --cpu=arm has a top-level value of arm.

I lean toward 2) because I think it offers more consistency across builds, which may be helpful for remote execution caching.

@juliexxia sketched up some great thoughts about this a while ago. Maybe we could prep up a Google doc to consolidate our thoughts and approach.

I'm still unsure how this mixes with native transitions, which have their own independent logic for path names. Ideally we'd merge them all into the same algorithm. But that's a separate project.

Tactically, @sdtwigg is focusing on adding a new --experimental mode with a new algorithm like the above. Then we can all safely kick the tires and test our expectations before ubiquitizing it. I'm focusing on what I hope is the smaller, more quickly addressable sub-problem of Starlark flag changes affecting hashes even when the build never uses them.

I also want to highlight an interesting tension in this challenge:

If we unnecessarily fork configurations (two configs that are conceptually the same have different hashes), we lose cache opportunities and get slower builds
If we're too aggressive about merging hashes, we get action conflicts. That's a somewhat different problem but it can be triggered by the choices we make here.

gregestren commented 3 years ago

FYI for my part: I think https://github.com/bazelbuild/bazel/pull/13580 is the solution.

gregestren commented 3 years ago

I'm closing this for https://github.com/bazelbuild/bazel/pull/13580. There remain other optimizations to make, as reviewed at https://github.com/bazelbuild/bazel/issues/12171#issuecomment-912765906. Let's port over remaining concerns to better-scoped issues.

gregestren commented 3 years ago

@mobileink @jerrymarino can we sync to clarify the nature of your problems from the above conflicts?

I'm trying to prioritize improvements as mentioned in my previous comments. Understanding which use cases they would and wouldn't solve would help clarify.

mobileink commented 3 years ago

@mobileink @jerrymarino can we sync to clarify the nature of your problems from the above conflicts?

I'm trying to prioritize improvements as mentioned in my previous comments. Understanding which use cases they would and wouldn't solve would help clarify.

Sorry for the delayed response. I don't really have much to add beyond my previous message. Basically I need to be able to cancel a transition. So if I have A -> B -> X, and C -> X, with a transition between A and B, I need to be able to pop the transition config state so that B and C depend on the same (sole) build of X. Maybe this has been addressed in the meantime, it's been a while since I dealt with this use case.

fmeum commented 3 years ago

Sorry for the delayed response. I don't really have much to add beyond my previous message. Basically I need to be able to cancel a transition. So if I have A -> B -> X, and C -> X, with a transition between A and B, I need to be able to pop the transition config state so that B and C depend on the same (sole) build of X. Maybe this has been addressed in the meantime, it's been a while since I dealt with this use case.

As far as I can tell this may be possible to achieve today, but would definitely be cumbersome. Even if you reset the state after the transition with another transition, your two builds will differ by the value of the affected by starlark transition flag.

In order to get a single build, you would also have to build the previously untransitioned target through a chain of transitions that first change the same set of config settings and then reset them. Since Bazel composes subsequent transitions in a chain, this might only work of you separate them with a rule.

All of this would become much easier with the new transition logic mentioned above which would do away with the affected by starlark transition field.

bazelbuild / bazel

Output directory hash differs if the same build settings are applied via transition vs. command line #12171

Description of the problem / feature request:

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

What operating system are you running Bazel on?

What's the output of bazel info release?

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

rules_ios VS rules_apple transition deltas

Redundant builds of ios_application deps due to no transitions

What's the output of `bazel info release`?

If `bazel info release` returns "development version" or "(@non-git)", tell us how you built Bazel.

What's the output of `git remote get-url origin ; git rev-parse master ; git rev-parse HEAD` ?