bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.08k stars 4.04k forks source link

Build graph invalidation with custom flags from external repositories #17853

Open layus opened 1 year ago

layus commented 1 year ago

Description of the bug:

To fix issues with rules_docker transitions leaking into our own transitions and multiplicating the number of different configurations, we have added build --@io_bazel_rules_docker//transitions:enable=no to our .bazelrc, as per https://github.com/bazelbuild/rules_docker/pull/2068

That works, but introduced a subtle and hard to track issue where running some bazel commands like cquery, dump or config would in some cases invalidate large parts of the daemon's build graph. I started investigating several months ago, and saw a breakthrough only recently.

It seems that these less frequently used commands have trouble taking that value into account. There is also some history with workspace label prefixes in custom build flags. See in general https://github.com/bazelbuild/bazel/issues/11128 and https://github.com/bazelbuild/bazel/issues/9177 for historical, somewhat related issues.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

This works for me, be I still have to provide a minimal repo where it works:

echo "build --@io_bazel_rules_docker//transitions:enable=no" >> .bazelrc
# Get a proper build graph in cache
bazel build //...
# List current configurations
bazel config
# Invalidate the cache with a command that does no understand/use the above flag
bazel info output_path
# Observe that the set of cached configurations has changed
bazel config

Which operating system are you running Bazel on?

Ubuntu Linux 20.04 (+nix)

What is the output of bazel info release?

release 6.0.0- (@non-git)

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Its bazel 6.0.0 from nixpkgs.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

See description above

Any other information, logs, or outputs that you want to share?

No response

layus commented 1 year ago

Took time to make a proper reproduction procedure: https://gist.github.com/layus/57fd0aeba78a9a67e8e684379f829e4d. I discovered several thing doing so:

  1. Bazel 5 had a warning when running bazel info: WARNING: info command does not support starlark options. Ignoring options: [--@io_bazel_rules_docker//transitions:enable=no]. That warning is gone in bazel 6, but that specific option is still ignored. My guess is because it references an external repo, but that remains to be proven.
  2. On that small example, the graph is not immediately pruned. I suspect this happens on our monorepo due to it's size (600k actions on a full build). Is there some gc of the graph happening above a memory threshold ?
  3. This makes the script simpler, as previous configs are not garbage collected. I am even able to compare them with bazel config $bad $good. We see that the only difference is said flag which is incorrectly omitted when calling bazel info : https://gist.github.com/layus/57fd0aeba78a9a67e8e684379f829e4d#file-output-log-L72-L76

    Displaying diff between configs 35b359 and fbc5f0
    FragmentOptions user-defined {
      @io_bazel_rules_docker//transitions:enable: null, false
    }
layus commented 1 year ago

This in turns hints at more related issues:

https://github.com/bazelbuild/bazel/issues/13473, where this comment https://github.com/bazelbuild/bazel/issues/13473#issuecomment-1324274683 happens to be incorrect:

[...] The info and clean commands no longer warn, and starlark flags have no effect on these commands. We discover that while starlark flags have no effects on the output of these commands, running these commands has an impact on the daemon state. The incomplete configuration that these commands generate for their own use bump the internal graph version, and allow correct and needed nodes to be garbage collected.

This happens to be an issue because it discards useful state from the daemon. Regenerating that data completely takes more than one minute. bazel info (and bazel {c,}query to some extent) should not have that effect. Outside of the inefficiency, it is extremely surprising to have a sequence like bazel cquery; bazel info; bazel config <some config just printed> fail because the config is not present anymore in the cache.

Data is not retained for as long as naively expected.

layus commented 1 year ago

I have a rough fix that "works for me" in https://github.com/bazelbuild/bazel/pull/17875. Mostly POC and WIP and untested in the parsing error branch, but usable and functional.

aranguyen commented 1 year ago

@layus as you have discovered already in https://github.com/bazelbuild/bazel/pull/17875, commands like info and clean do not take starlark options into consideration. The reasoning behind this current restriction is because The info and clean commands have builds=true set in their annotation but don't actually do any building and starlark option loading might not work in these cases. The configuratbility team no longer owns this part of the code base so I'm routing this to the appropriate team for their input.

gregestren commented 1 year ago

Regenerating that data completely takes more than one minute

Which data specifically do you mean?

I see the effect of bazel config producing different results. But I'm not sure the actual build graph itself gets invalidated because of this.

One other debugging tool to look at this is bazel dump --skyrame=count: this literaly dumps the state of the whole build graph on any call.

Also, I'd expect cquery to be completely consistent with build. Are you seeing otherwise?

layus commented 1 year ago

I am not able to dig that issue right now, probably in the coming days or weeks. AFAIR the issue is that BASELINE_CONFIGURATION is changed, and that "invalidates" the whole graph. Restoring it to the normal, original value on the next build then needs to check the validity of all nodes. This basically means a traversal of all the graph nodes.