bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Fatal error resolving a toolchain when using bzlmod #18567

Open cgrindel opened 1 year ago

cgrindel commented 1 year ago

Description of the bug:

Using Bazel 6.2.1, when executing bazel test //... in a child workspace using rules_bazel_integration_test, I see a fatal error (see below). Interestingly, when I just cd into the child workspace and run bazel test //... from the command line, it does not fail.

Fatal error:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.IllegalStateException: Value for: 'ToolchainContextKey{configurationKey=BuildConfigurationKey[8e16d2cfa11e2db6ed6cf6f2d7b88e102d4e16bf0b47fc5f72d75369ff9270dc], toolchainTypes=[ToolchainTypeRequirement{toolchainType=@rules_swift_tidy~override//swiftformat:toolchain, mandatory=true}], execConstraintLabels=[], forceExecutionPlatform=Optional.empty, debugTarget=false}' was missing, this should never happen
    at com.google.devtools.build.lib.bugreport.BugReport.sendBugReport(BugReport.java:182)
    at com.google.devtools.build.lib.bugreport.BugReport.logUnexpected(BugReport.java:153)
    at com.google.devtools.build.lib.skyframe.ConfiguredTargetFunction.computeUnloadedToolchainContexts(ConfiguredTargetFunction.java:736)
    at com.google.devtools.build.lib.skyframe.ConfiguredTargetFunction.computeUnloadedToolchainContexts(ConfiguredTargetFunction.java:642)
    at com.google.devtools.build.lib.skyframe.ConfiguredTargetFunction.compute(ConfiguredTargetFunction.java:296)
    at com.google.devtools.build.skyframe.ParallelEvaluator.bubbleErrorUp(ParallelEvaluator.java:427)
    at com.google.devtools.build.skyframe.ParallelEvaluator.waitForCompletionAndConstructResult(ParallelEvaluator.java:216)
    at com.google.devtools.build.skyframe.ParallelEvaluator.doMutatingEvaluation(ParallelEvaluator.java:182)
    at com.google.devtools.build.skyframe.ParallelEvaluator.eval(ParallelEvaluator.java:677)
    at com.google.devtools.build.skyframe.InMemoryMemoizingEvaluator.evaluate(InMemoryMemoizingEvaluator.java:203)
    at com.google.devtools.build.lib.skyframe.SkyframeExecutor.configureTargets(SkyframeExecutor.java:2217)
    at com.google.devtools.build.lib.skyframe.SkyframeBuildView.configureTargets(SkyframeBuildView.java:359)
    at com.google.devtools.build.lib.analysis.BuildView.update(BuildView.java:394)
    at com.google.devtools.build.lib.buildtool.AnalysisPhaseRunner.runAnalysisPhase(AnalysisPhaseRunner.java:233)
    at com.google.devtools.build.lib.buildtool.AnalysisPhaseRunner.execute(AnalysisPhaseRunner.java:139)
    at com.google.devtools.build.lib.buildtool.BuildTool.buildTargets(BuildTool.java:180)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:494)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:462)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.doTest(TestCommand.java:148)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.exec(TestCommand.java:113)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:625)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:240)
    at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
    at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
    at io.grpc.Context$1.run(Context.java:566)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Using Bazel 7.0.0-pre.20230517.4, it still fails, but with a different error:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.IllegalStateException: Unexpected exception: dep Dependency{label=//:swiftformat_fmt_main.swift, configuration=0747b6e24c727cd674aa08cd016281faf72dd24df799fa11a8640fa2ea8ed968, aspects=AspectCollection{[]}, transitionKeys=[], executionPlatformLabel=null} had null value, even though there were no values missing in the initial fetch. That means it had an unexpected exception type (not ConfiguredValueCreationException)
    at com.google.devtools.build.lib.bugreport.BugReport.sendBugReport(BugReport.java:183)
    at com.google.devtools.build.lib.bugreport.BugReport.logUnexpected(BugReport.java:154)
    at com.google.devtools.build.lib.skyframe.PrerequisiteProducer.resolveConfiguredTargetDependencies(PrerequisiteProducer.java:947)
    at com.google.devtools.build.lib.skyframe.PrerequisiteProducer.computeDependencies(PrerequisiteProducer.java:735)
    at com.google.devtools.build.lib.skyframe.PrerequisiteProducer.evaluate(PrerequisiteProducer.java:348)
    at com.google.devtools.build.lib.skyframe.ConfiguredTargetFunction.compute(ConfiguredTargetFunction.java:202)
    at com.google.devtools.build.skyframe.ParallelEvaluator.bubbleErrorUp(ParallelEvaluator.java:423)
    at com.google.devtools.build.skyframe.ParallelEvaluator.waitForCompletionAndConstructResult(ParallelEvaluator.java:212)
    at com.google.devtools.build.skyframe.ParallelEvaluator.doMutatingEvaluation(ParallelEvaluator.java:178)
    at com.google.devtools.build.skyframe.ParallelEvaluator.eval(ParallelEvaluator.java:676)
    at com.google.devtools.build.skyframe.InMemoryMemoizingEvaluator.evaluate(InMemoryMemoizingEvaluator.java:177)
    at com.google.devtools.build.lib.skyframe.SkyframeExecutor.configureTargets(SkyframeExecutor.java:2306)
    at com.google.devtools.build.lib.skyframe.SkyframeBuildView.configureTargets(SkyframeBuildView.java:344)
    at com.google.devtools.build.lib.analysis.BuildView.update(BuildView.java:445)
    at com.google.devtools.build.lib.buildtool.AnalysisPhaseRunner.runAnalysisPhase(AnalysisPhaseRunner.java:247)
    at com.google.devtools.build.lib.buildtool.AnalysisPhaseRunner.execute(AnalysisPhaseRunner.java:140)
    at com.google.devtools.build.lib.buildtool.BuildTool.buildTargets(BuildTool.java:178)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:503)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:471)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.doTest(TestCommand.java:148)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.exec(TestCommand.java:113)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:625)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:240)
    at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
    at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
    at io.grpc.Context$1.run(Context.java:566)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Typically, when a build/test fails using rules_bazel_integration_test, it is related to a missing environment variable or configuration. Are there any external configuration values that might affect toolchain resolution?

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Steps

  1. Clone https://github.com/cgrindel/rules_swiftformat.
  2. Check out the toolchain_fatal_error_repro branch: git checkout toolchain_fatal_error_repro
  3. Run bazel test //examples:simple_test.

The test will fail with the fatal error.

Which operating system are you running Bazel on?

macOS Ventura 13.3.1

What is the output of bazel info release?

release 6.2.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

NA

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

NA

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

cgrindel commented 1 year ago

cc: @katre

cgrindel commented 1 year ago

While trying to debug this issue, I have tried a couple of things.

First, I commented out the register_toolchains(...) in the parent workspace's MODULE.bazel. This produces the following expected error:

ERROR: /private/var/tmp/_bazel_chuck/b730909ecda4fec524e6c3133b58bfa9/execroot/_main/bazel-out/darwin-fastbuild/bin/examples/simple_test.runfiles/_main/examples/simple/BUILD.bazel:42:16: While resolving toolchains for target //:swiftformat_fmt_main.swift: No matching toolchains found for types @rules_swift_tidy~override//swiftformat:toolchain.

Next, instead of calling register_toolchains() in the parent workspace, I call it in the child workspace. This produces the same fatal error described in this issue.

rickeylev commented 1 year ago

I ran into this today, too. My error is similar to the first posted error about the toolchain key (except mine is about the Python toolchain, since that's what I'm working with).

From my debugging, I'm not sure if this is toolchain-specific or more generally bzlmod. So cc @Wyverald.

To me, combined with the different error posted for Bazel 7, this seems to point to a bad interaction between bzlmod and bazel-in-bazel testing.

Note 1: In startup_options.cc, there's special logic that uses TEST_TMPDIR as the indicator that it's a bazel-in-bazel test, and then purposefully sets the output root to that. So unsetting TEST_TMPDIR might work, but I'm not really sure it's entirely kosher. There's probably a reason bazel is trying to detect itself running within itself (a log message mentions idle time, so maybe just about server lifetime?).

Note 2: There are quite a few env vars set in a test invocation. Another thought I had was that maybe TEST_TMPDIR is just acting as the bazel-in-bazel marker, and it's actually another env var that has a value bzlmod doesn't like? I dunno; I didn't exhaustively go through all the env vars.
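
As a rough illustration of the behavior described in Note 1, the Java sketch below picks an output root based on whether TEST_TMPDIR is set. It is not Bazel's code: the real logic lives in the C++ file startup_options.cc, and the class name, method names, and default path here are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical sketch only; Bazel's real startup logic is written in C++.
final class OutputRootSketch {
  /** Returns true if the environment contains a non-empty TEST_TMPDIR. */
  static boolean looksLikeBazelInBazel(Map<String, String> env) {
    String testTmpdir = env.get("TEST_TMPDIR");
    return testTmpdir != null && !testTmpdir.isEmpty();
  }

  /** Uses TEST_TMPDIR as the output root when it is set, else the given default. */
  static Path chooseOutputRoot(Map<String, String> env, Path defaultRoot) {
    if (looksLikeBazelInBazel(env)) {
      return Paths.get(env.get("TEST_TMPDIR"));
    }
    return defaultRoot;
  }

  public static void main(String[] args) {
    // Assumed default location, chosen purely for illustration.
    Path defaultRoot = Paths.get(System.getProperty("user.home"), ".cache", "bazel");
    System.out.println(chooseOutputRoot(System.getenv(), defaultRoot));
  }
}
```

Run from a normal shell, this prints the assumed default; run inside a Bazel test action, where TEST_TMPDIR is set, it prints that directory instead.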

rickeylev commented 1 year ago

katre asked me to repro this, but I no longer can. Super weird. @cgrindel are you able to repro still?

cgrindel commented 1 year ago

I am able to reproduce this on my macOS laptop. I will try to do so on an Ubuntu VM.

cgrindel commented 1 year ago

I am able to reproduce this on an Ubuntu VM, as well.

@katre What would be the best way to help you reproduce this? I can try to figure out how to let you access the Ubuntu VM that is running in GCE. What do you think?

katre commented 1 year ago

Yes, let's try that. Thank you!

katre commented 1 year ago

Okay, with some help I can now debug this.

The underlying issue is that something is causing ExternalDepsException to be thrown during ConfiguredTargetFunction evaluation, but it's not being handled, which leads to this sort of crash.

So, open questions:

  1. What's throwing that exception, and where should it have been handled?
  2. Why is that exception being thrown in this case, anyway?

I suspect that once 1 is answered, 2 will be easier to answer.

Wyverald commented 1 year ago

Hmm. All sorts of stuff in Bzlmod throws that exception, which is probably not the right thing to do, but there was never a primer on how to do Skyframe exceptions when we started, so we kind of just winged it...

If it's being thrown from CTF, my guess is it came in via toolchain resolution. Could you check if either RegisteredToolchainsFunction or RegisteredExecutionPlatformsFunction is throwing that?

katre commented 1 year ago

See #18629 for my initial audit on what needs to be fixed.

cgrindel commented 1 year ago

@katre Thanks for digging into this!

haxorz commented 1 year ago

"Interestingly, when I just cd into the child workspace and run bazel test //... from the command line, it does not fail."

That's because of the use of BugReport#logUnexpected at https://github.com/bazelbuild/bazel/blob/06992d2d0da3ab4628028cecbb3f3dc4965f9e88/src/main/java/com/google/devtools/build/lib/skyframe/PrerequisiteProducer.java#L877 (see the Javadoc and implementation).

"There's probably a reason bazel is trying to detect itself running within itself (a log message mentions idle time, so maybe just about server lifetime?)."

Another category of reason is: trying to detect that it's being run in an [integration] test so it can make stronger assertions. In the specific example above, it's upgrading a warning-level debug log line to a crash.
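
The "log in production, crash under test" pattern described here can be sketched as follows. This is a hypothetical stand-in, not Bazel's BugReport class: the class name, the boolean parameter, and the use of TEST_TMPDIR as the test marker are assumptions made for illustration.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the pattern; not Bazel's BugReport implementation.
final class UnexpectedLogger {
  private static final Logger logger = Logger.getLogger(UnexpectedLogger.class.getName());

  /**
   * Reports a "should never happen" condition. Under test it throws, so the
   * problem cannot be missed; otherwise it only logs a warning and returns.
   */
  static void logUnexpected(boolean runningInTest, String message) {
    if (runningInTest) {
      throw new IllegalStateException(message);
    }
    logger.warning(message);
  }

  public static void main(String[] args) {
    // Assumption: the presence of TEST_TMPDIR marks a test environment.
    boolean inTest = System.getenv("TEST_TMPDIR") != null;
    logUnexpected(inTest, "Value was missing, this should never happen");
  }
}
```

Under that assumption, the same call site would only log when Bazel is run directly in the child workspace but would crash when run through an integration test, which matches the difference reported at the top of this issue.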

katre commented 1 year ago

Still dealing with bzlmod error handling, but here's the underlying error that's getting lost in this failure:

Error loading '@rules_swift_tidy~override//swiftformat:extensions.bzl' for module extensions, requested by /usr/local/google/home/jcater/.cache/bazel/_bazel_jcater/8d79a2ce3b6733649bb72a30e4b9639b/execroot/_main/bazel-out/k8-fastbuild/bin/examples/simple_test.runfiles/_main/examples/simple/MODULE.bazel:22:33: at /usr/local/google/home/jcater/.cache/bazel/_bazel_jcater/8d79a2ce3b6733649bb72a30e4b9639b/execroot/_main/_tmp/0d74cd122ff726b298f6134e0fe5e1d0/_bazel_jcater/374a05da1afc2fe9d20700880e5fe999/external/rules_swift_tidy~override/swiftformat/extensions.bzl:4:5: Label '@rules_swift_tidy~override//swiftformat/bzlmod:swift_tidy_tools.bzl' is invalid because 'swiftformat/bzlmod' is not a package; perhaps you meant to put the colon here: '@rules_swift_tidy~override//swiftformat:bzlmod/swift_tidy_tools.bzl'?: at /usr/local/google/home/jcater/.cache/bazel/_bazel_jcater/8d79a2ce3b6733649bb72a30e4b9639b/execroot/_main/_tmp/0d74cd122ff726b298f6134e0fe5e1d0/_bazel_jcater/374a05da1afc2fe9d20700880e5fe999/external/rules_swift_tidy~override/swiftformat/extensions.bzl:4:5: Label '@rules_swift_tidy~override//swiftformat/bzlmod:swift_tidy_tools.bzl' is invalid because 'swiftformat/bzlmod' is not a package; perhaps you meant to put the colon here: '@rules_swift_tidy~override//swiftformat:bzlmod/swift_tidy_tools.bzl'?
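
To make the hint in that message concrete, here is a small Java sketch of how such a "move the colon" suggestion could be derived. It is only an assumption-laden illustration: Bazel decides what is a package by looking for BUILD files on disk, whereas this sketch takes a hypothetical set of known package names, and the class and method names are invented.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch; Bazel's real label and package lookup works differently.
final class LabelSuggestionSketch {
  /**
   * Moves trailing package segments into the target name until the package
   * is one of knownPackages. Returns null if no enclosing package is known
   * or the label is not of the form [@repo]//package:target.
   */
  static String suggest(String label, Set<String> knownPackages) {
    int slashes = label.indexOf("//");
    int colon = label.lastIndexOf(':');
    if (slashes < 0 || colon < slashes + 2) {
      return null;
    }
    String repo = label.substring(0, slashes);
    String pkg = label.substring(slashes + 2, colon);
    String target = label.substring(colon + 1);
    while (!knownPackages.contains(pkg)) {
      int cut = pkg.lastIndexOf('/');
      if (cut < 0) {
        return null;
      }
      // Shift the last package segment into the target path.
      target = pkg.substring(cut + 1) + "/" + target;
      pkg = pkg.substring(0, cut);
    }
    return repo + "//" + pkg + ":" + target;
  }

  public static void main(String[] args) {
    // Assumption: only 'swiftformat' is visible as a package in the test sandbox.
    Set<String> known = Collections.singleton("swiftformat");
    System.out.println(
        suggest("@rules_swift_tidy~override//swiftformat/bzlmod:swift_tidy_tools.bzl", known));
  }
}
```

If 'swiftformat/bzlmod' is a real package in the parent workspace, the message suggests that its BUILD file is not visible inside the integration test's copy of the repository, though the thread does not confirm that.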

fmeum commented 9 months ago

@cgrindel Is there any way I can run the integration test with a custom local build of Bazel? I have a theory that I would like to verify :-)

cgrindel commented 9 months ago

@fmeum: @katre added support for Bazelisk specifications. Can you put it in a fork such that Bazelisk can find it?

If that is not doable, let me know. We can add a scheme to specify a local Bazel binary.

fmeum commented 9 months ago

@cgrindel I think I know how to set up the fork, but it looks like the integration tests expect an installer. How can I build one?

Error in download_and_extract: java.io.IOException: Error downloading [https://releases.bazel.build/fmeum/7.0.0/release/bazel-fmeum/7.0.0-installer-linux-x86_64.sh, https://github.com/bazelbuild/bazel/releases/download/fmeum/7.0.0/bazel-fmeum/7.0.0-installer-linux-x86_64.sh] to /home/fhenneke/.cache/bazel/_bazel_fhenneke/6130380623874795b8c55bd6c1e1452c/external/rules_bazel_integration_test~0.14.1~bazel_binaries~build_bazel_bazel_.bazelversion/temp909679435908492539/7.0.0-installer-linux-x86_64.sh.zip: GET returned 404 Not Found

cgrindel commented 9 months ago

I am not sure how to create an installer. I have implemented an extension for rules_bazel_integration_test that will allow you to use a local Bazel binary for an integration test. I will clean it up and put it up for review in the morning.

fmeum commented 9 months ago

I found out how to build an installer:

However, when I put fmeum/7.0.0 in .bazelversion, the Bazel binary is downloaded from [https://releases.bazel.build/fmeum/7.0.0/release/bazel-fmeum/7.0.0-installer-linux-x86_64.sh, https://github.com/bazelbuild/bazel/releases/download/fmeum/7.0.0/bazel-fmeum/7.0.0-installer-linux-x86_64.sh]. The latter link has a format that I don't think can be realized with GitHub; it should point to https://github.com/fmeum/bazel/releases/download/7.0.0/7.0.0-installer-linux-x86_64.sh instead.

cgrindel commented 9 months ago

@fmeum I am not sure about the installer. However, here is a PR that adds support for local Bazel binaries to rules_bazel_integration_test. With this you should be able to just point to a local Bazel binary and run a test.

fmeum commented 9 months ago

@cgrindel Thanks, this is very helpful.

With current master, I now get this error:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.IllegalStateException: Unexpected analysis error: ConfiguredTargetKey{label=//Foo:swiftformat_fmt_Message.swift, config=BuildConfigurationKey[1dc064e57578a93a5aa5103d8e149bf56cdc28cbb0bc08a87f70cf502281addf]} -> ErrorInfo{exception=com.google.devtools.build.lib.bazel.bzlmod.ExternalDepsException: Error loading '@rules_swift_tidy~override//swiftformat:extensions.bzl' for module extensions, requested by /home/fhenneke/.cache/bazel/_bazel_fhenneke/6130380623874795b8c55bd6c1e1452c/sandbox/linux-sandbox/3/execroot/_main/bazel-out/k8-fastbuild/bin/examples/simple_test.runfiles/_main/examples/simple/MODULE.bazel:22:33: at /home/fhenneke/.cache/bazel/_bazel_fhenneke/6130380623874795b8c55bd6c1e1452c/sandbox/linux-sandbox/3/execroot/_main/_tmp/0d74cd122ff726b298f6134e0fe5e1d0/_bazel_fhenneke/96c7cecf5c9472312cb722e979dec6c6/external/rules_swift_tidy~override/swiftformat/extensions.bzl:4:5: Label '@rules_swift_tidy~override//swiftformat/bzlmod:swift_tidy_tools.bzl' is invalid because 'swiftformat/bzlmod' is not a package; perhaps you meant to put the colon here: '@rules_swift_tidy~override//swiftformat:bzlmod/swift_tidy_tools.bzl'?: at /home/fhenneke/.cache/bazel/_bazel_fhenneke/6130380623874795b8c55bd6c1e1452c/sandbox/linux-sandbox/3/execroot/_main/_tmp/0d74cd122ff726b298f6134e0fe5e1d0/_bazel_fhenneke/96c7cecf5c9472312cb722e979dec6c6/external/rules_swift_tidy~override/swiftformat/extensions.bzl:4:5: Label '@rules_swift_tidy~override//swiftformat/bzlmod:swift_tidy_tools.bzl' is invalid because 'swiftformat/bzlmod' is not a package; perhaps you meant to put the colon here: '@rules_swift_tidy~override//swiftformat:bzlmod/swift_tidy_tools.bzl'?, cycles=[], isCatastrophic=false, isDirectlyTransient=false, isTransitivelyTransient=false}, ([ConfiguredTargetKey{label=//Foo:swiftformat_fmt_Message.swift, config=BuildConfigurationKey[1dc064e57578a93a5aa5103d8e149bf56cdc28cbb0bc08a87f70cf502281addf]}])
    at com.google.devtools.build.lib.bugreport.BugReport.sendBugReport(BugReport.java:196)
    at com.google.devtools.build.lib.bugreport.BugReport.logUnexpected(BugReport.java:167)
    at com.google.devtools.build.lib.skyframe.SkyframeErrorProcessor.logUnexpectedException(SkyframeErrorProcessor.java:766)
    at com.google.devtools.build.lib.skyframe.SkyframeErrorProcessor.logUnexpectedExceptionOrigin(SkyframeErrorProcessor.java:761)
    at com.google.devtools.build.lib.skyframe.SkyframeErrorProcessor.assertValidAnalysisOrExecutionException(SkyframeErrorProcessor.java:717)
    at com.google.devtools.build.lib.skyframe.SkyframeErrorProcessor.processErrors(SkyframeErrorProcessor.java:269)
    at com.google.devtools.build.lib.skyframe.SkyframeBuildView.analyzeAndExecuteTargets(SkyframeBuildView.java:777)
    at com.google.devtools.build.lib.analysis.BuildView.update(BuildView.java:291)
    at com.google.devtools.build.lib.buildtool.AnalysisAndExecutionPhaseRunner.runAnalysisAndExecutionPhase(AnalysisAndExecutionPhaseRunner.java:241)
    at com.google.devtools.build.lib.buildtool.AnalysisAndExecutionPhaseRunner.execute(AnalysisAndExecutionPhaseRunner.java:139)
    at com.google.devtools.build.lib.buildtool.BuildTool.buildTargetsWithMergedAnalysisExecution(BuildTool.java:305)
    at com.google.devtools.build.lib.buildtool.BuildTool.buildTargets(BuildTool.java:173)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:511)
    at com.google.devtools.build.lib.buildtool.BuildTool.processRequest(BuildTool.java:479)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.doTest(TestCommand.java:163)
    at com.google.devtools.build.lib.runtime.commands.TestCommand.exec(TestCommand.java:116)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:664)
    at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:244)
    at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
    at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:621)
    at io.grpc.Context$1.run(Context.java:566)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

This does make sense: the ExternalDepsException bubbles up to the top-level target that transitively requests the failing module extension and then hits this check. It looks like either ExternalDepsException has to become another recognized and handled root cause of a failure, or it needs to be wrapped in some more generic exception type in, e.g., RepositoryDelegatorFunction. @katre Do you have any recommendation on the right approach?

katre commented 9 months ago

Either some intermediate skyfunction (I don't know which one because I don't understand the skykey chain involved) needs to convert the ExternalDepsException to a DependencyEvaluationException, so it can be handled in DependencyResolver, or a new catch for the ExternalDepsException should be added in that block.

I prefer the former: it makes sense to me that failure to evaluate external dependencies is a subset of problems during dependency evaluation.

(In fact, why is ExternalDepsException distinct from DependencyEvaluationException? Or at least a subclass?)
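
The first option (converting the exception in an intermediate step) could look roughly like the sketch below. Everything in it is a hypothetical stand-in: the two exception classes only mimic the names discussed above and do not reflect the constructors or hierarchy of Bazel's real ExternalDepsException and DependencyEvaluationException, and the three methods stand in for SkyFunctions whose identity is still an open question in this thread.

```java
// Hypothetical stand-ins; Bazel's real exception classes differ.
class ExternalDepsExceptionSketch extends Exception {
  ExternalDepsExceptionSketch(String message) {
    super(message);
  }
}

class DependencyEvaluationExceptionSketch extends Exception {
  DependencyEvaluationExceptionSketch(String message, Throwable cause) {
    super(message, cause);
  }
}

final class WrappingSketch {
  /** Stands in for a low-level step that fails while loading a module extension. */
  static String loadModuleExtension(String label) throws ExternalDepsExceptionSketch {
    throw new ExternalDepsExceptionSketch(
        "Error loading '" + label + "' for module extensions");
  }

  /** Intermediate step: converts the failure into the type the caller handles. */
  static String resolveToolchains(String label) throws DependencyEvaluationExceptionSketch {
    try {
      return loadModuleExtension(label);
    } catch (ExternalDepsExceptionSketch e) {
      // Keep the original exception as the cause so the root error is not lost.
      throw new DependencyEvaluationExceptionSketch(e.getMessage(), e);
    }
  }

  /** Top level: only knows how to report DependencyEvaluationExceptionSketch. */
  static String analyze(String label) {
    try {
      resolveToolchains(label);
      return "OK";
    } catch (DependencyEvaluationExceptionSketch e) {
      return "ERROR: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(analyze("@rules_swift_tidy~override//swiftformat:extensions.bzl"));
  }
}
```

With the conversion in place, the top level reports an ordinary error carrying the underlying message instead of treating the failure as an unexpected exception type. The alternative, adding a dedicated catch at the top level, would avoid the wrapper but requires every such handler to know about the external-deps type.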