Good writeup, thank you. This is blocking our upgrade to Bazel 7.0.0. It also occurs with these flags set:
coverage --experimental_fetch_all_coverage_outputs
coverage --experimental_split_coverage_postprocessing
which we set because we require --nobuild_runfile_links.
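For context, the same flags spelled out on a single command line (the target label is a placeholder; in practice we set these in a .bazelrc, so treat this as a sketch rather than our exact setup):

```sh
# Sketch only: the flags from our .bazelrc passed explicitly (placeholder target).
bazel coverage //some/pkg:some_test \
  --nobuild_runfile_links \
  --experimental_fetch_all_coverage_outputs \
  --experimental_split_coverage_postprocessing
```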
@anhlinh123 @sputt If you run with --verbose_failures, are you able to get the stack trace leading up to the "I/O exception during sandboxed execution" error? That would help narrow down the issue.
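For example (the target label here is just a placeholder for one of the affected tests):

```sh
# Hypothetical invocation; substitute a failing coverage target.
bazel coverage //path/to:affected_test --verbose_failures
```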
We ran into this issue with py_test too. During the migration to Bazel 7.0.0, running coverage with py_test produces this error:
I/O exception during sandboxed execution: Input is a directory: bazel-out/aarch64-fastbuild/testlogs/tests/ci/release_stable/release_stable_lint_flake8_test/_coverage
Our coverage options:
test --experimental_fetch_all_coverage_outputs
test --remote_download_minimal
test --experimental_split_coverage_postprocessing
coverage --combined_report=lcov
@bazel-io flag
@iancha1992 This looks like another 7.0.0 regression, just want to make sure it's tracked.
@fmeum @Wyverald @meteorcloudy Should this be included in the 7.0.1 patch, or is it okay to go straight to 7.1.0?
cc: @bazelbuild/triage
Since this is a regression in 7.0.0, ideally we should include a fix in 7.0.1. But that depends on the timeline. @tjgq do you have an estimate for how long the fix might take?
@bazel-io fork 7.0.1
@bazel-io fork 7.1.0
@nlou9 Same request as above: can you please run with --verbose_failures and post the full stack trace here?
Are you setting --experimental_split_coverage_postprocessing by any chance?
If so, I suspect that the root cause is the same as the coverage failure in #20753, and I have a fix for it that is in the process of being submitted (although CI is not really a happy camper right now, so it will take a bit of time to materialize at HEAD).
@lberki We are using --experimental_split_coverage_postprocessing, as I mentioned above.
@tjgq, here is the log. The bazel coverage //... job failed in 02:17:
2024/01/09 02:29:07 Downloading https://releases.bazel.build/7.0.0/release/bazel-7.0.0-linux-arm64...
2024/01/09 02:29:07 Skipping basic authentication for releases.bazel.build because no credentials found in /home/semaphore/.netrc
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
(02:29:13) INFO: Invocation ID: d0d3983c-0fcd-4822-974a-8cda4a8476d5
(02:29:13) INFO: Options provided by the client:
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
Inherited 'common' options:
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
Inherited 'build' options: --remote_retries=2 --remote_timeout=7200 --java_runtime_version=remotejdk_11 --enable_bzlmod=false
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
Inherited 'build' options: --announce_rc --show_timestamps --show_progress_rate_limit=60 --curses=no --remote_download_minimal
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
Inherited 'test' options: --test_output=errors --test_timeout=-1,-1,-1,7200
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/.bazelrc:
Inherited 'test' options: --test_output=errors --experimental_fetch_all_coverage_outputs --experimental_split_coverage_postprocessing
(02:29:13) INFO: Reading rc options for 'coverage' from /home/semaphore/ci-tools/.bazelrc:
'coverage' options: --combined_report=lcov --verbose_failures --instrumentation_filter=^// --instrumentation_filter=^//confluent[/:],-.*(test|tests|lint)
(02:29:13) INFO: Current date is 2024-01-09
(02:29:13) Computing main repo mapping:
(02:29:18) DEBUG: /home/semaphore/.cache/bazel/_bazel_semaphore/c2e391e2c21d1440479c3b26017f505f/external/rules_python/python/pip.bzl:47:10: pip_install is deprecated. Please switch to pip_parse. pip_install will be removed in a future release.
(02:29:19) Loading:
(02:29:19) Loading: 0 packages loaded
(02:29:20) Analyzing: 324 targets (26 packages loaded, 0 targets configured)
(02:29:20) Analyzing: 324 targets (26 packages loaded, 0 targets configured)
[0 / 1] checking cached actions
(02:30:21) Analyzing: 324 targets (199 packages loaded, 10070 targets configured)
[1 / 2] checking cached actions
(02:30:32) INFO: Analyzed 324 targets (219 packages loaded, 11036 targets configured).
(02:31:21) [622 / 741] Creating runfiles tree bazel-out/aarch64-fastbuild/bin/tests/ci/release_stable/release/test_builder_first_rc_from_repo.runfiles; 45s local ... (112 actions, 93 running)
(02:31:23) ERROR: /home/semaphore/ci-tools/tests/ci/scripts/BUILD.bazel:44:8: Testing //tests/ci/scripts:test_update_version_integration_lint_flake8_test failed: I/O exception during sandboxed execution: Input is a directory: bazel-out/aarch64-fastbuild/testlogs/tests/ci/scripts/test_update_version_integration_lint_flake8_test/_coverage
(02:31:23) INFO: Elapsed time: 135.863s, Critical Path: 47.72s
Then I believe https://github.com/iancha1992/bazel/commit/9463a4f50981785da493097e1197799c525b6278 is the fix (it'll be in both 7.0.1 and 7.1.0).
@lberki Wondering what the ETA for 7.0.1 or 7.1.0 is?
I think the idea is that 7.0.1 should be out sometime next week at the latest (don't take this as a promise until @meteorcloudy confirms).
Also, @tjgq suspects that there is another bug lurking in the deep that could cause your build to fail like this and he was working on confirming or denying that. My guess is that the above change is enough, but that's just a guess and an uninformed one.
If we have confirmation that the repro requires --experimental_split_coverage_postprocessing, I believe https://github.com/bazelbuild/bazel/commit/b0db044227d62178a7e578b8e03c452d8c17af33 will fix it. (Otherwise, if it's supposed to repro without --experimental_split_coverage_postprocessing, I couldn't do so following the instructions above.)
Thanks, your mention of AbstractActionInputPrefetcher scared me, but if that's not involved in this breakage, I'm relieved.
I see this on python tests as well. This issue blocks my ability to upgrade Bazel.
@tjgq Sorry for the late reply. This is the error message when using --verbose_failures:
ERROR: /<path_to_the_test>/BUILD:50:8: Testing //<path>:UnitTest failed: I/O exception during sandboxed execution: 11 errors during bulk transfer:
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
com.google.devtools.build.lib.actions.DigestOfDirectoryException: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path>/UnitTest/_coverage
I don't think there will be a stack trace, as there is a function consuming all exceptions. I don't remember what it is since it's been a while. Let me try to recall.
Ok, so it all started with this function: https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L82 which ultimately swallows all IOExceptions: https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L142
But the critical point seems to be here: https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/remote/AbstractActionInputPrefetcher.java#L418 where it tries to fetch the metadata of a tree artifact. I'm not sure why it does that. But at that point, the tree artifact is a directory (I guess because the contents of the directory were already evicted from the remote cache), and that breaks this code: https://github.com/bazelbuild/bazel/blob/7.0.0/src/main/java/com/google/devtools/build/lib/exec/SingleBuildFileCache.java#L78
My repo also uses --experimental_split_coverage_postprocessing. But I believe it doesn't need that flag to reproduce the error (if the flag is disabled by default).
It can easily be reproduced locally like this (see the consolidated shell sketch after the steps):
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
1. Use disk_cache instead of remote_cache to easily manipulate the cache (add --disk_cache=<path_to_local_cache> to the command line).
2. Run a cc_test: bazel coverage <test> --disk_cache=<cache> --remote_upload_local_results=true --execution_log_json_file=<file_name>.
3. Open the JSON file and find the action that has "commandArgs": ["external/bazel_tools/tools/test/collect_coverage.sh"].
4. Find the hash of the action output (near the end of the action object).
5. Delete the file corresponding to the hash value in the cache directory.
6. Rerun the test.
7. The test fails with the error: I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path_to_the_test_dir>/_coverage.
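For convenience, here is a rough shell sketch of those steps. The target label, cache path, and file names are placeholders, and the jq filter assumes the field names shown above from the JSON execution log (commandArgs, actualOutputs, digest.hash); treat it as a sketch rather than a verified script.

```sh
# Steps 1-2: run coverage once with a disk cache and an execution log (placeholder paths).
bazel coverage //some/pkg:some_cc_test \
  --disk_cache="$HOME/bazel-disk-cache" \
  --remote_upload_local_results=true \
  --execution_log_json_file=exec_log.json

# Steps 3-4: print the output digests of the collect_coverage.sh action
# (field names assumed from the JSON execution log entries).
jq -r 'select(((.commandArgs // [])[0] // "") | endswith("collect_coverage.sh"))
       | .actualOutputs[]?.digest.hash' exec_log.json

# Step 5: delete the matching blob from the disk cache; searching by file name
# avoids assuming the exact cache directory layout.
find "$HOME/bazel-disk-cache" -name "<hash_printed_above>" -delete

# Steps 6-7: rerun; with Bazel 7.0.0 this fails with
# "I/O exception during sandboxed execution: Input is a directory: .../_coverage".
bazel coverage //some/pkg:some_cc_test --disk_cache="$HOME/bazel-disk-cache"
```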
@anhlinh123 Do you mind giving https://github.com/bazelbuild/bazel/commit/b0db044227d62178a7e578b8e03c452d8c17af33 a try? (The simplest way is to use Bazelisk with USE_BAZEL_VERSION=last_green.) That would tell us whether there's indeed a separate issue that that commit didn't fix.
Otherwise, I think there might be something missing from your repro steps. You are building once, deleting a file from the disk cache, then rebuilding incrementally. I'd thus expect the incremental rebuild to be a no-op, since the output tree hasn't been touched and Bazel can tell it's up to date (and that's also what I'm seeing experimentally). Is the second build a clean build? Are there any other flags in a .bazelrc? (Use --announce_rc to print all of the flags Bazel is using.)
(It's USE_BAZEL_VERSION=last_green, not USE_BAZEL_VERSION=latest. I've amended the previous comment.)
@tjgq This is what I've found so far: --experimental_split_coverage_postprocessing actually matters. Turning it off did fix the error.
That's great to hear, thanks. I will close this issue, since the fix has also been cherry-picked into the 7.0.1 branch (in https://github.com/bazelbuild/bazel/pull/20819).
@tjgq thank you for your support!
@tjgq Version 7.0.1rc1 has a weird bug related to this. Turning --experimental_split_coverage_postprocessing on always breaks coverage:
ERROR: <path>/test/BUILD:1:8: Testing //:test failed: I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/test/_coverage
The last_green version doesn't have it.
7.0.1rc1 was cut before the fix made it into the 7.0.1 branch. Can you try USE_BAZEL_VERSION=f4da34dcfe7b83388e3d963f35581a4fe710fc14 (the current tip of the branch)?
@tjgq It works!!! Thank you!
Description of the bug:
Running coverage on a cc_test randomly fails with the following error:
I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path_to_the_test_dir>/_coverage
Afaik, the condition under which the bug happens might be:
Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
bazel coverage <test> --disk_cache=<cache> --remote_upload_local_results=true --execution_log_json_file=<file_name>.
"commandArgs": ["external/bazel_tools/tools/test/collect_coverage.sh"],
I/O exception during sandboxed execution: Input is a directory: bazel-out/k8-fastbuild/testlogs/<path_to_the_test_dir>/_coverage.
Which operating system are you running Bazel on?
Ubuntu 20.04.6 LTS
What is the output of bazel info release?
release 7.0.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
Probably it happens after commit 1267631fee7523ea3ae572c9b8c9c1fc9c8c9452, when Bazel tries to fetch the metadata of a TreeArtifact (the _coverage directory).
Have you found anything relevant by searching the web?
No.
Any other information, logs, or outputs that you want to share?
No response