Closed gergelyfabian closed 1 month ago
Workaround
When returning to the proper branch, instead of running "bazel clean", it is enough to run:
rm -rf bazel-out/k8-fastbuild/testlogs/
Then the test results will be properly taken from remote cache. I'd consider doing this always before running tests (e.g. in a CICD tool).
/cc @meteorcloudy
I think this is just a display issue - when Bazel gets a test result from the disk cache, it incorrectly doesn't mark it as a cache hit even though it is. Try adding a sleep 20 to the test and see how long it takes?
I added a sleep 20s to the test (not yet committed), and yes, it's not just a display issue, it is indeed waiting 20s when running the test again on the "good" branch.
If I delete testlogs, then it's not waiting 20s and has the results right away.
Just a clarification... I think this is not a bug in remote cache (because the remote cache stores good results), but rather somewhere in the "communication" between the sandbox and the remote cache. Why is the previous failed test result in the sandbox "masking" the good result from the remote cache?
Thanks for reporting this! I'll look into the cause.
One addition. If I have two branches, that differ in their test code, but tests are passing on both branches, assuming the remote cache is already populated, then the behavior is the following:
It seems that it's only the fail that somehow "tags" a failed test target and forces it to be rebuilt when switching back to the "good" branch. This can be avoided by the workaround to run "bazel clean" or delete the testlogs.
@gergelyfabian Thanks for the detailed info! I can also reproduce this with a very simple case: https://github.com/meteorcloudy/my_tests/tree/master/test_result_cache_test, both on Windows and Linux. Investigating now..
I think the problematic logic is here: https://github.com/bazelbuild/bazel/blob/a9b118bbe1d7c6c4be4524547dddc1fc21ba60ad/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L429
Here if Bazel detects previous test run failed, it will unconditionally rerun the test without taking any cache into consideration.
rm bazel-out/k8-fastbuild/testlogs/
worked because it removed the previous test status.
Let me see how can we optimize the logic.
Update: If we simply remove this check, we'll cause failed test to be cached.
Removing the check isn't the right solution. I think the issue is that we use that to also toggle a bit in the remote caching protocol on whether we accept a cache hit or not. If we assume that the remote cache never caches failed actions, then we don't need to set that bit. However, we do need to set the bit if --cache_test_results=no
is set.
To clarify - the check in TestRunnerAction is also applied to the local cache, where we do want this behavior - at least that's what the flag documentation for cache_test_results
says. You could work around by setting --cache_test_results=yes
.
But --cache_test_results=yes
will also cause failed test not to rerun anymore.
@ulfjack Can you tell me where do we toggle the bit for remote caching protocol? I'm not quite sure how we decide a test will be cached by remote caching or not.
Anyway, that's the class that's primarily responsible.
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage
) if you think this issue is still relevant or you are interested in getting the issue resolved.
This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage
). Thanks!
@bazelbuild/triage can this issue be reopened? it's a big issue when bazel is used for large integration tests in a CI with persistent workers.
Any code change that goes through CI can affect code in a way that breaks one or many large tests. Because of this bug, it also leaves the worker in a state in which it will not reuse test results from remote cache when moving on to work on the CI run for another, completely unrelated PR.
Wow I just observed this issue when preparing a Bazel demo. This explains some weird cache misses in our CI as well. Since this discussion the implementation of maybeReadCacheStatus has changed which breaks the workaround of deleting test.cache_status files. Not loading a cached test result when the previous local run has failed, but code modifications have been reverted, is most definitely a bug.
I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn. Could you confirm my thought @meteorcloudy @ulfjack ? I'm willing to work on a PR for this.
@tjgq could you verify my thoughts above?
I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn.
Did you mean "shouldCacheResult should ignore the previous TestResultData if it is invoked with a different TestRunnerSpawn than the one that produced the TestResultData"? Conceptually I agree, but I'm not sure how to translate that into code without keeping around a reference to the TestRunnerSpawn (which might be a bad idea).
It's also worth keeping in mind that, since the status file is persisted to disk, any solutions that rely on keeping additional in-memory state won't fix the issue across server restarts. For that, we'd need to store something in the status file that captures the identity of the test (e.g. something like j.c.g.devtools.build.lib.remote.ActionKey
, although that particular choice is architecturally awkward because the core test machinery should not depend on the remote
package which is mostly Bazel-specific).
I also believe this cache bypass logic would not be needed at all if we could ensure that a failed action can never be stored in a disk or remote cache. This is already the case for a disk cache or for a remote cache that only Bazel writes to (i.e., no remote execution, nor clients other than Bazel), because Bazel will never upload a failed action. But it's an issue for remote execution (or a remote cache that may have been previously used with remote execution) because the protocol is silent on whether failed actions are cached, and doesn't give clients a way to ask one way or the other (they can only force unconditional reexecution of a cached action, not reexecution only in the case of a cached failure).
Hey @tjgq, I'm probably missing a lot here but shouldn't we not simply change the implementation of computeExecuteUnconditionallyFromTestStatus
? Something like this:
https://github.com/dhlparcel/bazel/pull/1/files
Just joining the conversation because I run into this issue at work as well and there is finally some more activity on this issue. 👍 . In the past we simply removed the testlogs as mentioned earlier in this issue, but that doesn't seem to work anymore.
Maybe, but since I don't fully understand the reason for the existence of this logic (my best guess is what I've described in https://github.com/bazelbuild/bazel/issues/11057#issuecomment-2262971292 but it's just that, a guess) I'm worried that removing it might have adverse effects. Let me do a bit of archaeology and get back to you.
For future reference, here's a super tiny repro:
cat >BUILD <<EOF
sh_test(srcs = ["test.sh"])
EOF
cat >.bazelrc <<EOF
build --disk_cache=/tmp/disk_cache
EOF
touch test.sh && chmod +x test.sh
echo "exit 0" > test.sh
bazel test :test # build successful, disk cache miss
bazel test :test # build successful, disk cache hit
echo "exit 1" > test.sh
bazel build :test # build failure, disk cache miss
echo "exit 0" > test.sh
bazel test :test # build successful, disk cache miss (but expected disk cache hit)
After some spelunking I think I know what the correct fix looks like. Expect a commit soon.
@bazel-io flag
@bazel-io fork 7.4.0
Description of the problem / feature request:
Assume we have two branches of a git repository. master with passing unit tests and a unit_test_failure branch with failing unit tests. I have disk_cache enabled. I run:
$ git checkout master $ bazel test //... -> unit tests pass. No cache hit. $ bazel test //... -> unit tests pass. Cache hit. $ git checkout unit_test_failure $ bazel test //... -> unit tests fail $ git checkout master $ bazel test //... -> unit tests pass. No cache hit.
If instead, I add "rm -rf bazel-out/k8-fastbuild/testlogs/" right before running the last "bazel test" on master (or "bazel clean"), then there will be a remote cache hit:
$ git checkout master $ bazel test //... -> unit tests pass. No cache hit. $ bazel test //... -> unit tests pass. Cache hit. $ git checkout unit_test_failure $ bazel test //... -> unit tests fail $ git checkout master
Run "bazel clean" or "rm -rf bazel-out/k8-fastbuild/testlogs/"
$ rm -rf bazel-out/k8-fastbuild/testlogs/ $ bazel test //... -> unit tests pass. Cache hit.
I would expect a remote cache hit also in the first case. But this does not happen.
I suppose this is a bug. If not, please let me know what I'm doing wrong.
Feature requests: what underlying problem are you trying to solve with this feature?
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
See in Description. Please use my repo on github: https://github.com/gergelyfabian/bazel-scala-example. The example branches I used in this description (master and unit_test_failure) are there and testable.
What operating system are you running Bazel on?
Ubuntu 18.04.
What's the output of
bazel info release
?INFO: Invocation ID: a2999d7d-0b72-4f0f-b42d-9590c7ea672f release 2.2.0
Tested on Bazel 2.0.0 and 2.2.0.
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.What's the output of
git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
?git@github.com:gergelyfabian/bazel-scala-example.git e1e4ae4da5a3d8c18a68e1c8c6baddb34d187659 9b4a64e1be19b28ff73941446d6729cd98dcd8cc
Have you found anything relevant by searching the web?
There is another issue that seems similar: #9389.
Any other information, logs, or outputs that you want to share?
Case reproducing the bug:
$ git checkout master Already on 'master' Your branch is up to date with 'origin/master'. $ bazel test //... INFO: Invocation ID: 92696c4f-639f-4e5b-87b7-f22adcfd8d3a INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 0.167s, Critical Path: 0.00s INFO: 0 processes. INFO: Build completed successfully, 1 total action //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 0.0s
Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 1 total action $ git checkout unit_test_failure
Switched to branch 'unit_test_failure' $ bazel test //... INFO: Invocation ID: 1eef1f1d-7df4-4d3d-ac92-97e638fe17e9 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log) INFO: Elapsed time: 1.233s, Critical Path: 1.07s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed, 1 test FAILED, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test FAILED in 1.0s /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log
Executed 1 out of 2 tests: 1 test passes and 1 fails locally. INFO: Build completed, 1 test FAILED, 3 total actions $ git checkout master Switched to branch 'master' Your branch is up to date with 'origin/master'. $ bazel test //... INFO: Invocation ID: 3a21e211-1902-4433-b344-9f04b3d83558 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 1.240s, Critical Path: 1.07s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed successfully, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test PASSED in 1.0s
Executed 1 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 3 total actions
Forcing remote cache hit:
$ git checkout master Already on 'master' Your branch is up to date with 'origin/master'. $ $ bazel test //... INFO: Invocation ID: 71339ed1-128c-412c-a6cd-6e812b38a880 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 0.234s, Critical Path: 0.01s INFO: 0 processes. INFO: Build completed successfully, 1 total action //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 1.0s
Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 1 total action $ git checkout unit_test_failure
Switched to branch 'unit_test_failure' $ bazel test //... INFO: Invocation ID: ec056e85-d024-4072-916b-d89fd017b278 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log) INFO: Elapsed time: 1.285s, Critical Path: 1.04s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed, 1 test FAILED, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test FAILED in 1.0s /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log
Executed 1 out of 2 tests: 1 test passes and 1 fails locally. INFO: Build completed, 1 test FAILED, 3 total actions $ git checkout master Switched to branch 'master' Your branch is up to date with 'origin/master'. $ bazel clean INFO: Invocation ID: 0cbb6d3d-a42d-41c2-9ece-dde8feb7d795 INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes. $ bazel test //... INFO: Invocation ID: afbf5193-0f6a-4b0e-8b6d-a301aee6ba11 INFO: Analyzed 5 targets (58 packages loaded, 1299 targets configured). INFO: Found 3 targets and 2 test targets... INFO: From Building external/com_google_protobuf/libprotobuf_java.jar (123 source files, 1 source jar): warning: -parameters is not supported for target value 1.7. Use 1.8 or later. INFO: Elapsed time: 5.215s, Critical Path: 2.67s INFO: 212 processes: 212 remote cache hit. INFO: Build completed successfully, 245 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 0.0s
Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 245 total actions