bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
22.99k stars 4.03k forks source link

Previous failing test masks correct result in remote cache #11057

Closed gergelyfabian closed 1 month ago

gergelyfabian commented 4 years ago

ATTENTION! Please read and follow:

  • if this is a question about how to build / test / query / deploy using Bazel, or a discussion starter, send it to bazel-discuss@googlegroups.com
  • if this is a bug or feature request, fill the form below as best as you can.

Description of the problem / feature request:

Assume we have two branches of a git repository. master with passing unit tests and a unit_test_failure branch with failing unit tests. I have disk_cache enabled. I run:

$ git checkout master $ bazel test //... -> unit tests pass. No cache hit. $ bazel test //... -> unit tests pass. Cache hit. $ git checkout unit_test_failure $ bazel test //... -> unit tests fail $ git checkout master $ bazel test //... -> unit tests pass. No cache hit.

If instead, I add "rm -rf bazel-out/k8-fastbuild/testlogs/" right before running the last "bazel test" on master (or "bazel clean"), then there will be a remote cache hit:

$ git checkout master $ bazel test //... -> unit tests pass. No cache hit. $ bazel test //... -> unit tests pass. Cache hit. $ git checkout unit_test_failure $ bazel test //... -> unit tests fail $ git checkout master

Run "bazel clean" or "rm -rf bazel-out/k8-fastbuild/testlogs/"

$ rm -rf bazel-out/k8-fastbuild/testlogs/ $ bazel test //... -> unit tests pass. Cache hit.

I would expect a remote cache hit also in the first case. But this does not happen.

I suppose this is a bug. If not, please let me know what I'm doing wrong.

Feature requests: what underlying problem are you trying to solve with this feature?

Replace this line with your answer.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

See in Description. Please use my repo on github: https://github.com/gergelyfabian/bazel-scala-example. The example branches I used in this description (master and unit_test_failure) are there and testable.

What operating system are you running Bazel on?

Ubuntu 18.04.

What's the output of bazel info release?

INFO: Invocation ID: a2999d7d-0b72-4f0f-b42d-9590c7ea672f release 2.2.0

Tested on Bazel 2.0.0 and 2.2.0.

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Replace this line with your answer.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

git@github.com:gergelyfabian/bazel-scala-example.git e1e4ae4da5a3d8c18a68e1c8c6baddb34d187659 9b4a64e1be19b28ff73941446d6729cd98dcd8cc

Have you found anything relevant by searching the web?

There is another issue that seems similar: #9389.

Any other information, logs, or outputs that you want to share?

Case reproducing the bug:

$ git checkout master Already on 'master' Your branch is up to date with 'origin/master'. $ bazel test //... INFO: Invocation ID: 92696c4f-639f-4e5b-87b7-f22adcfd8d3a INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 0.167s, Critical Path: 0.00s INFO: 0 processes. INFO: Build completed successfully, 1 total action //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 0.0s

Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 1 total action $ git checkout unit_test_failure
Switched to branch 'unit_test_failure' $ bazel test //... INFO: Invocation ID: 1eef1f1d-7df4-4d3d-ac92-97e638fe17e9 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log) INFO: Elapsed time: 1.233s, Critical Path: 1.07s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed, 1 test FAILED, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test FAILED in 1.0s /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log

Executed 1 out of 2 tests: 1 test passes and 1 fails locally. INFO: Build completed, 1 test FAILED, 3 total actions $ git checkout master Switched to branch 'master' Your branch is up to date with 'origin/master'. $ bazel test //... INFO: Invocation ID: 3a21e211-1902-4433-b344-9f04b3d83558 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 1.240s, Critical Path: 1.07s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed successfully, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test PASSED in 1.0s

Executed 1 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 3 total actions


Forcing remote cache hit:

$ git checkout master Already on 'master' Your branch is up to date with 'origin/master'. $ $ bazel test //... INFO: Invocation ID: 71339ed1-128c-412c-a6cd-6e812b38a880 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... INFO: Elapsed time: 0.234s, Critical Path: 0.01s INFO: 0 processes. INFO: Build completed successfully, 1 total action //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 1.0s

Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 1 total action $ git checkout unit_test_failure
Switched to branch 'unit_test_failure' $ bazel test //... INFO: Invocation ID: ec056e85-d024-4072-916b-d89fd017b278 INFO: Analyzed 5 targets (0 packages loaded, 0 targets configured). INFO: Found 3 targets and 2 test targets... FAIL: //example-maven:test (see /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log) INFO: Elapsed time: 1.285s, Critical Path: 1.04s INFO: 2 processes: 1 remote cache hit, 1 linux-sandbox. INFO: Build completed, 1 test FAILED, 3 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test FAILED in 1.0s /home/user/.cache/bazel/_bazel_user/d4795b862c3f0b5e82d41ecc9ec237d2/execroot/scala_example/bazel-out/k8-fastbuild/testlogs/example-maven/test/test.log

Executed 1 out of 2 tests: 1 test passes and 1 fails locally. INFO: Build completed, 1 test FAILED, 3 total actions $ git checkout master Switched to branch 'master' Your branch is up to date with 'origin/master'. $ bazel clean INFO: Invocation ID: 0cbb6d3d-a42d-41c2-9ece-dde8feb7d795 INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes. $ bazel test //... INFO: Invocation ID: afbf5193-0f6a-4b0e-8b6d-a301aee6ba11 INFO: Analyzed 5 targets (58 packages loaded, 1299 targets configured). INFO: Found 3 targets and 2 test targets... INFO: From Building external/com_google_protobuf/libprotobuf_java.jar (123 source files, 1 source jar): warning: -parameters is not supported for target value 1.7. Use 1.8 or later. INFO: Elapsed time: 5.215s, Critical Path: 2.67s INFO: 212 processes: 212 remote cache hit. INFO: Build completed successfully, 245 total actions //example-lib:test (cached) PASSED in 0.0s //example-maven:test (cached) PASSED in 0.0s

Executed 0 out of 2 tests: 2 tests pass. INFO: Build completed successfully, 245 total actions

gergelyfabian commented 4 years ago

Workaround

When returning to the proper branch, instead of running "bazel clean", it is enough to run:

rm -rf bazel-out/k8-fastbuild/testlogs/

Then the test results will be properly taken from remote cache. I'd consider doing this always before running tests (e.g. in a CICD tool).

dslomov commented 4 years ago

/cc @meteorcloudy

ulfjack commented 4 years ago

I think this is just a display issue - when Bazel gets a test result from the disk cache, it incorrectly doesn't mark it as a cache hit even though it is. Try adding a sleep 20 to the test and see how long it takes?

gergelyfabian commented 4 years ago

I added a sleep 20s to the test (not yet committed), and yes, it's not just a display issue, it is indeed waiting 20s when running the test again on the "good" branch.

If I delete testlogs, then it's not waiting 20s and has the results right away.

gergelyfabian commented 4 years ago

Just a clarification... I think this is not a bug in remote cache (because the remote cache stores good results), but rather somewhere in the "communication" between the sandbox and the remote cache. Why is the previous failed test result in the sandbox "masking" the good result from the remote cache?

meteorcloudy commented 4 years ago

Thanks for reporting this! I'll look into the cause.

gergelyfabian commented 4 years ago

One addition. If I have two branches, that differ in their test code, but tests are passing on both branches, assuming the remote cache is already populated, then the behavior is the following:

  1. test on branch1, tests pass, remote cache hit.
  2. test on branch1, tests pass, local cache hit.
  3. test on branch2, tests pass, remote cache hit.
  4. test on branch2, tests pass, local cache hit.
  5. test on branch1, tests pass, remote cache hit.

It seems that it's only the fail that somehow "tags" a failed test target and forces it to be rebuilt when switching back to the "good" branch. This can be avoided by the workaround to run "bazel clean" or delete the testlogs.

meteorcloudy commented 4 years ago

@gergelyfabian Thanks for the detailed info! I can also reproduce this with a very simple case: https://github.com/meteorcloudy/my_tests/tree/master/test_result_cache_test, both on Windows and Linux. Investigating now..

meteorcloudy commented 4 years ago

I think the problematic logic is here: https://github.com/bazelbuild/bazel/blob/a9b118bbe1d7c6c4be4524547dddc1fc21ba60ad/src/main/java/com/google/devtools/build/lib/analysis/test/TestRunnerAction.java#L429

Here if Bazel detects previous test run failed, it will unconditionally rerun the test without taking any cache into consideration.

rm bazel-out/k8-fastbuild/testlogs/ worked because it removed the previous test status.

Let me see how can we optimize the logic.

meteorcloudy commented 4 years ago

Update: If we simply remove this check, we'll cause failed test to be cached.

ulfjack commented 4 years ago

Removing the check isn't the right solution. I think the issue is that we use that to also toggle a bit in the remote caching protocol on whether we accept a cache hit or not. If we assume that the remote cache never caches failed actions, then we don't need to set that bit. However, we do need to set the bit if --cache_test_results=no is set.

ulfjack commented 4 years ago

To clarify - the check in TestRunnerAction is also applied to the local cache, where we do want this behavior - at least that's what the flag documentation for cache_test_results says. You could work around by setting --cache_test_results=yes.

meteorcloudy commented 4 years ago

But --cache_test_results=yes will also cause failed test not to rerun anymore. @ulfjack Can you tell me where do we toggle the bit for remote caching protocol? I'm not quite sure how we decide a test will be cached by remote caching or not.

ulfjack commented 4 years ago

https://github.com/bazelbuild/bazel/blob/7aa74984e99d7b81824916000852dbf6e442502a/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L249

ulfjack commented 4 years ago

And here: https://github.com/bazelbuild/bazel/blob/7aa74984e99d7b81824916000852dbf6e442502a/src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java#L115

ulfjack commented 4 years ago

Anyway, that's the class that's primarily responsible.

github-actions[bot] commented 1 year ago

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

juanzolotoochin commented 7 months ago

@bazelbuild/triage can this issue be reopened? it's a big issue when bazel is used for large integration tests in a CI with persistent workers.

Any code change that goes through CI can affect code in a way that breaks one or many large tests. Because of this bug, it also leaves the worker in a state in which it will not reuse test results from remote cache when moving on to work on the CI run for another, completely unrelated PR.

tomdegoede commented 5 months ago

Wow I just observed this issue when preparing a Bazel demo. This explains some weird cache misses in our CI as well. Since this discussion the implementation of maybeReadCacheStatus has changed which breaks the workaround of deleting test.cache_status files. Not loading a cached test result when the previous local run has failed, but code modifications have been reverted, is most definitely a bug.

I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn. Could you confirm my thought @meteorcloudy @ulfjack ? I'm willing to work on a PR for this.

tomdegoede commented 1 month ago

@tjgq could you verify my thoughts above?

tjgq commented 1 month ago

I feel like the fix should be relatively easy as TestRunnerAction.shouldCacheResult should never consider the previous TestResultData as it is invoked for a new TestRunnerSpawn.

Did you mean "shouldCacheResult should ignore the previous TestResultData if it is invoked with a different TestRunnerSpawn than the one that produced the TestResultData"? Conceptually I agree, but I'm not sure how to translate that into code without keeping around a reference to the TestRunnerSpawn (which might be a bad idea).

It's also worth keeping in mind that, since the status file is persisted to disk, any solutions that rely on keeping additional in-memory state won't fix the issue across server restarts. For that, we'd need to store something in the status file that captures the identity of the test (e.g. something like j.c.g.devtools.build.lib.remote.ActionKey, although that particular choice is architecturally awkward because the core test machinery should not depend on the remote package which is mostly Bazel-specific).

tjgq commented 1 month ago

I also believe this cache bypass logic would not be needed at all if we could ensure that a failed action can never be stored in a disk or remote cache. This is already the case for a disk cache or for a remote cache that only Bazel writes to (i.e., no remote execution, nor clients other than Bazel), because Bazel will never upload a failed action. But it's an issue for remote execution (or a remote cache that may have been previously used with remote execution) because the protocol is silent on whether failed actions are cached, and doesn't give clients a way to ask one way or the other (they can only force unconditional reexecution of a cached action, not reexecution only in the case of a cached failure).

henkerik commented 1 month ago

Hey @tjgq, I'm probably missing a lot here but shouldn't we not simply change the implementation of computeExecuteUnconditionallyFromTestStatus? Something like this:

https://github.com/dhlparcel/bazel/pull/1/files

Just joining the conversation because I run into this issue at work as well and there is finally some more activity on this issue. 👍 . In the past we simply removed the testlogs as mentioned earlier in this issue, but that doesn't seem to work anymore.

tjgq commented 1 month ago

Maybe, but since I don't fully understand the reason for the existence of this logic (my best guess is what I've described in https://github.com/bazelbuild/bazel/issues/11057#issuecomment-2262971292 but it's just that, a guess) I'm worried that removing it might have adverse effects. Let me do a bit of archaeology and get back to you.

tjgq commented 1 month ago

For future reference, here's a super tiny repro:

cat >BUILD <<EOF
sh_test(srcs = ["test.sh"])
EOF
cat >.bazelrc <<EOF
build --disk_cache=/tmp/disk_cache
EOF
touch test.sh && chmod +x test.sh

echo "exit 0" > test.sh
bazel test :test   # build successful, disk cache miss
bazel test :test   # build successful, disk cache hit
echo "exit 1" > test.sh
bazel build :test  # build failure, disk cache miss
echo "exit 0" > test.sh
bazel test :test  # build successful, disk cache miss (but expected disk cache hit)

After some spelunking I think I know what the correct fix looks like. Expect a commit soon.

fmeum commented 1 month ago

@bazel-io flag

iancha1992 commented 1 month ago

@bazel-io fork 7.4.0