bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Not caching remotely when test has failed previously #9389

Open exoson opened 5 years ago

exoson commented 5 years ago

Description of the problem / feature request:

When a test fails, rerunning it neither stores the result in the remote cache nor tries to fetch a result from the remote cache. Similar problems occur with remote execution: in that case the Action and ExecuteRequest sent to the remote execution server have DoNotCache and SkipCacheLookup, respectively, set to true.

This is a problem because changing the test does not reset the caching behavior: the test has to be rerun until the previous status recorded for it is a pass, and only then is caching enabled again. We are currently deleting all the test.cache_status files under the bazel-testlogs symlink as a workaround.
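The workaround mentioned here can be sketched as a small script. This is only an illustration, assuming it is run from the workspace root where the `bazel-testlogs` symlink exists; the function name `clear_test_cache_status` is made up for this example and is not part of Bazel.

```python
import os


def clear_test_cache_status(testlogs_root="bazel-testlogs"):
    """Delete every test.cache_status file below testlogs_root.

    Returns the sorted list of removed paths. The function name and the
    default root are assumptions for this sketch, not part of Bazel.
    """
    removed = []
    # os.walk lists the directory behind a top-level symlink, but does not
    # descend into symlinked subdirectories (followlinks defaults to False).
    for dirpath, _dirnames, filenames in os.walk(testlogs_root):
        for name in filenames:
            if name == "test.cache_status":
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Calling `clear_test_cache_status()` before `bazel test` would remove the recorded status of earlier runs, which is what the manual workaround does.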

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

The .bazelrc used contains:

build:remote_cache --remote_http_cache=<cacheip>
build:remote_cache --remote_local_fallback=true
build:remote_cache --remote_upload_local_results=true

Example shell script which reproduces the missing upload.

bazel test --config=remote_cache //:go_default_test # Test didn't pass
git apply fix_go_default_test.patch # Apply patch for fixing the test
bazel test --config=remote_cache //:go_default_test # Test passes but test result isn't put into the remote cache
bazel clean
bazel test --config=remote_cache //:go_default_test # Test isn't found in the remote cache and is run locally.

Example which reproduces the missing cache lookup

# Test is broken in environment1
environment1$ bazel test --config=remote_cache //:go_default_test 
# Test didn't pass
# Test is not broken in environment2
environment2$ bazel test --config=remote_cache //:go_default_test
# Test passes and result is uploaded to remote cache
# Fixed version of the test is fetched to environment1
environment1$ bazel test --config=remote_cache //:go_default_test
# Test isn't fetched from the remote cache and it is instead run locally.

What operating system are you running Bazel on?

Ubuntu 16.04

What's the output of bazel info release?

release 0.29.1

Have you found anything relevant by searching the web?

Nope

buchgr commented 5 years ago

Is what you are seeing that Bazel caches failed tests?

gergelyfabian commented 4 years ago

I have opened a similar issue in #11057.

ulfjack commented 4 years ago

I believe that Bazel doesn't write results from failed actions (including tests) to the remote cache. This is an intentional design decision.

gergelyfabian commented 4 years ago

I think it's rather about the result from the previous failed test masking the proper result from the remote cache.

ulfjack commented 4 years ago

I don't understand what you mean with "masking".

Nevertheless, there is indeed a bug there. When rerunning a failed test, Bazel adds the NO_CACHE tag to the spawn which prevents it from writing to the cache. It looks like the NO_CACHE tag is used to mean both "do not read from cache" and "do not write to cache", whereas the intention is to only mean "do not read from cache": https://github.com/bazelbuild/bazel/blob/9885c2f31b731957a79033759024a48254a67dca/src/main/java/com/google/devtools/build/lib/exec/StandaloneTestStrategy.java#L111

That's annoying. May need to add a NO_READ_CACHE tag and make both the remote cache and remote execution handle that.
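The distinction being suggested can be illustrated with a small model. This is a sketch, not Bazel's code: `CachePolicy`, `policy_for_rerun`, and the `split_read_write` switch are hypothetical names; only the `NO_CACHE` behavior and the idea of a `NO_READ_CACHE` tag come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    """Hypothetical model of what a spawn may do with the remote cache."""
    allow_read: bool
    allow_write: bool


def policy_for_rerun(previous_run_failed, split_read_write):
    """Return the cache policy for a test spawn.

    split_read_write=False models the reported behavior, where a single
    NO_CACHE tag disables both lookups and uploads.
    split_read_write=True models the suggested NO_READ_CACHE tag, which
    would disable lookups only.
    """
    if not previous_run_failed:
        return CachePolicy(allow_read=True, allow_write=True)
    if split_read_write:
        return CachePolicy(allow_read=False, allow_write=True)
    return CachePolicy(allow_read=False, allow_write=False)
```

Under this model, a rerun after a failure with `split_read_write=True` would still execute the test, but a passing result could be uploaded, so a later clean build would be expected to find it in the remote cache.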

scele commented 4 years ago

> I don't understand what you mean with "masking".

I think @gergelyfabian refers to "Example which reproduces the missing cache lookup" reported in this issue, and https://github.com/bazelbuild/bazel/issues/11057: even if you have a passing test result stored in the remote cache, bazel may not accept it if the same test has previously failed locally (because the local failing result "masks" the remote passing result).

> When rerunning a failed test, Bazel adds the NO_CACHE tag to the spawn which prevents it from writing to the cache. It looks like the NO_CACHE tag is used to mean both "do not read from cache" and "do not write to cache", whereas the intention is to only mean "do not read from cache" -- May need to add a NO_READ_CACHE tag and make both the remote cache and remote execution handle that.

I think we're basically asking if even using NO_READ_CACHE is necessary. If the test has a passing result in the remote cache, why can't bazel just use it regardless of the local state (previous failing local test result or not)?

gergelyfabian commented 4 years ago

> I don't understand what you mean with "masking".
>
> I think @gergelyfabian refers to "Example which reproduces the missing cache lookup" reported in this issue, and #11057: even if you have a passing test result stored in the remote cache, bazel may not accept it if the same test has previously failed locally (because the local failing result "masks" the remote passing result).

Exactly, thank you :)

> I think we're basically asking if even using NO_READ_CACHE is necessary. If the test has a passing result in the remote cache, why can't bazel just use it regardless of the local state (previous failing local test result or not)?

Yes, I think this would be the natural behavior a user would expect.

ulfjack commented 4 years ago

I see your point. I think we'll need to create a table with all the possible input / output combinations and see what makes sense. I won't have time to work on it this week, unfortunately.
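One possible starting point for such a table is to enumerate the combinations mechanically. The sketch below is a simplified model, not a statement of what Bazel does internally: the two inputs (whether the previous local run failed, whether the remote cache holds a passing result) and the two rule sets ("reported" behavior versus always accepting a remote pass) are taken from the comments in this thread, and all names are hypothetical.

```python
from itertools import product

RULES = ("reported", "always_read")


def outcome(previous_failed, remote_has_pass, rule):
    """Return (source_of_result, may_upload_pass) for one combination."""
    if rule == "reported":
        # As described in this issue: a previous local failure disables
        # both the cache lookup and the upload.
        may_read = not previous_failed
        may_write = not previous_failed
    elif rule == "always_read":
        # The behavior asked for in the thread: local state is ignored.
        may_read = True
        may_write = True
    else:
        raise ValueError(f"unknown rule: {rule}")
    if may_read and remote_has_pass:
        return ("remote cache", False)  # nothing new to upload
    return ("executed", may_write)


def build_table():
    """Enumerate all input combinations under every rule set."""
    rows = []
    for previous_failed, remote_has_pass in product((False, True), repeat=2):
        row = {"previous_failed": previous_failed,
               "remote_has_pass": remote_has_pass}
        for rule in RULES:
            row[rule] = outcome(previous_failed, remote_has_pass, rule)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in build_table():
        print(row)
```

The rows where the two rule sets disagree are the ones with `previous_failed=True`, which are the cases reported in this issue and in #11057.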

@meisterT - I'm not sure whether this affects Google, because it's using separate TestStrategy implementations. If it does, you might be losing remote execution capacity (my guesstimate is single-digit percent) and performance for some interactive builds. If we need to change the interface here (I suspect we need to change TestRunnerAction.shouldCacheResult()), the internal implementations will have to be adjusted as well.

meisterT commented 4 years ago

This is documented behavior: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/test/TestConfiguration.java;l=113

We can of course discuss whether it makes sense one way or the other.

ulfjack commented 4 years ago

It's not clear how the documentation should apply to remote caching / execution. For example, should Bazel be allowed to write to the remote cache after a failed test? It's writing to the local cache, right? Also, the remote cache generally doesn't store failed test results, so why would we need to explicitly tell it to not read a cached entry since it generally cannot be the result from the failing run?

The intent behind the documentation is that you do not get a cached failure, not that you don't get a cached pass (you can never get a cached pass from Skyframe or from the local action cache, and the documentation writer may only have thought of those two, not of the on-disk or remote caches). The documentation might actually pre-date the widespread use of remote caching inside Google - this part of the code is pretty old.

Regardless, we seem to have a case where we're losing performance and remote execution capacity for no obvious reason. Even if the documentation were fully prescriptive about the remote cache/execution, this seems like a good reason to, at least, consider changing the current behavior.

sgowroji commented 1 year ago

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

juanzolotoochin commented 9 months ago

We've been hit by this a few times and it is very annoying. I also don't understand the point of the current behavior.

Why wouldn't bazel be able to use the remote cache results of a test just because it failed locally?

quic-fmedley commented 5 months ago

(Edit 2024-06-11 07:22 UTC: Missed that some of the runs were XML report generation.)

This problem is still present in 7.2.0: Bazel is basically executing tests one extra time after a failure. The minimal reproducible example below produces 5 different actions from 2 tests, return 1 and return 0:

return 1 version:
  size_bytes: 293, timeout: 5m, do_not_cache: false - test-setup.sh, correct
  size_bytes: 290, timeout: null, do_not_cache: true - generate-xml.sh, correct but why not cache?

return 0 version:
  size_bytes: 295, timeout: 5m, do_not_cache: true - test-setup.sh, WRONG
  size_bytes: 288, timeout: null, do_not_cache: false - generate-xml.sh, correct
  size_bytes: 293, timeout: 5m, do_not_cache: false - test-setup.sh, correct

These files are needed

my_test.c
int main() { return MY_TEST_RESULT; }

BUILD.bazel
cc_test(name = "my_test", srcs = ["my_test.c"])

.bazelrc
build --remote_executor=...

Test script

#!/bin/bash
set -eux -o pipefail

uuid=$(uuidgen)

bazel shutdown
bazel clean
# The test should fail, return value 1.
! bazel test :my_test --copt=-DMY_TEST_RESULT=1 --test_env=MYTESTID="${uuid}" --remote_grpc_log=grpc1.log
# The test should succeed, return value 0.
bazel test :my_test --copt=-DMY_TEST_RESULT=0 --test_env=MYTESTID="${uuid}" --remote_grpc_log=grpc2.log

# Check cache hit.
bazel shutdown
bazel clean
bazel test :my_test --copt=-DMY_TEST_RESULT=0 --test_env=MYTESTID="${uuid}" --remote_grpc_log=grpc3.log

set +x

# remote_client from https://github.com/bazelbuild/tools_remote
echo
echo "Result from empty cache, failing test"
remote_client --grpc_log grpc1.log printlog | grep -A15 TestRunner | grep -B9 -A8 -E 'GetActionResult|Execute' | grep -E 'method_name|hash|size_bytes|skip_cache_lookup'
echo
echo "Rebuilding, successful test"
remote_client --grpc_log grpc2.log printlog | grep -A15 TestRunner | grep -B9 -A8 -E 'GetActionResult|Execute' | grep -E 'method_name|hash|size_bytes|skip_cache_lookup'
echo
echo "Building again, should be 100% cache hit"
remote_client --grpc_log grpc3.log printlog | grep -A15 TestRunner | grep -B9 -A8 -E 'GetActionResult|Execute' | grep -E 'method_name|hash|size_bytes|skip_cache_lookup'

The test script output looks like this (with annotations):

Result from empty cache, failing test
method_name: "build.bazel.remote.execution.v2.ActionCache/GetActionResult"
        hash: "d9a2db63b62f8273ad8b6ea07152d988185831536768a8982f83820beb19d5ab"
        size_bytes: 293  <-- timeout=5m, doNotCache=false, test-setup.sh, CORRECT
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
        hash: "d9a2db63b62f8273ad8b6ea07152d988185831536768a8982f83820beb19d5ab"
        size_bytes: 293  <-- Same as above
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
      skip_cache_lookup: true
        hash: "8e2e2a2a71d52b7cc932468193202dbe1c184b515138b3da122d93b32a165709"
        size_bytes: 290  <-- doNotCache=true, generate-xml.sh, why no caching?

Rebuilding, successful test
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
      skip_cache_lookup: true
        hash: "ec29bbf02f01e8777325a034ed13f33a9219105fb9a7144b456bab846c5b301f"
        size_bytes: 295  <-- timeout=5m, doNotCache=true, test-setup.sh, WRONG
method_name: "build.bazel.remote.execution.v2.ActionCache/GetActionResult"
        hash: "02a7d1673e3463a1e25a469caacbfafd1a570ad7600c6daf8bf4eded3559566b"
        size_bytes: 288  <-- timeout unset, doNotCache=false, generate-xml.sh
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
        hash: "02a7d1673e3463a1e25a469caacbfafd1a570ad7600c6daf8bf4eded3559566b"
        size_bytes: 288  <-- Same as above

Building again, should be 100% cache hit
method_name: "build.bazel.remote.execution.v2.ActionCache/GetActionResult"
        hash: "3064a41bfc130899bfc9d869e3fbaf4caa2d76ea55777676246591309a054def"
        size_bytes: 293  <-- timeout=5m, doNotCache=false, test-setup.sh, CORRECT
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
        hash: "3064a41bfc130899bfc9d869e3fbaf4caa2d76ea55777676246591309a054def"
        size_bytes: 293  <-- Same as above
method_name: "build.bazel.remote.execution.v2.ActionCache/GetActionResult"
        hash: "02a7d1673e3463a1e25a469caacbfafd1a570ad7600c6daf8bf4eded3559566b"
        size_bytes: 288  <-- timeout unset, doNotCache=false, generate-xml.sh
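
Reading these listings by eye is error-prone, so the filtered output can also be summarized programmatically. The following is a minimal sketch that assumes the exact line format shown above (a `method_name` line followed by optional `skip_cache_lookup`, `hash` and `size_bytes` lines, with any `<--` annotation ignored); `Call`, `parse_calls` and `executed_digests` are names invented for this example and are not part of `remote_client`.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One gRPC call from the filtered log (hypothetical helper type)."""
    method: str
    digest_hash: str = ""
    size_bytes: int = 0
    skip_cache_lookup: bool = False


def parse_calls(text):
    """Parse the grep-filtered printlog output into a list of Call records."""
    calls = []
    for raw in text.splitlines():
        line = raw.split("<--")[0].strip()  # drop hand-written annotations
        if ":" not in line:
            continue  # blank lines and section titles
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip().strip('"')
        if key == "method_name":
            # Keep only the RPC name, e.g. "GetActionResult" or "Execute".
            calls.append(Call(method=value.rsplit("/", 1)[-1]))
        elif not calls:
            continue  # field without a preceding method_name
        elif key == "hash":
            calls[-1].digest_hash = value
        elif key == "size_bytes":
            calls[-1].size_bytes = int(value)
        elif key == "skip_cache_lookup":
            calls[-1].skip_cache_lookup = value == "true"
    return calls


def executed_digests(calls):
    """Digests sent to Execute, i.e. not satisfied by GetActionResult."""
    return sorted({c.digest_hash for c in calls if c.method == "Execute"})
```

Applied to the third listing, `executed_digests` would report only the digest starting with `3064a41b`: the test action was sent to `Execute` again, while the generate-xml.sh action was satisfied by the `GetActionResult` lookup.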