
`--experimental_remote_cache_eviction_retries` doesn't evict all metadata causing build failure #18696

Closed BalestraPatrick closed 2 months ago

BalestraPatrick commented 1 year ago

Description of the bug:

Hello!

We have been using --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel for the last few weeks, and we have noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation fails with the exact same error. This is not always the case (we have seen it recover correctly in some situations), but for one specific CppArchive action we see this failure in about 1% of our builds.

The logs look like the following:

# first invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248
...
Found remote cache eviction error, retrying the build...
...
# second invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248

The only way to recover from this failure seems to be to run bazel clean or disable BwtB (--remote_download_toplevel).
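
For reference, a minimal sketch of the flag combination described above as it might appear in a .bazelrc (the cache endpoint is a placeholder, not taken from the report):

# .bazelrc sketch; the --remote_cache endpoint is hypothetical
build --remote_cache=grpcs://cache.example.com
build --remote_download_toplevel
build --experimental_remote_cache_eviction_retries=1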

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In the gRPC log for the second invocation, we see the following:

---------------------------------------------------------
metadata {
  tool_details {
    tool_name: "bazel"
    tool_version: "6.1.0-7ccf98d" <- don't mind this, we build Bazel from source from our own fork and we currently pass the wrong version even though it's based off the 6.2.0 branch.
  }
  action_id: "7d45850557079c80512a065d2250586c4ac53cd91a468cc137e1ca59b3e1fbaa"
  tool_invocation_id: "007220b0-22de-401e-a407-5d6dd56225bf"
  correlated_invocations_id: "cdaeafdf-8dde-44e0-94a6-6dbc3c47c3a6"
  action_mnemonic: "CppArchive"
  target_id: "//path:lib"
  configuration_id: "702272d6c0c9a184ce867b340aaf98a381dfa7e5f316a7a64e42777e966f1345"
}
status {
  code: 1
  message: "com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248"
}
method_name: "google.bytestream.ByteStream/Write"
details {
  write {
  }
}
start_time {
  seconds: 1686831469
  nanos: 922000000
}
end_time {
  seconds: 1686831469
  nanos: 943000000
}

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

6.2.0 @ 286306e

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

brentleyjones commented 1 year ago

@coeuvre

coeuvre commented 1 year ago

That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state depends on the specific action type, i.e. if it works for other actions, it should work for CppArchive.

For the missing artifact, how long passed between when it was generated and when Bazel requested to download it? And how long does one invocation of this build usually take?

Is it possible to collect server-side logs of the access patterns for this particular blob, e.g. create time, refresh time, delete time, etc.?

brentleyjones commented 1 year ago

For the missing artifact, how long passed between when it was generated and when Bazel requested to download it? And how long does one invocation of this build usually take?

I don't think it matters? The first build fails with a missing cache item and the whole build is retried; the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead, the dependent action is simply run again (which is guaranteed to fail).

coeuvre commented 1 year ago

The generating action is expected to be re-run. Do you have an execution log for the generating action from the second invocation? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it.

brentleyjones commented 1 year ago

Patrick should be able to provide that.

The build log shows only 2 actions though (status.txt and the dependent action). In the gRPC log for the second run there isn't any upload (or really any calls) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows it wasn't generated locally.

coeuvre commented 1 year ago

The build log shows only 2 actions though (status.txt and the dependent action)

This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions.

BalestraPatrick commented 1 year ago

@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days.
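
For anyone trying to capture the same data, a hedged sketch of the flags that produce these logs, assuming they are available in the Bazel version in use (file paths are placeholders; the gRPC log flag was still experimental around 6.x and may be named differently in later releases):

# sketch: capture the execution log and the remote gRPC log for a failing invocation
build --execution_log_json_file=/tmp/exec_log.json
build --experimental_remote_grpc_log=/tmp/grpc.log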

brentleyjones commented 8 months ago

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

iancha1992 commented 8 months ago

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

cc: @coeuvre for visibility

coeuvre commented 8 months ago

IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause.

How frequently does the issue occur now on Bazel 7? Can you capture the logs again, given that the code has changed a lot in Bazel 7?

acecilia commented 6 months ago

We are also experiencing this with Bazel 7, but it is pretty hard to reproduce; I have not yet managed to reproduce it in a way that lets me troubleshoot it.

oliviernotteghem commented 6 months ago

We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritized?

coeuvre commented 6 months ago

It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4399b85d3bbc39fda2701287e014ecb919.

scaiper commented 5 months ago

7.2.0rc2 does not fix the issue for us.

coeuvre commented 5 months ago

@scaiper How frequently does it happen? Is it possible to share a minimal repro?

scaiper commented 5 months ago

@coeuvre We use Buildbarn as a remote cache. Whenever cache eviction happens we see this error. To reproduce, just delete the CAS from Buildbarn and leave the AC intact.

We were able to work around this by enabling this option in Buildbarn.

// Only return ActionResult messages for which all output files are
// present in the Content Addressable Storage (CAS). Certain
// clients, such as Bazel, require the use of this decorator. To
// reduce latency, it is advised that this decorator is used at the
// lowest level that has a full view of the entire CAS.
//
// This decorator must be placed on the Action Cache.
CompletenessCheckingBlobAccessConfiguration completeness_checking = 17;

It looks like Bazel assumes, without checking, that all outputs of cached actions exist in the CAS, which is not always the case.

coeuvre commented 5 months ago

As described in the comment, Bazel requires that the remote cache only return an AC entry when all of the referenced blobs are present in the CAS. So in your case it's more of a configuration issue than a bug.

scaiper commented 5 months ago

If it is working as designed, it is very unexpected.

acecilia commented 5 months ago

Bazel requires that the remote cache only return an AC entry when all of the referenced blobs are present in the CAS

Any idea how to make sure this happens when using Google Cloud?

EdSchouten commented 5 months ago

@scaiper Note that REv2 is quite explicit about this:

https://github.com/bazelbuild/remote-apis/blob/7f51b3676b5d19df726783d9a861e45d7389b5ae/build/bazel/remote/execution/v2/remote_execution.proto#L160-L165

Enabling completenessChecking is not a workaround. It's a necessity. Notice how the example configuration in bb-storage's README actually has it enabled. So the issue described in this ticket is essentially "Bazel doesn't work well if my remote cache doesn't behave according to the spec."

acecilia commented 5 months ago

Isn't this somewhat incompatible with the idea of artifact eviction in a remote cache?

To my understanding, this means that the cache has to be smarter and understand how to handle the TTL of stored cache artifacts based on the requirements of the consumer (in this case Bazel).

EdSchouten commented 4 months ago

Isn't this somewhat incompatible with the idea of artifact eviction in a remote cache?

Nope! Bazel is absolutely free to cache knowledge on which objects exist in the CAS. But for Bazel to accurately acquire this knowledge it must be able to rely on the remote cache returning references to valid objects as part of GetActionResult.

If the cache doesn’t do that, then Bazel has no way to recover. Namely, if it detects objects have gone missing Bazel would purge its own caches. But while rebuilding it would again call GetActionResult against the same action, and again get references to non-existent CAS objects. Which would cause Bazel to attempt recovery again, which […]

acecilia commented 4 months ago

Bazel is absolutely free to cache knowledge on which objects exist in the CAS

I have very limited knowledge of how this works on the Bazel side 😞 But given what you mention above, this scenario would happen:

  1. Bazel fetches the cache, gathering information about which objects exist in CAS
  2. Bazel caches this information
  3. Then the TTL for some of these objects is reached, and the remote cache removes them
  4. After that, Bazel, relying on its cached knowledge of the objects' existence, tries to fetch the objects
  5. But the objects are not in the cache anymore, thus failing

If the cache doesn’t do that, then Bazel has no way to recover. Namely, if it detects objects have gone missing Bazel would purge its own caches

Isn't this what --experimental_remote_cache_eviction_retries tries to recover from? That is: if a cache error is detected, the build is restarted, with the idea that the remote cache will provide the information again; this second time it lets Bazel know that the objects no longer exist, allowing it to handle the situation gracefully and complete the build.

GetActionResult against the same action, and again get references to non-existent CAS objects

I believe this would be a bug in the cache: the cache is telling a consumer that an object exists, when it does not 🤔

EdSchouten commented 4 months ago

GetActionResult against the same action, and again get references to non-existent CAS objects

I believe this would be a bug in the cache: the cache is telling a consumer that an object exists, when it does not 🤔

Exactly! And that’s what happens if you run Buildbarn’s bb_storage daemon without CompletenessChecking. So just make sure to enable that.

acecilia commented 4 months ago

I'm using Google Cloud; I am not aware of any configuration related to this :/

coeuvre commented 4 months ago

I'm using Google Cloud; I am not aware of any configuration related to this :/

Are you using RBE in Google Cloud, or GCS for the remote cache?

For RBE I believe it always does the completeness check. For GCS, it can't since it's just a blob storage which doesn't understand the relations between blobs.

acecilia commented 4 months ago

I am using GCS.

For GCS, it can't since it's just a blob storage which doesn't understand the relations between blobs.

This is exactly what I mentioned here. The cache cannot be a generic cache: it needs to be smart in order to handle Bazel's requirements.

coeuvre commented 4 months ago

If you are using a generic blob store (which doesn't understand REAPI) and the store can evict blobs, you have to set --remote_download_all.
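
A minimal sketch of that recommendation for a generic blob store that may evict entries (the endpoint is a placeholder):

# sketch: generic HTTP/blob cache that can evict blobs, so download all outputs
build --remote_cache=https://cache.example.com
build --remote_download_all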

acecilia commented 4 months ago

If you are using a generic blob store (which doesn't understand REAPI) and the store can evict blobs, you have to set --remote_download_all.

I see. Could you elaborate a bit why? I would expect is possible to use --remote_download_minimal together with --experimental_remote_cache_eviction_retries, and the following scenario would happen:

  1. Bazel fetches the cache, gathering information about which objects exist in CAS
  2. Bazel caches this information
  3. Then the TTL for some of these objects is reached, and the remote cache removes them
  4. After that, Bazel, relying on its cached knowledge of the objects' existence, tries to fetch the objects
  5. But the objects are not in the cache anymore, thus failing the build
  6. But because of --experimental_remote_cache_eviction_retries being enabled, the build re-starts again
  7. Bazel drops the caches that hold the CAS existence and re-fetches the remote cache: this time the remote cache response does not include the objects that were just recently removed due to TTL
  8. All objects expected by bazel exist in the remote cache, and the build completes successfully

coeuvre commented 4 months ago

7. Bazel drops the caches that hold the CAS existence and re-fetches the remote cache: this time the remote cache response does not include the objects that were just recently removed due to TTL

This is the problem. Since a generic cache doesn't understand the AC or the CAS, while Bazel can drop its in-memory/local cache, the next time it queries the AC the remote cache can still return a cache hit even if the CAS blobs referenced by that AC entry have been evicted. The loop only breaks after the remote cache also evicts the AC entries.

acecilia commented 4 months ago

Thanks! Would it be desirable behaviour that, when --experimental_remote_cache_eviction_retries is enabled, Bazel keeps track of such an outdated AC entry when retrying the build, ignores the cache for it, and re-executes the action, which upon completion would refresh both the AC and the CAS?

coeuvre commented 4 months ago

I can see this might be a nice workaround when connecting Bazel to a generic cache. Can you open a new issue and move the discussion there?

dragos commented 3 months ago

@scaiper Note that REv2 is quite explicit about this:

https://github.com/bazelbuild/remote-apis/blob/7f51b3676b5d19df726783d9a861e45d7389b5ae/build/bazel/remote/execution/v2/remote_execution.proto#L160-L165

Enabling completenessChecking is not a workaround. It's a necessity

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

criemen commented 3 months ago

I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.

This doesn't really vibe with https://bazel.build/remote/caching, which a) advertises the possibility of bringing any HTTP cache, and b) explicitly tells you, for the Google Cloud Storage backend, that you can enable object lifecycles to automatically delete old objects.

Even if b) were not on the web page, it's pretty much implicit in option a): if you bring an HTTP cache, you need to clean it up; you can't have unbounded growth of your cache. So what do you do? Expire the objects that have lived the longest, or use some sort of LRU if your cache is fancier. Bazel ought to be able to handle both, or otherwise it needs to be explicitly stated that these strategies are not supported and that you need to use dedicated Bazel cache software instead.

criemen commented 2 months ago

@coeuvre I'm not sure this is 100% necessary: an HTTP cache like buchgr/bazel-remote does understand the CAS and AC relationships, and it can filter out AC results that reference expired CAS entries. Therefore it'd be great to be able to force-enable BwoB in those cases (because I have to connect through nginx (outside of my control), I can't connect via gRPC to my Bazel cache service instance, but I'd like to continue using BwoB, as nginx doesn't proxy HTTP/2 🙄).
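
A sketch of the setup described above, assuming an HTTP endpoint fronted by nginx (the URL is hypothetical):

# sketch: bazel-remote reached over HTTP(S) through a proxy, keeping BwoB enabled
build --remote_cache=https://bazel-remote.example.com
build --remote_download_toplevel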

coeuvre commented 2 months ago

I will revert the change and explore other possible solutions.

tgeng commented 2 months ago

Will the revert (https://github.com/bazelbuild/bazel/commit/b89978317cf14a7f3d24616a3bf66e79bccb2482) be cherry-picked to 7.4.0 release?

coeuvre commented 2 months ago

The original commit was not cherry-picked into 7.4.0, so a cherry-pick of the revert is not necessary.

coeuvre commented 2 months ago

@bazel-io fork 7.4.0

iancha1992 commented 1 month ago

A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!
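
For example, with Bazelisk the release candidate can be exercised like this (the target pattern is a placeholder):

# run a build against the 7.4.0 release candidate via Bazelisk
USE_BAZEL_VERSION=7.4.0rc1 bazel build //...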

acecilia commented 3 weeks ago

I am testing 7.4.0 RC2. Will update here if I observe build failures. If I don't say anything, consider it a validation that the fix works as expected 😄

acecilia commented 1 week ago

👋 Using 7.4.0rc2 I continue observing errors like:

ERROR: /Users/.../MyTarget/BUILD.bazel:3:16: Bundling, processing and signing MyTarget.__internal__.__test_bundle failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356/3 for bazel-out/ios_sim_arm64-fastbuild-ios-sim_arm64-min13.0-applebin_ios-ST-37c686bf6546/bin/.../OtherTarget.bundle/response.json

My configuration is:

The thing is: I only see the error once in the logs, but I would expect to see it 5 times due to the default value of --experimental_remote_cache_eviction_retries 🤔

coeuvre commented 1 week ago

Since you mentioned the error only appeared once in the log, does it mean the build completed successfully after retries? If not, what's the error message?

acecilia commented 1 week ago

Nope, the build failed after the "Failed to fetch blobs because they do not exist remotely" error. So I am wondering if the default value of 5 for --experimental_remote_cache_eviction_retries is somehow not taking effect 🤔

coeuvre commented 1 week ago

Argh, I think --experimental_remote_cache_eviction_retries only defaults to 5 in 8.x. Can you try explicitly setting the flag?
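
A sketch of setting the flag explicitly on 7.x (the value 5 mirrors the 8.x default mentioned above):

# explicitly enable eviction retries on Bazel 7.x
build --experimental_remote_cache_eviction_retries=5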

acecilia commented 1 week ago

Aha, you are right, that explains it. Thanks! I will set it and continue monitoring 🤞