Closed: BalestraPatrick closed this issue 2 months ago
@coeuvre
That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state depends on the specific action type, i.e. if it works for other actions, it should work for CppArchive.
For the missing artifact, how long was it between when it was generated and when Bazel requested to download it? And how long does one invocation of this build usually take?
Is it possible to collect server-side logs of the access patterns for this particular blob, e.g. create time, refresh time, delete time, etc.?
> For the missing artifact, how long was it between when it was generated and when Bazel requested to download it? And how long does one invocation of this build usually take?
I don't think it matters? The first build fails with missing cache item, and it rebuilds the whole build, and the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead the dependent action is simply run again (which is guaranteed to fail).
The generating action is expected to be re-run. Do you have an execution log for the generating action from the second invocation? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it.
Patrick should be able to provide that.
The build log shows only 2 actions though (status.txt and the dependent action). In the gRPC log for the second run there isn't any upload (or any calls at all) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows it wasn't generated locally.
> The build log shows only 2 actions though (status.txt and the dependent action)
This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions.
@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days.
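For anyone else trying to capture the same diagnostics, these are the kinds of flags involved. This is a sketch only: flag names and availability should be verified against your Bazel version, and the output paths are placeholders.

```
# .bazelrc: capture remote-cache diagnostics (verify flags for your version)
build --experimental_remote_grpc_log=/tmp/grpc.log    # raw gRPC traffic to the cache
build --execution_log_binary_file=/tmp/exec_log.bin   # per-action execution log
```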
@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.
> @bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.
cc: @coeuvre for visibility
IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause.
How frequent is the issue now in Bazel 7? Can you capture the logs again given that the code has changed a lot in Bazel 7?
We are also experiencing this on Bazel 7, but it is pretty hard to reproduce; I have not yet managed to reproduce it in a way that I can troubleshoot.
We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritized?
It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4399b85d3bbc39fda2701287e014ecb919.
7.2.0rc2 does not fix the issue for us.
@scaiper How frequently does it happen? Is it possible to share a minimal repro?
@coeuvre We use Buildbarn as a remote cache. Whenever cache eviction happens we see this error. To reproduce just delete CAS from Buildbarn and leave AC intact.
We were able to work around this by enabling this option in Buildbarn.
```proto
// Only return ActionResult messages for which all output files are
// present in the Content Addressable Storage (CAS). Certain
// clients, such as Bazel, require the use of this decorator. To
// reduce latency, it is advised that this decorator is used at the
// lowest level that has a full view of the entire CAS.
//
// This decorator must be placed on the Action Cache.
CompletenessCheckingBlobAccessConfiguration completeness_checking = 17;
```
It looks like Bazel assumes without checking that all outputs of cached actions exist in CAS, which is not the case.
As described in the comment, Bazel requires that the remote cache only return an AC entry when all the blobs it references are present in the CAS. So in your case this is more of a configuration issue than a bug.
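To make that contract concrete, here is a minimal, hypothetical sketch (plain Python dicts standing in for the AC and CAS; this is not Buildbarn's or Bazel's actual code) of what a completeness-checking AC lookup does: a cached ActionResult is only returned if every blob it references is still present in the CAS.

```python
def get_action_result(action_digest, action_cache, cas):
    """Return a cached ActionResult only if all referenced blobs exist.

    action_cache maps action digests to {"output_digests": [...]};
    cas maps blob digests to blob contents. Both are toy stand-ins.
    """
    result = action_cache.get(action_digest)
    if result is None:
        return None  # genuine cache miss
    if any(d not in cas for d in result["output_digests"]):
        # Some outputs were evicted: report a miss instead of handing
        # the client dangling references.
        return None
    return result
```

With this check in place, a partially evicted result degrades into an ordinary cache miss, which the client recovers from by re-executing the action.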
If it is working as designed, it is very unexpected.
> Bazel requires that the remote cache only return an AC entry when all the blobs it references are present in the CAS
Any idea how to make sure this happens when using google cloud?
@scaiper Note that REv2 is quite explicit about this:
Enabling completenessChecking is not a workaround. It's a necessity. Notice how the example configuration in bb-storage's README actually has it enabled. So the issue described in this ticket is essentially "Bazel doesn't work well if my remote cache doesn't behave according to the spec."
Isn't this somewhat incompatible with the idea of artifact eviction in a remote cache?
To my understanding, this means that the cache has to be smarter and understand how to handle the TTL of stored artifacts based on the requirements of the consumer (in this case Bazel).
> Isn't this somewhat incompatible with the idea of artifact eviction in a remote cache?
Nope! Bazel is absolutely free to cache knowledge of which objects exist in the CAS. But for Bazel to acquire this knowledge accurately, it must be able to rely on the remote cache returning references to valid objects as part of GetActionResult.
If the cache doesn’t do that, then Bazel has no way to recover. Namely, if it detects objects have gone missing Bazel would purge its own caches. But while rebuilding it would again call GetActionResult against the same action, and again get references to non-existent CAS objects. Which would cause Bazel to attempt recovery again, which […]
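The loop described above can be made concrete with a toy model (hypothetical Python, not Bazel's actual code): a spec-violating cache keeps returning an ActionResult whose outputs were evicted, so every retry sees the same dangling references and no amount of retrying helps.

```python
def try_build(action_cache, cas, action, retries):
    """Toy retry loop: returns (outcome, attempt_index)."""
    for attempt in range(retries + 1):
        result = action_cache.get(action)
        if result is None:
            # Cache miss: re-execute the action, repopulating CAS and AC.
            cas["out"] = b"rebuilt"
            action_cache[action] = {"output_digests": ["out"]}
            return ("executed", attempt)
        if all(d in cas for d in result["output_digests"]):
            return ("cache_hit", attempt)
        # Outputs evicted: the client purges its local state and retries,
        # but the remote AC entry is unchanged, so the next attempt sees
        # the exact same stale hit.
    return ("failed", retries)
```

A stale AC entry pointing at an evicted blob fails no matter how many retries are allowed; if completeness checking had turned the stale hit into a miss, the very first attempt would re-execute the action instead.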
> Bazel is absolutely free to cache knowledge of which objects exist in the CAS
I have very limited knowledge of how this works on the Bazel side 😞 But given what you mention above, this scenario would happen:
> If the cache doesn't do that, then Bazel has no way to recover. Namely, if it detects objects have gone missing Bazel would purge its own caches
Isn't this what --experimental_remote_cache_eviction_retries tries to recover from? That is, if a cache error is detected, the build is restarted, with the idea that the remote cache will provide the information again, this second time letting Bazel know that the objects no longer exist, thus allowing it to handle the situation gracefully and complete the build.
> GetActionResult against the same action, and again get references to non-existent CAS objects
I believe this would be a bug in the cache: the cache is telling a consumer that an object exists, when it does not 🤔
> GetActionResult against the same action, and again get references to non-existent CAS objects
>
> I believe this would be a bug in the cache: the cache is telling a consumer that an object exists, when it does not 🤔
Exactly! And that’s what happens if you run Buildbarn’s bb_storage daemon without CompletenessChecking. So just make sure to enable that.
I'm using Google Cloud; I am not aware of any configuration related to this :/
> I'm using Google Cloud; I am not aware of any configuration related to this :/
Are you using RBE in Google Cloud, or GCS for the remote cache?
For RBE, I believe it always does the completeness check. For GCS, it can't, since it's just a blob storage service that doesn't understand the relations between blobs.
I am using GCS.
> For GCS, it can't, since it's just a blob storage service that doesn't understand the relations between blobs.
This is exactly what I mentioned here. The cache cannot be a generic cache: it needs to be smart in order to handle Bazel's requirements.
If you are using a generic blob store (which doesn't understand REAPI) and the store can evict blobs, you have to set --remote_download_all.
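As a concrete example, a setup against a generic blob store might look like this in .bazelrc. This is a hypothetical sketch; the bucket URL is a placeholder.

```
# .bazelrc: generic HTTP/GCS blob store that may evict blobs independently
# of AC entries, so download every output instead of building without the bytes.
build --remote_cache=https://storage.googleapis.com/my-bazel-cache  # placeholder
build --remote_download_all
```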
> If you are using a generic blob store (which doesn't understand REAPI) and the store can evict blobs, you have to set --remote_download_all.
I see. Could you elaborate a bit on why? I would expect it to be possible to use --remote_download_minimal together with --experimental_remote_cache_eviction_retries, and the following scenario would happen:
With --experimental_remote_cache_eviction_retries enabled, the build restarts again. Bazel drops the caches that hold the CAS existence knowledge and re-fetches from the remote cache: this time the remote cache response does not include the objects that were just removed due to TTL.
This is the problem. Since a generic cache doesn't understand the AC or the CAS, while Bazel can drop its in-memory/local cache, the next time it queries the AC the remote cache can still return a cache hit even if the blobs referenced by that AC entry have been evicted from the CAS. The loop only breaks after the remote cache evicts the AC entries themselves.
Thanks! Would it be desirable behaviour that, when --experimental_remote_cache_eviction_retries is enabled, Bazel keeps track of such outdated AC entries when retrying the build, ignores the cache for them, and re-executes the action, which upon completion would then refresh both the AC and the CAS?
I can see this might be a nice workaround when connecting Bazel to a generic cache. Can you open a new issue and move the discussion there?
> @scaiper Note that REv2 is quite explicit about this: Enabling completenessChecking is not a workaround. It's a necessity.
I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.
> I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec.
This doesn't really square with https://bazel.build/remote/caching, which a) advertises the possibility of bringing any HTTP cache, and b) explicitly tells you, for the Google Cloud Storage backend, that you can enable object lifecycles to automatically delete old objects.
Even if b) were not on the web page, it's pretty much implicit in option a): if you bring an HTTP cache, you need to clean it up, since you can't have unbounded growth of your cache. So what do you do? Expire the objects that have lived the longest, or use some sort of LRU if your cache is fancier. Bazel ought to be able to handle both, or otherwise it needs to be explicitly stated that these strategies are not supported and that you need to use special Bazel-aware cache software instead.
@coeuvre I'm not sure this is 100% necessary. An HTTP cache like buchgr/bazel-remote does understand CAS and AC relationships and can filter out AC results that reference expired CAS blobs. Therefore it'd be great to be able to force-enable BwoB in those cases (because I have to connect through nginx (outside of my control), I can't connect via gRPC to my bazel cache service instance, but I'd like to continue using BwoB; nginx doesn't proxy HTTP/2 🙄).
I will revert the change and explore other possible solutions.
Will the revert (https://github.com/bazelbuild/bazel/commit/b89978317cf14a7f3d24616a3bf66e79bccb2482) be cherry-picked to 7.4.0 release?
The original commit is not cherry-picked to 7.4.0 so cherry-pick for the revert is not necessary.
@bazel-io fork 7.4.0
A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.0rc1. Thanks!
I am testing 7.4.0 RC2. Will update here if I observe build failures. If I don't say anything, consider it a validation that the fix works as expected 😄
👋 Using 7.4.0rc2
I continue observing errors like:
```
ERROR: /Users/.../MyTarget/BUILD.bazel:3:16: Bundling, processing and signing MyTarget.__internal__.__test_bundle failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356/3 for bazel-out/ios_sim_arm64-fastbuild-ios-sim_arm64-min13.0-applebin_ios-ST-37c686bf6546/bin/.../OtherTarget.bundle/response.json
```
My configuration is:
- --experimental_remote_cache_eviction_retries is not set, so it defaults to 5
- --remote_download_toplevel
The thing is: I only see the error once in the logs, but I would expect to see it 5 times due to the default value of --experimental_remote_cache_eviction_retries 🤔
Since you mentioned the error only appeared once in the log, does it mean the build completed successfully after retries? If not, what's the error message?
Nope, the build failed after the "Failed to fetch blobs because they do not exist remotely" error. So I am wondering if the default value of 5 for --experimental_remote_cache_eviction_retries is somehow not real 🤔
Argh, I think --experimental_remote_cache_eviction_retries only defaults to 5 in 8.x. Can you try explicitly setting the flag?
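In other words, on 7.x the flag needs to be set explicitly, e.g. in .bazelrc (a sketch):

```
# .bazelrc: on Bazel 7.x --experimental_remote_cache_eviction_retries does not
# default to 5, so opt in explicitly alongside BwoB.
build --remote_download_toplevel
build --experimental_remote_cache_eviction_retries=5
```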
Aha, you are right, that explains it. Thanks! I will set it and continue monitoring 🤞
Description of the bug:
Hello!
We have used --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel during the last few weeks, and we noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation fails with the exact same error. This is not always the case (we have seen it recover correctly in some situations), but for one specific CppArchive action, we see it failing in about 1% of our builds. The logs look like the following:
The only way to recover from this failure seems to be to run bazel clean or to disable BwoB (--remote_download_toplevel).
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
In the gRPC log for the second invocation, we see the following:
Which operating system are you running Bazel on?
macOS
What is the output of bazel info release?
6.2.0 @ 286306e
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response