buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0
608 stars 157 forks source link

AC cache misses when action result exists on S3 backend #767

Open nmattia opened 3 months ago

nmattia commented 3 months ago

I have a bazel-remote v2.4.3 instance that serves 404 on Action Results that exists in its configured S3 backend. Other instances connected to the same S3 backend, with the same configuration, are able to serve those Action Results.

One action cache entry that surfaces the issue is 6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560. I've tried querying the bazel-remote instance with curl to reproduce the issue, and the implementation seems close enough to the gRPC implementation for this to be relevant (both go through GetValidatedActionResult). I'm definitely seeing the same behavior when going through gRPC/bazel.

This is the request I'm sending:

GET /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

The first log line I see appears to be a hit on the S3 backend for that entry:

11:25:28 S3 DOWNLOAD bucket ac/6e/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560 OK

However the line after that suggests the artifact is missing and bazel-remote returned a 404 (curl also shows a 404):

11:25:28  GET 404 42.42.42.42 /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

What's slightly surprising is that the second time I run the curl command, I do not see the instance querying the S3 backend, and instead logs a 404 directly:

11:34:54  GET 404 42.42.42.42 /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

I can get the AC entry from another instance connected to the same backend:

curl <other instance>/ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560 | protoc --decode_raw
protobuf output 2 { 1: "package.cache" 2 { 1: "903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa" 2: 2693 } 4: 1 } 2 { 1: "libHSprimitive-0.8.0.0.a" 2 { 1: "6d7227f669ed2c506033c2a5c4556b387b125f58d77c80770c97819acfc40a20" 2: 1170436 } 4: 1 } ... 6 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } 8 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } ... }

and for each CAS blob returned by the (working) instance, I'm able to fetch them from the original/failing instance:

GET /cas/903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa
-> 200

and yet the /ac/ call still returns 404, even after all the CAS entries were curled.

Any idea what might be happening here? The instance hasn't been upgraded to 2.4.4 in case something needs to be reproduced, though happy to close this issue if this has been fixed in 2.4.4.

mostynb commented 3 months ago

Hmm, interesting. I don't think any of the changes between 2.4.3 and 2.4.4 would be relevant to this.

On the problematic bazel-remote instance, are you saying that you request this AC entry with curl, it shows the S3 DOWNLOAD OK log line, then the next log line is a GET 404 for the same AC entry?

Are you running this bazel-remote instance with --max_blob_size or --max_proxy_blob_size and if so do you know if this AC entry is larger than either of those values?

nmattia commented 3 months ago

you request this AC entry with curl, it shows the S3 DOWNLOAD OK log line, then the next log line is a GET 404 for the same AC entry?

Yes, that's it! and when requesting the same AC entry it later on, there's no S3 line, it directly logs GET 404.

Are you running this bazel-remote instance with --max_blob_size or --max_proxy_blob_size and if so do you know if this AC entry is larger than either of those values?

We run with the following:

--max_size=200 --experimental_remote_asset_api=true --enable_endpoint_metrics

do note that other instances with the same configuration are able to serve the AC entry fine. Also, the AC entry is ~5MiB.

mostynb commented 3 months ago

On the first request, the AC entry is downloaded from S3 and stored in the local disk cache. The second request gets the AC entry directly from the local disk cache.

If you're getting "not found" then I suspect there's a validation failure of the entry. If the entry was validated successfully, then I think you would see log lines for each of the CAS blobs that are checked. I will try to improve the logging in this case, I might have a test build for you to try tomorrow.

nmattia commented 3 months ago

I might have a test build for you to try tomorrow.

Thanks! if additionally we can migrate without wiping the disk cache that would be ideal, as we're more likely to be able to reproduce the issue with that AC entry

mostynb commented 3 months ago

I haven't been found any interesting places to add extra logging yet, unfortunately. Are you using the http or grpc interface?

Can you try upgrading the problematic instance to v2.4.4? You should not need to wipe the disk cache when doing this.

nmattia commented 3 months ago

Are you using the http or grpc interface?

Using grpc with Bazel but I used the HTTP interface to reproduce the issue with curl. It looked like both were similar enough that debugging on the HTTP interface would help eventually resolve the grpc issue.

Can you try upgrading the problematic instance to v2.4.4? You should not need to wipe the disk cache when doing this.

Will do! I'll report back and let you know if the problem persists. Thanks for the quick feedback!

nmattia commented 3 months ago

@mostynb Updated to 2.4.4 and still seeing the issue. Here are the logs for an instance showing the issue (note that this is now happening on a different artifact, on a different instance, and most of the instances still work as expected):

2024/08/15 12:08:33 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:08:33 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:20:00 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:20:00 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:21:22 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:21:56 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:23:52 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:27:14 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:27:26 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:27:26 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:27:48 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:43:18 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 13:40:51 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 13:41:24 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND

Note how at 2024/08/15 12:20:00 there's a successful S3 DOWNLOAD but then we get NOT FOUND. There's something odd here though, I'm surprised that both a CAS and an AC entries have the same hash, but maybe that something Bazel does? Can't be a coincidence though

mostynb commented 3 months ago

By convention bazel uses the hash of the Action (the command to run) as the lookup key for the ActionResult that it creates. Bazel is probably uploading the Action for debugging purposes (or maybe for consistency with the remote execution setup), even though it is not strictly required when only using remote caching. This probably explains why you see the same hash used for both an AC entry and a CAS entry.

There are two relevant functions to look at, diskCache.GetValidatedActionResult and diskCache.get.

The calling code looks at the values returned by GetValidatedActionResult, if the *pb.ActionResult is nil and the error is nil then a "not found" log is emitted. That happens in these places: 1) If the ActionResult was not found, or the size returned by diskCache.Get was <= 0. We know the action result was found, because we got it from the S3 backend. But maybe diskCache.Get erroneously returned a non-positive size here? https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L777 2) If one of the OutputDirectories was not found. https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L810 3) If findMissingCasBlobsInternal found missing blobs https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L860

I added some extra debug logs to the places above in this branch in my fork: https://github.com/mostynb/bazel-remote/tree/issue_767_debugging Can you make a build, and try this out? Do you see any of the new warnings in the logs when this error occurs?

If not, we can look into diskCache.Get: diskCache.get (called by the diskCache.Get wrapper) returns a nil io.ReadCloser and nil error, then the caller treats that blob as not found. This happens in these places: 1) If the blob is not in the disk cache, and there is no backend proxy. This can't be the case, because you are using a backend proxy. https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L623 2) If the blob was not in the disk cache, and not in the backend proxy. I don't think this would be the case, because we don't see any other S3 logs besides the ActionResult (or did you not include them in the report?) https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L634 3) If the blob was found in the proxy, but it was larger than the maximum proxy blob size- do you set the --max_proxy_blob_size flag? https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L638 4) If there was a size mismatch between the blob found in the proxy backend and the expected size https://github.com/buchgr/bazel-remote/blob/67e0fea647e79df70ac9071683323ece730858f8/cache/disk/disk.go#L642

nmattia commented 3 months ago

Thanks a lot! I'm seeing a lot of WARNING 3 though it may be another issue. When reproducing this particular issue, I see a WARNING 2.

Here's what's happening:

instance-19 12:29:11 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-19 12:29:11 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-07 12:29:23 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-07 12:29:23 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-10 12:29:23 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-10 12:29:23 GRPC BYTESTREAM WRITE COMPLETED: uploads/786723c7-6baf-4487-b8e3-310a7f5c2d53/compressed-blobs/zstd/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6/148
instance-10 12:29:23 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:29:23 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:29:23 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-24 12:30:12 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-24 12:30:12 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-10 12:30:19 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-10 12:30:19 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:30:31 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:30:31 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-23 12:30:31 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-23 12:30:31 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:30 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:30 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-13 12:31:35 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-13 12:31:35 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:40 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:45 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-31 12:31:49 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-31 12:31:49 WARNING 2: ActionResult b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 has a missing CAS object in OutputDirectories b431ba3377fc1d2bbd2eef8f448ebeae50a5b8d47a317c6e0d7556d11f48c834
instance-31 12:31:49 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-06 12:32:01 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-06 12:32:01 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-27 12:32:01 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-27 12:32:01 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-22 12:34:20 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-22 12:34:20 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK

All instances have the exact same configuration:

--max_size=200 --experimental_remote_asset_api=true --enable_endpoint_metrics

Let me know if you need more logs/info/etc!