buildbarn / bb-storage

Storage daemon, capable of storing data for the Remote Execution protocol
Apache License 2.0

Failed to fetch file errors in "builds without the bytes" builds in a sharded setup #169

Closed ferraith closed 1 year ago

ferraith commented 1 year ago

Dear Buildbarn developers,

for some time we have been seeing "Failed to fetch file with hash ..." errors thrown by Bazel whenever a build is run with "builds without the bytes" enabled.

We spent some time looking for the root cause in our sharded Buildbarn setup and came across an old statement from @EdSchouten that completeness checking should be disabled in sharded setups: https://github.com/buildbarn/bb-storage/pull/33#issuecomment-566459358. Maybe we interpreted this statement incorrectly, or completeness checking with sharding works without issues in the meantime. Either way, we believe completeness checking is required by Bazel, so simply disabling it is not an option for us.

Our Buildbarn setup currently looks like this: we distribute both the AC and the CAS over several bb-storage instances with local backends. Below is a snippet of our configuration showing the sharding:

contentAddressableStorage: {
  backend: {
    sharding: {
      ...
      shards: [
        ...
        {
          backend: {
            grpc: {
              address: "<some address>",
            },
          },
        },
        ...
      ],
    },
  },
},
actionCache: {
  backend: {
    completenessChecking: {
      backend: {
        sharding: {
          ...
          shards: [
            ...
            {
              backend: {
                grpc: {
                  address: "<some address>",
                },
              },
            },
            ...
          ],
        },
      },
      maximumTotalTreeSizeBytes: 64 * 1024 * 1024,
    },
  },
}

We also had a look at the bb-storage source code, and we suspect the following change increased the number of occurrences of the "failed to fetch file" issue, because with an older bb-storage release we rarely saw the error: https://github.com/buildbarn/bb-storage/commit/1b84fa824dc60e77776ce50e05c549fdf20c089b

Could a developer give us feedback on whether our sharded setup is misconfigured, or whether a bug was introduced with the linked change?

EdSchouten commented 1 year ago

We spent some time looking for the root cause in our sharded Buildbarn setup and came across an old statement from @EdSchouten that completeness checking should be disabled in sharded setups

No, you should make sure that completenessChecking is enabled at the lowest level that has a complete view of the CAS. So in a sharded setup, you don't want to enable it at the individual storage node, but you do want to enable it on the frontends. Only those are capable of accurately performing these checks.
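
For contrast, here is a sketch of what one of the individual shard nodes might look like under this advice, in the same style as the snippets above and with the local backend details elided; it is an illustration, not a configuration taken from this setup:

contentAddressableStorage: {
  backend: {
    'local': {
      // This node only holds its own slice of the CAS.
      ...
    },
  },
},
actionCache: {
  backend: {
    // No completenessChecking here: a single shard cannot tell whether a
    // referenced blob lives on one of the other shards, so the check is
    // only meaningful on the frontend.
    'local': {
      ...
    },
  },
},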

ferraith commented 1 year ago

Thanks for your feedback @EdSchouten. Okay, then I think our config is correct. The snippet in my first comment shows the config of one of our frontends, which has a complete view of the CAS.

With completeness checking enabled, is it guaranteed that all cleaned-up files are detected?

Could it be, for instance, that the completeness check succeeds, but some files are cleaned up after the check finishes and before all files have been fetched? Some kind of race condition?

At the moment we have no idea how to isolate the root cause, but we assume the linked change has some impact.

EdSchouten commented 1 year ago

CompletenessCheckingBlobAccess is implemented by calling FindMissingBlobs() against the CAS for each of the objects referenced by the ActionResult. The Remote Execution spec has the following to say:

  // Determine if blobs are present in the CAS.
  //
  // Clients can use this API before uploading blobs to determine which ones are
  // already present in the CAS and do not need to be uploaded again.
  //
  // Servers SHOULD increase the lifetimes of the referenced blobs if necessary and
  // applicable.
  //
  // There are no method-specific errors.
  rpc FindMissingBlobs(FindMissingBlobsRequest) returns (FindMissingBlobsResponse) {
    option (google.api.http) = { post: "/v2/{instance_name=**}/blobs:findMissing" body: "*" };
  }

Notice this sentence in particular:

  // Servers SHOULD increase the lifetimes of the referenced blobs if necessary and
  // applicable.

This is exactly what LocalBlobAccess implements: it refreshes objects close to expiration when requested through FindMissingBlobs() (and through Read() for what it's worth). This means that if you're running into issues where GetActionResult() returns results with dangling references, it most likely means that your CAS is too small, causing LocalBlobAccess to discard data too quickly.

ferraith commented 1 year ago

Thanks for your detailed explanation. So basically this means that our worst-case retention time must currently be smaller than the build time of our builds-without-the-bytes jobs, i.e. data referenced at the start of a build is already evicted before it is actually fetched. I'm not absolutely sure about that, though, because we have only been seeing this behavior since updating to the latest bb-storage release, and I can't prove it yet.

ferraith commented 1 year ago

Hello @EdSchouten, over the last two weeks we spent a lot of time trying to find the root cause of the failed-to-fetch-file errors. We set up a test environment to reproduce the error in isolation. Sadly we haven't found the root cause yet, but I can pass on some new information.

We use relatively large shards (2400 GB) based on the local backend (LocalBlobAccess), with a block configuration of the shape sketched below.

From the description of the LRU-like algorithm implemented in LocalBlobAccess, we assume this is a proper block configuration.
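
For context, a local backend block configuration of this shape looks roughly like the following sketch. The field names follow bb-storage's LocalBlobAccessConfiguration as used in the bb-deployments examples; the paths, sizes, and block counts are illustrative rather than the values used in the setup above:

'local': {
  keyLocationMapOnBlockDevice: {
    file: {
      path: '/storage-cas/key_location_map',
      sizeBytes: 16 * 1024 * 1024,
    },
  },
  keyLocationMapMaximumGetAttempts: 8,
  keyLocationMapMaximumPutAttempts: 32,
  // Data blocks rotate from "new" (receiving writes) to "current" to "old";
  // objects sitting in old blocks are refreshed when they are still being
  // requested, which is what gives the backend its LRU-like behavior.
  oldBlocks: 8,
  currentBlocks: 24,
  newBlocks: 3,
  blocksOnBlockDevice: {
    source: {
      file: {
        path: '/storage-cas/blocks',
        // Roughly matching the 2400 GB shard size mentioned above.
        sizeBytes: 2400 * 1024 * 1024 * 1024,
      },
    },
    spareBlocks: 3,
  },
},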

In our test environment we found that these errors also happen when the CAS isn't fully filled with data (e.g. only filled up to 10%). They don't start immediately after the CAS has been erased, but once the CAS contains some data, Bazel starts reporting this error. Because of this we assume that the LRU algorithm isn't the root cause of the issue.

Can you think of another possible root cause that is not related to the LRU behavior?

EdSchouten commented 1 year ago

Could it be the case your key-location map is too small?

ferraith commented 1 year ago

Hello @EdSchouten, thanks for this hint! Indeed, after resizing the key-location map and doing more stress testing, I can confirm that our key-location map was definitely too small.
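
For future readers: the key-location map is the hash table that maps digests to locations inside the data blocks. Roughly speaking, if it has too few slots relative to the number of objects the blocks can hold, entries for objects that are still physically present can be displaced by newer insertions, so a blob that just passed the completeness check may become unreachable shortly afterwards, which matches the dangling-reference symptom described above. A sizing sketch, assuming the keyLocationMapOnBlockDevice form of the local backend shown earlier; the path and numbers are again purely illustrative:

'local': {
  keyLocationMapOnBlockDevice: {
    file: {
      path: '/storage-cas/key_location_map',
      // Grown so that the map has comfortably more slots than the number of
      // objects the data blocks can hold at once.
      sizeBytes: 1024 * 1024 * 1024,
    },
  },
  keyLocationMapMaximumGetAttempts: 8,
  keyLocationMapMaximumPutAttempts: 32,
  ...
},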