facebook / buck2

Build system, successor to Buck
https://buck2.build/
Apache License 2.0

Materializer chokes on uploading action cache artifacts that exceed default 4MB gRPC limit #170

Closed: thoughtpolice closed this issue 1 year ago

thoughtpolice commented 1 year ago

If you follow along with the saga in https://github.com/thoughtpolice/buck2-nix/issues/12, I have almost got remote execution working on buildbarn with Nix! Except that the tarball I download seems to be a tad bit too large (formatted for legibility in the github UI):

Action failed: prelude//toolchains:nixpkgs-overlay-rust (download_tarball https://github.com/oxalica/rust-overlay/archive/1373567ffd13719f6b7522737b010bfc514d49b4.tar.gz)

Internal error (stage: download): action_digest=c8b86439a7e637e7c7d98cf2addf516a771ed82f:93:
Failed to declare in materializer: status: InvalidArgument, message:

"Attempted to read a total of at least 23508808 bytes, while a maximum of 16777216 bytes is permitted", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

I can see that the action itself completed successfully in the buildbarn action cache UI, and that the file downloaded (given in the error above, from rust-overlay) is 23508808 bytes in size:

[screenshot: Buildbarn action cache UI showing the completed action and its 23508808-byte output]

I'm not sure how to approach this. Maybe it's a gRPC limit? The CacheCapabilities don't exactly specify anything about this:

austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' build.bazel.remote.execution.v2.Capabilities/GetCapabilities
{
  "cacheCapabilities": {
    "digestFunctions": [
      "MD5",
      "SHA1",
      "SHA256",
      "SHA256TREE",
      "SHA384",
      "SHA512"
    ],
    "actionCacheUpdateCapabilities": {
      "updateEnabled": true
    },
    "symlinkAbsolutePathStrategy": "ALLOWED"
  },
  "executionCapabilities": {
    "digestFunction": "SHA256",
    "execEnabled": true,
    "executionPriorityCapabilities": {
      "priorities": [
        {
          "minPriority": -2147483648,
          "maxPriority": 2147483647
        }
      ]
    },
    "digestFunctions": [
      "MD5",
      "SHA1",
      "SHA256",
      "SHA256TREE",
      "SHA384",
      "SHA512"
    ]
  },
  "lowApiVersion": {
    "major": 2
  },
  "highApiVersion": {
    "major": 2
  }
}
thoughtpolice commented 1 year ago

The round 16 MiB limit was curious, so I poked around, and I was able to work around this error by increasing the gRPC message size limit in buildbarn, it looks like:

diff --git a/buck/nix/bb/docker/config/common.libsonnet b/buck/nix/bb/docker/config/common.libsonnet
--- a/buck/nix/bb/docker/config/common.libsonnet
+++ b/buck/nix/bb/docker/config/common.libsonnet
@@ -38,7 +38,7 @@
   },
   browserUrl: 'http://localhost:7984',
   httpListenAddress: ':80',
-  maximumMessageSizeBytes: 16 * 1024 * 1024,
+  maximumMessageSizeBytes: 64 * 1024 * 1024,
   global: {
     diagnosticsHttpServer: {
       listenAddress: ':9980',

Surely there's a better way in the reapi to download large action cache blobs incrementally without bloating the individual gRPC message limit, though...

krallin commented 1 year ago

Hmm, I will double check but I don’t recall the CAS GRPC API exposing range reads so it’s possible that raising this limit is indeed the right approach.


thoughtpolice commented 1 year ago

I tripped up on something else now too: the message decoding size limit. It looks like it's trying to download the entirety of the output artifact, which is about 160MB, while the default gRPC message limit in tonic is 4MB(!) A bump of tonic to 0.9.0+ is needed to fix that, I think:

Action failed: prelude//toolchains/rust:rust-stable (nix_build rust-stable-rust-stable.nix)

Internal error (stage: remote_upload_error): Remote Execution Error (GRPC-SESSION-ID):
RE: upload: status: ResourceExhausted, message: "grpc: received message larger than max (165245018 vs. 4194304)", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

Testing a patch for that now, possibly...
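For scale, the failing transfer can be compared against the default limit directly. A quick sketch in plain Python, using the numbers from the error above (the 2 MiB chunk size is an arbitrary illustrative choice, not anything buck2 or tonic actually uses):

```python
# Sizes taken from the error message above.
DEFAULT_GRPC_LIMIT = 4 * 1024 * 1024       # tonic's default max message size (4194304)
ARTIFACT_SIZE = 165245018                  # the rust-stable output artifact

# A streaming upload splits the blob into messages that each fit under the
# limit; leaving headroom for the protobuf envelope is a common precaution.
CHUNK_SIZE = 2 * 1024 * 1024               # 2 MiB per chunk (illustrative)

chunks = -(-ARTIFACT_SIZE // CHUNK_SIZE)   # ceiling division
print(ARTIFACT_SIZE > DEFAULT_GRPC_LIMIT)  # True: one message would be rejected
print(chunks)                              # 79 streamed messages instead
```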

thoughtpolice commented 1 year ago

Ah, it's actually a BuildBarn error, I think. The problem isn't downloading (which the decoding-size bump would fix), it's uploading: https://grep.app/search?q=grpc%3A%20received%20message%20larger%20than%20max

So I think the problem is actually writing smaller blobs to the store, not reading things out of it...

thoughtpolice commented 1 year ago

Ah, right. So the spec says https://github.com/thoughtpolice/reapi-server/blob/eae2e3a51bb8053ae7fe290fb1e9fedb462b651a/protos/build/bazel/remote/execution/v2/remote_execution.proto#L205-L209

// For small file uploads the client should group them together and call
// [BatchUpdateBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchUpdateBlobs].
//
// For large uploads, the client must use the
// [Write method][google.bytestream.ByteStream.Write] of the ByteStream API.

So it instead needs to write content to google.bytestream.ByteStream, which you can verify exists on BuildBarn:

austin@GANON:~/src/reapi-server$ grpcurl -plaintext '127.0.0.1:8980' describe | grep 'is a service'
build.bazel.remote.execution.v2.ActionCache is a service:
build.bazel.remote.execution.v2.Capabilities is a service:
build.bazel.remote.execution.v2.ContentAddressableStorage is a service:
build.bazel.remote.execution.v2.Execution is a service:
google.bytestream.ByteStream is a service:
grpc.health.v1.Health is a service:
grpc.reflection.v1alpha.ServerReflection is a service:
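To sketch what that means for a client: a ByteStream upload turns one blob into a sequence of Write messages. Below is a plain-Python illustration, with dicts standing in for the real google.bytestream.WriteRequest protobufs; the `{instance}/uploads/{uuid}/blobs/{hash}/{size}` resource-name layout follows the REAPI convention, and the chunk size is an arbitrary choice:

```python
import hashlib
import uuid

def bytestream_write_requests(blob: bytes, instance: str = "",
                              chunk_size: int = 2 * 1024 * 1024):
    """Yield dicts shaped like google.bytestream.WriteRequest messages.

    The resource name is only set on the first request, write_offset
    advances by the bytes already sent, and finish_write marks the
    final chunk of the stream.
    """
    digest = hashlib.sha256(blob).hexdigest()
    resource = (f"{instance}/uploads/{uuid.uuid4()}"
                f"/blobs/{digest}/{len(blob)}").lstrip("/")
    if not blob:
        # An empty blob still needs one request to signal completion.
        yield {"resource_name": resource, "write_offset": 0,
               "data": b"", "finish_write": True}
        return
    for offset in range(0, len(blob), chunk_size):
        chunk = blob[offset:offset + chunk_size]
        yield {
            "resource_name": resource if offset == 0 else "",
            "write_offset": offset,
            "data": chunk,
            "finish_write": offset + chunk_size >= len(blob),
        }

# Example: a 5 MiB blob becomes three requests under a 2 MiB chunk size.
reqs = list(bytestream_write_requests(b"\x00" * (5 * 1024 * 1024)))
print(len(reqs))                 # 3
print(reqs[-1]["finish_write"])  # True
```

In a real client each yielded message would be fed into the streaming Write RPC; this only shows the chunking and resource-naming bookkeeping the spec asks for.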
krallin commented 1 year ago

It does look like we should use this API, yeah (for reads & writes); the RE bits are arguably MVP-level right now since we do not use them internally.

Is that something you'd want to do yourself, since you've been debugging it? Otherwise I'll try to see if I can find the time.

The code for all this is in remote_execution/oss/re_grpc/src/client.rs. This will also require adding the ByteStream service protobuf definitions at remote_execution/oss/re_grpc_proto/proto/google/bytestream/bytestream.proto

thoughtpolice commented 1 year ago

Yeah, I'll try giving it a shot, since the code looks fairly isolated.

benbrittain commented 1 year ago

I just "solved" this on the buildbarn side by setting maximumReceivedMessageSizeBytes under the gRPC server section (note: not the same as maximumMessageSizeBytes).

It's rather slow, so not a total fix, but it's something.
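For anyone else hitting this, an illustrative jsonnet fragment of where that setting would live; the surrounding keys and the 64 MiB value are assumptions, adapt them to your actual Buildbarn frontend config:

```jsonnet
// Illustrative fragment: raising the per-listener receive limit on the
// gRPC server entry (distinct from the top-level maximumMessageSizeBytes).
{
  grpcServers: [{
    listenAddresses: [':8980'],
    // Allow individual inbound messages up to 64 MiB.
    maximumReceivedMessageSizeBytes: 64 * 1024 * 1024,
  }],
}
```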

benbrittain commented 1 year ago

#193 has been merged, which should address this.