buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0
595 stars 154 forks source link

Allow setting max received message size #658

Open theoribeiro opened 1 year ago

theoribeiro commented 1 year ago

I'm getting an error whenever trying to upload a cache using pants with the error message below: ResourceExhausted: "grpc: received message larger than max (6909987 vs. 4194304)"

This happens because bazel-remote keeps the default grpc max received message size and does not have a way to configure it. I can confirm that by setting MaxRecvMsgSize to more than 4MB as an option on the server fixes the issue.

Would you be open for a PR adding that as configurable setting?

mostynb commented 1 year ago

Hmm, that's a pretty large grpc message for pants to be sending.

I wonder if bazel-remote were to report 4MB instead of 0/unlimited in CacheCapabilities.MaxBatchTotalSizeBytes, would pants only sent messages under that size limit? If so that might be a better option.

(Also, I wonder if you would consider submitting some documentation for how to setup pants to use bazel-remote?)

theoribeiro commented 1 year ago

Very interesting point, @mostynb. I know that pants is using tonic, a Rust grpc client implementation but as far as I could briefly see from the source code, it doesn't look like they're doing anything exotic in terms of the message sizes.

However, I just tried setting MaxBatchTotalSizeBytes: to 4194304 at server/grpc.go#121 but I still get the same error.

Setting up pants to work with bazel-remote is pretty straight forward. A simple pants.toml config file in your project root will do:

# pants.toml

[GLOBAL]
remote_store_address = "grpc://localhost:9092"
remote_cache_read = true
remote_cache_write = true
mostynb commented 1 year ago

Thanks for the config example.

I think this is something that the client and server should be able to negotiate automatically. The client should query the server's capabilities to get the maximum size for a grpc message it will accept. If the returned result is 0 then the client should choose a reasonable value (no larger than 4Mb).

This feels like a pants bug that should be fixed there, unless I misunderstand what's supposed to happen when the server returns 0. Since you tested the server returning 4Mb explicitly, I'm leaning towards this being a pants bug. In the meantime, pants has an option that you could try using to control this: --remote-store-batch-api-size-limit=<int> https://www.pantsbuild.org/docs/reference-global#remote_store_batch_api_size_limit - does using that work?

theoribeiro commented 1 year ago

I'm leaning towards this being a pants bug. In the meantime, pants has an option that you could try using to control this: --remote-store-batch-api-size-limit= https://www.pantsbuild.org/docs/reference-global#remote_store_batch_api_size_limit - does using that work?

Nope, that doesn't work.

I've had a long conversation in the past few days with the remote cache maintainer for pants and we tried debugging as much as we could. It SEEMS pants is doing everything correctly, It is respecting the remote_store_batch_api_size_limit and using the streaming api instead. We can't find where the ResourceExhausted message is coming from.

I tried setting

$ export GRPC_GO_LOG_VERBOSITY_LEVEL=99
$ export GRPC_GO_LOG_SEVERITY_LEVEL=info

before running bazel-remote to try to see any errors on the server side but it doesn't spit out anything.

What's weird is that when MaxRecvMsgSize is set to the default 4MB, pants doesn't seem to send any packages over the 4MB size when we inspect and decode the traffic with Wireshark.

Would love if you could enlighten us with some ideas on what to look for.

mostynb commented 1 year ago

It sounds like pants is sending a grpc message that is too large, and the grpc library on the bazel-remote side returns a ResourceExhausted error with the received message larger than max message: https://github.com/grpc/grpc-go/blob/master/rpc_util.go#L611

I don't think those GRPCGO* logging/verbosity settings will help much, because the logic you want to debug is on the client side. ie, why is pants sending such a large message in the first place?

Perhaps you could try to save and inspect this large grpc message? Is it possible to extract that from wireshark, or alternatively use some sort of grpc debug proxy to same the message? If you can figure out what kind of message it is, that might help narrow down where the message could be coming from in pants.

If you have a test case that reprodues this error, I could potentially try it out locally.

theoribeiro commented 1 year ago

I just tried forcing tonic to set a maximum message size on the client and I'm still getting the same error. I'll try to setup a repro for you this week. Hopefully you can take a look and try to help me figure out what's going on!

So far as a workaround I've been running a bazel-remote fork setting MaxRecvMsgSize without any further issues.

jasonwbarnett commented 6 months ago

@mostynb I dug deeper into this issue (documented here: https://github.com/pantsbuild/pants/issues/20674). I believe allowing folks to tune the gRPC server is valuable, even if it isn't absolutely necessary; as you pointed out, there are technically other ways to solve this on the client side.

I opened up https://github.com/buchgr/bazel-remote/pull/738 with a few tweaks to the work @theoribeiro did in their fork.