Error, message length too large: found 7666438 bytes, the limit is: 4194304 bytes

apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

https://datafusion.apache.org/ballista

Apache License 2.0

Error, message length too large: found 7666438 bytes, the limit is: 4194304 bytes #773

Closed andygrove closed 9 months ago

andygrove commented 1 year ago

Describe the bug

I tried running some benchmarks, but some queries fail with this error:

2023-05-14T16:00:52.679602Z  WARN tokio-runtime-worker ThreadId(47) ballista_executor::execution_loop: Executor poll work loop failed. If this continues to happen the Scheduler might be marked as dead. Error: status: OutOfRange, message: "Error, message length too large: found 7666438 bytes, the limit is: 4194304 bytes", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Sun, 14 May 2023 16:00:52 GMT"} }

To Reproduce

Start cluster:

./target/release/ballista-scheduler
./target/release/ballista-executor -c 24

Run TPC-H benchmarks

Expected behavior

Should not fail.

Additional context

yahoNanJing commented 1 year ago

Hi @andygrove, we have hit the same issue. I will propose a PR that adds a config option to make the maximum decoding message size configurable, as a temporary fix.

andygrove commented 9 months ago

I am still running into this error with the latest code.

2023-12-11T14:31:18.347839Z  WARN          task_runner ThreadId(82) ballista_executor::cpu_bound_executor: Spawned task output ignored: receiver dropped    
2023-12-11T14:31:18.484649Z  WARN tokio-runtime-worker ThreadId(45) ballista_executor::execution_loop: Executor poll work loop failed. If this continues to happen the Scheduler might be marked as dead. Error: status: OutOfRange, message: "Error, message length too large: found 7700152 bytes, the limit is: 4194304 bytes", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Mon, 11 Dec 2023 14:31:18 GMT"} }

I am using the default --grpc-server-max-decoding-message-size of 16 MB, but the limit still appears to be 4 MB.
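For scale, the numbers in the logs above can be checked with a small sketch. It is only a model of the size check, not Ballista's or the gRPC library's actual code: `check_message_len` and the constant names are hypothetical, and the `>` comparison is an assumption. It shows that a message of about 7.3 MiB would pass a 16 MiB limit, so the error must come from an endpoint still using the 4 MiB (4194304 byte) limit.

```rust
// Sizes from the thread, expressed in bytes.
const MIB: usize = 1024 * 1024;
const DEFAULT_LIMIT: usize = 4 * MIB; // 4194304, the limit reported in the error
const CONFIGURED_LIMIT: usize = 16 * MIB; // the 16 MB setting mentioned above

/// Hypothetical helper that mimics the shape of the logged error.
/// Assumption: a message is rejected when its length exceeds the limit.
fn check_message_len(len: usize, limit: usize) -> Result<(), String> {
    if len > limit {
        Err(format!(
            "Error, message length too large: found {} bytes, the limit is: {} bytes",
            len, limit
        ))
    } else {
        Ok(())
    }
}
```

Calling `check_message_len(7700152, DEFAULT_LIMIT)` returns an error, while the same length checked against `CONFIGURED_LIMIT` passes.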

andygrove commented 9 months ago

We currently set the decoding max size but not the encoding max size, so perhaps that is the issue. I will test this.

Dandandan commented 9 months ago

We've hit some other errors related to max sizes at our end (Coralogix). We reduced those errors by:

- increasing max size limits (for execution plans and Flight messages (batches))
- reducing batch size (batches can be really big at Coralogix because they contain long strings)
- enabling Flight compression
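To illustrate why reducing the batch size helps when rows contain long strings, here is a minimal sketch. It is not DataFusion's or Ballista's batching code: `max_batch_bytes` is a hypothetical helper that groups rows into fixed-size batches by row count and reports the largest batch's payload in bytes, counting string bytes only.

```rust
/// Hypothetical helper: groups `rows` into batches of at most `batch_size`
/// rows and returns the byte size of the largest batch (string bytes only).
/// Returns 0 when there are no rows.
fn max_batch_bytes(rows: &[String], batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch_size must be at least 1");
    rows.chunks(batch_size)
        .map(|batch| batch.iter().map(|row| row.len()).sum::<usize>())
        .max()
        .unwrap_or(0)
}
```

With eight rows of 1000 bytes each, a batch size of 8 gives one 8000-byte batch, while a batch size of 2 caps every batch at 2000 bytes. Real batches carry additional overhead, so the figures are only indicative.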

Dandandan commented 9 months ago

Some other things we did:

- We have an optimization rule, PruneUnusedPartitions, that removes unused partitions (for the task) before sending it to the executor, as our plans can contain 1000s of partitions.
- AFAIK, one thing we could also do, and don't do yet, is enable (gzip) compression on the gRPC API to reduce the size of the execution plan being sent.

andygrove commented 9 months ago

I confirmed that setting the max encoding size resolves the issue for me.

andygrove commented 9 months ago

We set max encode/decode message size when creating the gRPC servers, but not for the clients, so I ran into this again.
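The reason server-side settings alone are not enough can be sketched with a small model. This is not Ballista's or the gRPC library's API: `Limits` and `can_deliver` are hypothetical names, and the 4 MiB default on both directions is an assumption chosen to match the limit in the error. The idea is that a message must fit under the sender's encoding limit and the receiver's decoding limit, so a client left at its defaults can still reject a large response from a correctly configured server.

```rust
const MIB: usize = 1024 * 1024;

/// Hypothetical model of one gRPC endpoint's message size limits.
#[derive(Clone, Copy)]
struct Limits {
    max_encoding: usize, // largest message this endpoint will send
    max_decoding: usize, // largest message this endpoint will accept
}

impl Limits {
    /// Same limit in both directions.
    fn uniform(limit: usize) -> Self {
        Limits { max_encoding: limit, max_decoding: limit }
    }
}

/// A message is delivered only if the sender may encode it
/// and the receiver may decode it.
fn can_deliver(len: usize, sender: Limits, receiver: Limits) -> bool {
    len <= sender.max_encoding && len <= receiver.max_decoding
}
```

In this model, a 7700152-byte response from a server configured with `Limits::uniform(16 * MIB)` is still rejected by a client at `Limits::uniform(4 * MIB)`, and goes through once the client is raised to 16 MiB as well.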