apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

DataflowRunner does not scale when reading gzip file #19373

Open kennknowles opened 2 years ago

kennknowles commented 2 years ago

Hi,

I have a pipeline that uses ReadFromText() to read a 700 MB gzip (.gz) file from a GCS bucket.

It then parses the JSON, creates BigQuery rows, and writes them with WriteToBigQuery.
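For readers who want to reproduce the setup, the pipeline described above might look roughly like the sketch below. It is not the reporter's actual code: the field names (`id`, `name`), the schema string, and the `input_path` and `table` arguments are all hypothetical placeholders. The Beam imports are kept inside `run()` only so that the parsing helper can be tried without Beam installed.

```python
import json


def to_bq_row(line: str) -> dict:
    """Parse one JSON line and keep only the columns of the (hypothetical) table."""
    record = json.loads(line)
    return {"id": record.get("id"), "name": record.get("name")}


def run(input_path: str, table: str, pipeline_args=None) -> None:
    """Read text (gzip is detected from the .gz extension), parse, and load into BigQuery.

    input_path: e.g. a gs:// path to the .gz file (placeholder, not from the issue).
    table: BigQuery table spec such as "project:dataset.table" (placeholder).
    """
    # Imported here so that to_bq_row can be exercised without Apache Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "ParseJson" >> beam.Map(to_bq_row)
            | "Write" >> beam.io.WriteToBigQuery(
                table,
                schema="id:INTEGER,name:STRING",  # assumed schema, for illustration only
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Calling `run(...)` with `--runner=DataflowRunner` and the usual project, region, and temp location arguments in `pipeline_args` would submit the job to Dataflow; those values are left to the reader.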

The pipeline above does not scale. If I specify 2 workers at startup, the job is scaled down to 1 worker and the throughput remains the same. The job takes 30 minutes.

By contrast, the exact same pipeline, reading the same data as an uncompressed 11 GB file from the same location, scales very well. That job takes only 5 minutes.

Imported from Jira BEAM-7094. Original Jira may contain additional context. Reported by: moander2.

mareksuscak commented 11 months ago

Cloud Storage does not support range requests when files are transcoded using its built-in on-the-fly transcoding feature. I researched this a while ago and, although I am no longer 100% sure, I recall concluding that this was the main culprit. Plain text files are likely splittable because individual Beam workers can use range requests to fetch just a part of the file.
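The difference between the two cases can be illustrated locally with the standard library alone. The sketch below is not Beam's or Cloud Storage's implementation. It only shows, on made-up records, that a newline-delimited text file can be divided into byte ranges that are parsed independently, whereas a gzip stream cannot be decoded starting from an arbitrary offset. The helper names `read_plain_range` and `try_gzip_from_offset` are invented for this example.

```python
import gzip
import json
import zlib

# Made-up newline-delimited JSON records standing in for the real input file.
records = [{"id": i, "name": f"user-{i}"} for i in range(1000)]
plain = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")
compressed = gzip.compress(plain)


def read_plain_range(data: bytes, start: int, end: int) -> list:
    """Parse every record whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Find the newline at or after start - 1; the next record begins right after it.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    rows = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # A record that starts inside the range is read to its end, even past `end`.
        rows.append(json.loads(data[pos:newline]))
        pos = newline + 1
    return rows


def try_gzip_from_offset(data: bytes, start: int) -> bool:
    """Return True if the gzip bytes can be decompressed starting at `start`."""
    try:
        gzip.decompress(data[start:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Two independent "workers" each take half of the plain file.
mid = len(plain) // 2
first_half = read_plain_range(plain, 0, mid)
second_half = read_plain_range(plain, mid, len(plain))
plain_is_splittable = first_half + second_half == records

# The same idea applied to the compressed bytes: start in the middle of the stream.
gzip_is_splittable = try_gzip_from_offset(compressed, len(compressed) // 2)

print("plain text split into two ranges:", plain_is_splittable)
print("gzip readable from the middle:", gzip_is_splittable)
```

Under these assumptions the two plain-text halves together reproduce every record exactly once, while decompression from the middle of the gzip stream fails because the bytes at that offset are not a valid gzip header. A single reader therefore has to process the compressed file from the beginning, which is consistent with the lack of scaling reported above.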