GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Dataflow uses incorrect full file size with GS file using Content-Encoding: gzip #517

Open rfevang opened 7 years ago

rfevang commented 7 years ago

To reproduce:
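A minimal sketch of the likely setup (assuming numbers.txt holds the integers 1 through 10000 and was uploaded with something like gsutil cp -Z, which stores the object gzip-compressed with Content-Encoding: gzip while keeping the .txt name; <bucket> is a placeholder):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// numbers.txt contains the integers 1 through 10000, one per line, and was
// uploaded with e.g. `gsutil cp -Z numbers.txt gs://<bucket>/numbers.txt`,
// so the object is stored gzip-compressed with Content-Encoding: gzip.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt"))   // no compression type set
 .apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
p.run();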

Expected: Either all 10000 numbers written, or alternatively gibberish (the raw compressed data) written.
Actual: Only a subset of the numbers is written (1-4664). It looks like the decompressed data is read as if its size were that of the file before decompression.

Specifying GZIP decompression mode works as expected (all 10000 numbers written):

p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt")
       .withCompressionType(CompressionType.GZIP))
 .apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
davorbonaci commented 7 years ago

Thanks @rfevang for this detailed report, much appreciated.

dhalperi commented 7 years ago

My understanding is that this is a fundamental limitation of how GCS handles objects stored with Content-Encoding: gzip.

  1. TextIO.Read uses the file extension to determine whether a file is compressed, and .txt says it is not.
  2. We stat the file, and GCS gives us the compressed size.
  3. We use the GCS client libraries to download the file, and they serve us the uncompressed bytes (they transparently decompress them and we have no way to disable this).
  4. We trust the file size and read only the prefix of the decompressed stream.

So I think this is working as intended -- you simply should not use that mode with GCS unless you force GZIP compression. You arrived at exactly the right solution.

The real issue, I think, is in step 3. If the bytes were not transparently decompressed, we would get the right number of compressed bytes. TextIO would then serve garbage, the user would notice, and they would properly set the GZIP compression flag to force decompression.
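To make the mismatch in steps 2 and 4 concrete, here is a self-contained sketch (plain java.util.zip, not SDK code; the class name is made up) showing that the compressed size is only a fraction of what the decompressed stream yields, so reading a compressed-size prefix drops the tail:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.util.zip.GZIPOutputStream;

public class SizeMismatchDemo {
  public static void main(String[] args) throws IOException {
    // The uncompressed payload: the integers 1..10000, one per line.
    ByteArrayOutputStream plain = new ByteArrayOutputStream();
    try (PrintWriter w = new PrintWriter(plain)) {
      for (int i = 1; i <= 10000; i++) {
        w.println(i);
      }
    }

    // Gzip it, as the object would be stored when uploaded with Content-Encoding: gzip.
    ByteArrayOutputStream gzipped = new ByteArrayOutputStream();
    try (OutputStream gz = new GZIPOutputStream(gzipped)) {
      gz.write(plain.toByteArray());
    }

    int statedSize = gzipped.size();  // what a stat of the object reports (compressed)
    int actualSize = plain.size();    // what the transparently decompressed stream yields

    // Reading only statedSize bytes of the decompressed stream truncates the data,
    // which is why only a prefix of the numbers comes out of the pipeline.
    System.out.printf("stated size: %d bytes, decompressed size: %d bytes (%.0f%% read)%n",
        statedSize, actualSize, 100.0 * statedSize / actualSize);
  }
}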

rfevang commented 7 years ago

It's a bit nasty that it works this way, though: everything looks as if it is working correctly, and depending on the type of file you're processing the truncation can easily go unnoticed.

Considering that Cloud Storage encourages you to upload text files this way (and TextIO only supports text files), I really think that either throwing an exception or yielding the compressed data needs to happen if correct decompression isn't possible. Is there no way to look at the Content-Encoding ahead of time and do the right thing?
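As a user-side workaround sketch (not something TextIO does itself), one could check the object's metadata ahead of time with the google-cloud-storage client and pick the compression type from it; the bucket and object names below are placeholders:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// Look up the object's Content-Encoding before building the read.
Storage storage = StorageOptions.getDefaultInstance().getService();
Blob blob = storage.get(BlobId.of("<bucket>", "numbers.txt"));
boolean gzipEncoded = blob != null && "gzip".equals(blob.getContentEncoding());

// Force GZIP only when the object is actually stored gzip-encoded.
Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt")
    .withCompressionType(gzipEncoded
        ? TextIO.CompressionType.GZIP
        : TextIO.CompressionType.AUTO));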

Also, how does TextIO get the non-decompressed bytes when specifying GZIP encoding manually? It would need to get the data in a different way somehow, so that it doesn't try to decompress already-decompressed bytes, right?

dhalperi commented 7 years ago

I think you're right, @rfevang. I'll try to follow up. Right now, Google Cloud Storage just looks like a filesystem that lies to us about file sizes, but perhaps we can catch this in a different way or push for an upstream behavior change.

dhalperi commented 7 years ago

Leaving this bug open to track.

johnbhurst commented 5 years ago

I just hit this issue. Thank goodness I found this bug report, because I might never have figured it out myself. (Thanks @rfevang!)

Is there anything that can be done? Perhaps at least add an item for this in Troubleshooting Your Pipeline (https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline#Errors), or is this error too uncommon for that?