Open rfevang opened 7 years ago
Thanks @rfevang for this detailed report, much appreciated.
My understanding is that this is a fundamental limitation of GCS's content-encoding handling: the compression mode is inferred from the file extension, and .txt says the file is not compressed. So I think this is working as intended -- you simply should not use that mode with GCS unless you force GZIP compression. You arrived at exactly the right solution.
What should be happening, I think, is this: the issue is in step 3. If the bytes were not transparently decompressed, we would read the right number of (compressed) bytes. TextIO would then serve garbage, the user would notice, and they would properly set the GZIP compression flag to force decompression.
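The size mismatch described above can be reproduced locally with nothing but the standard library. This sketch simulates GCS's transparent decompression: the metadata reports the compressed (stored) size, but the bytes served are the decompressed content, so a reader that trusts the reported size stops early.

```python
import gzip

# The reproduction file: 10000 sequential numbers, one per line.
original = "\n".join(str(n) for n in range(1, 10001)).encode() + b"\n"
compressed = gzip.compress(original)

reported_size = len(compressed)  # what the filesystem metadata claims
served = original                # what actually comes over the wire

# A reader that trusts the reported size reads only a prefix.
truncated = served[:reported_size]
lines_read = truncated.count(b"\n")
print(f"reported size: {reported_size}, actual size: {len(served)}")
print(f"lines read: {lines_read} of 10000")
```

The exact cutoff depends on the gzip compression level, but the prefix always ends well short of 10000 lines, which matches the silently truncated output in the report.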
It's a bit nasty that it works this way, though. Everything appears to work correctly, and whether you actually get all your data depends on the type of file you're processing.
Considering that Cloud Storage encourages you to upload text files this way (and TextIO only supports text files), I really think TextIO needs to either throw an exception or yield the compressed data when decompression isn't possible. Is there no way to check the Content-Encoding ahead of time and do the right thing?
Also, how does TextIO get the non-decompressed bytes when GZIP encoding is specified manually? It would have to fetch the data in a different way somehow; otherwise it would try to decompress bytes that have already been decompressed, right?
I think you're right, @rfevang. I'll try to follow up. Right now Google Cloud Storage just looks to us like a filesystem that lies about its file sizes, but perhaps we can catch this in a different way or push for an upstream behavior change.
Leaving this bug open to track.
I just hit this issue. Thank goodness I found this bug report, because I might never have figured it out myself. (Thanks @rfevang!)
Is there anything that can be done? Perhaps at least add an item for this in Troubleshooting Your Pipeline (https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline#Errors), or is this error too uncommon for that?
To reproduce:
Upload a simple file (10000 sequential numbers, one per line) to Google Cloud Storage, specifying GZIP compression:
gsutil cp -Z numbers.txt gs://<bucket>/numbers.txt
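The input file can be generated with a short stdlib-only Python script. The exact contents of the original numbers.txt are an assumption; this just matches the description above.

```python
# Generate the reproduction input: 10000 sequential numbers, one per line.
with open("numbers.txt", "w") as f:
    for n in range(1, 10001):
        f.write(f"{n}\n")
```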
Execute a simple Dataflow pipeline that just reads, then writes these numbers:
Expected: Either all 10000 numbers written, or alternatively gibberish written (raw compressed data).
Actual: A subset of the numbers written (1-4664). It looks like it reads the decompressed file as if its size were that of the file before decompression.
Specifying GZIP decompression mode works as expected (all 10000 numbers written):
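For reference, the workaround looks roughly like this with the Apache Beam Python SDK. This is a sketch, not the reporter's actual code; the bucket paths are placeholders.

```python
# Sketch: force GZIP decompression instead of relying on AUTO detection,
# which infers "uncompressed" from the .txt extension.
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("gs://<bucket>/numbers.txt",
                            compression_type=CompressionTypes.GZIP)
     | beam.io.WriteToText("gs://<bucket>/output/numbers"))
```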