gcsfs doesn't properly handle gzipped files, ignoring content-encoding

jimmywan commented 2 years ago

I have a file "foo.txt.gz" that has been uploaded with the following metadata:

Content-Type: text/plain Content-Encoding: gzip

I'm trying to copy its contents to a new file in cloud storage that is uncompressed to workaround a bug where my tooling (gcloud) can't properly handle gzip input.

If I try to pass the compression flag on read, it complains about the file not being a gzip file, implying that transcoding is occurring:

with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb', compression='gzip') as read_file:
...     with fs.open('gcs://jw-sandbox/uploads.txt', 'wb') as write_file:
...             shutil.copyfileobj(read_file, write_file)

gzip.BadGzipFile: Not a gzipped file (b'gs')

If I try to read the file without the compression flag and just dump contents to stdout, I only get the first N bytes of the decompressed contents where N is the compressed size:

with fs.open('gcs://jw-sandbox/uploads.txt.gz', 'rb', compression='gzip') as read_file:
...     for f in read_file:
...             print(f)

>>> print(gcsfs.__version__)
2022.02.0

mhfrantz commented 2 years ago

Is this a duplicate of #233?

martindurant commented 2 years ago

Right, the linked issue suggests that the best thing to do is not set the content-encoding, which relates to the transfer rather than the status of the remote file. The transfer is compressed anyway, which for a gzip file will make little difference either way. I believe this isn't how GCS should hand'e this, but nothing to be done about that.

martindurant commented 2 years ago

I wonder, what happens if you .cat() your file with start=10, end=20?

the-xentropy commented 2 months ago

Hey friends, I created a pull request which fixes this issue and won't need any code changes once it lands, as well as the variant. It was caused by us inheriting from AbstractBufferedFile and reporting the metadata 'size' to be the object size. AbstractBufferedFile makes a lot of assumptions about the size we returning being absolutely definitely correct and that ends up being used in a few different places (like all fsspec caches) to only fetch up until that byte range at most, which meant we truncated responses.

Basically, when GCSFilesystem open the GCSFile object it grabs the metadata size with an info() call, which ends up being propagated in the GCSFile's underlying readahead cache to limit the fetch request to the reported max size of the object (even the fall-back fsspec base cache which is not supposed to do anything but pass through requests has this issue), and this max file size which is the compressed file size ultimately ends up in _cat_file to set the Range: bytes=0-<compressed size> and we only fetch the compressed size even from the transitively decompressed data.

If you want to live on the edge in the meanwhile and are OK with functional but hacky solutions, you can jerryrig GCSFile's _fetch_range to pass end=None to cat_file() when end = self.size. This is going to break things if for some reason you're trying to get N decompressed bytes from a file which happens to be N bytes compressed in size, but in practice this is rare since almost all consuming of objects happens relative to the reported object size and not hypothetical decompressed content size and since most open()'s and read()'s are grabbing the whole file this will make everything from pandas to fsspec chug along happily in consuming filesystems.

 def _fetch_range(self, start=None, end=None):
        """Get data from GCS

        start, end : None or integers
            if not both None, fetch only given range
        """
        try:
           # look away, horrible hack
            if self.size == end:
                end = None
            return self.gcsfs.cat_file(self.path, start=start, end=end)
        except RuntimeError as e:
            if "not satisfiable" in str(e):
                return b""
            raise

fsspec / gcsfs

gcsfs doesn't properly handle gzipped files, ignoring content-encoding #461