geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
259 stars 22 forks source link

Fix reading from zstd decompression stream #443

Open ChristofKaufmann opened 2 days ago

ChristofKaufmann commented 2 days ago

Since 0.8.0 reading a zstandard compressed file does not work anymore. The following code works in 0.7.2, but not in 0.8.0 and 0.9.0:

import geopandas as gpd
import zstandard
import io

# make an uncompressed.geojson and a compressed.geojson.zst (fixture)
gdf = gpd.GeoDataFrame(geometry=[Point(4, 2), Point(4, 2)])

gdf.to_file('uncompressed.geojson', driver='GeoJSON')

with open('uncompressed.geojson', 'rb') as ro:
    content = ro.read()

compressed_content = zstandard.compress(content)
with open('compressed.geojson.zst', 'wb') as w:
    w.write(compressed_content)

# try to read geojson from BytesIO
# works in 0.7.2 and 0.9.0
with open('uncompressed.geojson', 'rb') as fh:
    mem = io.BytesIO(fh.read())

gdf = gpd.read_file(mem)
print(gdf)

# try to read geojson from file handle
# works in 0.7.2 and 0.9.0
with open('uncompressed.geojson', 'rb') as fh:
    gdf = gpd.read_file(fh)

print(gdf)

# try to read geojson from zstandard decompression stream
# works in 0.7.2, but not in 0.9.0
with open('compressed.geojson.zst', 'rb') as fh:
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(fh) as reader:
        gdf = gpd.read_file(reader)

print(gdf)

The last test results in the stack trace:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/local/lib/python3.11/dist-packages/geopandas/io/file.py", line 294, in _read_file
    return _read_file_pyogrio(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/geopandas/io/file.py", line 547, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/geopandas.py", line 252, in read_dataframe
    result = read_func(
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/raw.py", line 197, in read
    get_vsi_path_or_buffer(path_or_buffer),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/util.py", line 43, in get_vsi_path_or_buffer
    path_or_buffer.seek(0)
OSError: cannot seek zstd decompression stream backwards

A zstd decompression stream can be seeked only forwards, but not backwards. Hence its seekable method returns False. The issue has been introduced in #397.

This PR fixes it. Instead of only checking, whether the seek method exists, this PR checks whether the seekable method exists and returns True. Only then seek(0) is used.

Do you want me to add any test cases for this? If so, based on zstandard?

brendan-ward commented 2 days ago

Good catch, and sorry for introducing the bug.

I don't want us to add zstandard as a dependency, even for testing, so I created a very lightweight mock class that provides a non-seekable byte stream, and added it to the tests I'd want to see.

Subsequent read operations against the non-seekable bytes (also for the zstandard stream) will fail, but I'm assuming that is well expected for this use case.