geopandas / geopandas

Python tools for geographic data
http://geopandas.org/
BSD 3-Clause "New" or "Revised" License

ENH: In-memory zipped shapefile support #3379

Open egorbn opened 3 months ago

egorbn commented 3 months ago

Hi there,

I need to write a zipped shapefile to a BytesIO object, or, at the very least, to a TemporaryFile.

If a file path is specified in to_file("some/file/path.shz") and this file does not exist, then a zipped shapefile is indeed written there. However, if that file already exists (e.g. because it was opened in a context manager with open()), then regardless of suffix, to_file thinks it's a directory and writes accordingly.

So

gdf.to_file("some/file/path.shz")

creates a zipped shapefile (awesome!)

but

with open("some/file/path.shz", "w") as somefile:
    gdf.to_file(somefile.name)

Raises the following:

15:11:04.909 | ERROR   | fiona._env - `some/file/path.shz' not recognized as a supported file format.
15:11:04.911 | ERROR   | fiona._env - some/file/path.shz is not a directory.

It is possible to write to a BytesIO object from geopandas.to_file(f, driver="ESRI Shapefile"), but the result is a flat document that can't be opened and isn't a valid zipfile (the same goes for providing any file-like object to to_file with driver="ESRI Shapefile").

There are some discussions here but I'm not seeing the "convenience" method or parameter to specify a zipped shapefile explicitly:

          For a shapefile you can already do `gdf.to_file("path.shz")` as the shz extension is interpreted as zipping the shapefile by gdal

Originally posted by @m-richards in https://github.com/geopandas/geopandas/issues/2200#issuecomment-2192685420

m-richards commented 3 months ago

@egorbn geopandas doesn't support writing to zip files directly; the only support we have is what comes natively through GDAL. The pull request you linked was not merged (and I don't think it would help here regardless, because it was zipping based on filename), and in any case it's 3 years out of date.

There are a couple of things going on here

It is possible to write to a BytesIO object from geopandas.to_file(f, driver="ESRI Shapefile"), but the result is a flat document

In all cases this would not be a zip file, as the zip is only inferred from the filename, and there isn't a filename for BytesIO. For pyogrio, writing to in-memory shapefiles is explicitly disallowed for the moment. With engine='fiona', this doesn't seem to work correctly, at least for shapefiles (gpkg seems to work).

with open("some/file/path.shz", "w") as somefile:
    gdf.to_file(somefile.name)

is not going to work because you've opened a file handle and then fed geopandas a string filename - it has no knowledge of the file handle based upon the inputs.
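To make the point about the file handle concrete, here is a minimal sketch using only the standard library (no geopandas call is made, and the temporary path is just a stand-in). It shows that `somefile.name` is an ordinary string and that `open(..., "w")` has already created an empty file at that path before anything else runs, which would be consistent with the "not recognized as a supported file format" error above, though that link is an assumption rather than something verified here.

```python
import os
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "path.shz")
    with open(path, "w") as somefile:
        # The handle's .name attribute is just the path string that was passed in.
        name_is_plain_string = isinstance(somefile.name, str)
        # open(..., "w") has already created an empty file on disk at this point.
        exists_before_write = os.path.exists(path)
        size_before_write = os.path.getsize(path)
```

Anything that receives `somefile.name` therefore sees only a path to an existing zero-byte file, not the open handle.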

egorbn commented 3 months ago

@egorbn geopandas doesn't support writing to zip files directly; the only support we have is what comes natively through GDAL. The pull request you linked was not merged (and I don't think it would help here regardless, because it was zipping based on filename), and in any case it's 3 years out of date.

Thanks, @m-richards, for the quick response! Yes, I see that; I was linking to provide a bit of context, since that discussion was along similar lines and still seems to have at least some relevance.

There are a couple of things going on here

It is possible to write to a BytesIO object from geopandas.to_file(f, driver="ESRI Shapefile"), but the result is a flat document

In all cases this would not be a zip file, as the zip is only inferred from the filename, and there isn't a filename for BytesIO. For pyogrio, writing to in-memory shapefiles is explicitly disallowed for the moment. With engine='fiona', this doesn't seem to work correctly, at least for shapefiles (gpkg seems to work).

I am using fiona and that's what I found -- I could write a shapefile to BytesIO, but it isn't valid and can't be read.

However, the crux of my question is exactly about the zipping being inferred from the filename. Is it possible to trigger that condition through an argument, rather than through the filename?

with open("some/file/path.shz", "w") as somefile:
    gdf.to_file(somefile.name)

is not going to work because you've opened a file handle and then fed geopandas a string filename - it has no knowledge of the file handle based upon the inputs.

Makes sense. I was trying to trick the code into triggering the zipping condition by providing a filename, but also controlling the file-like object.

In any case, the best solution would be just to have an argument to to_file() to trigger the shapefile being zipped. This way I can have full control of the file-like target, but also explicitly request zipping, rather than have to rely on the filename.

The next best thing is this solution, which I am currently using:

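from tempfile import TemporaryDirectory

def zipped_shapefile_bytes(gdf):  # helper name is illustrative
    with TemporaryDirectory() as tmpdir:
        # GDAL zips the shapefile because of the .shz suffix.
        filename = f"{tmpdir}/awesome_features.shz"
        gdf.to_file(filename)
        with open(filename, "rb") as zippedshp:
            return zippedshp.read()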

This way I can still return bytes, but also guarantee that there will be nothing persisted on the filesystem. This last piece is important for my application and is one of the reasons for working with in-memory file-like objects.
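A variant of the same workaround that does not rely on the .shz suffix at all might look like the sketch below: write a plain shapefile into a temporary directory, then zip its component files in memory with the standard library. The helper names `zip_directory_to_bytes` and `write_zipped` are hypothetical and not part of geopandas; `write_fn` is assumed to be any callable that writes files next to the path it is given, such as `gdf.to_file`.

```python
import io
import zipfile
from pathlib import Path
from tempfile import TemporaryDirectory


def zip_directory_to_bytes(directory):
    """Return the files directly inside `directory` as one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                # Store each component (.shp, .shx, .dbf, ...) at the archive root.
                archive.write(path, arcname=path.name)
    return buffer.getvalue()


def write_zipped(write_fn, stem="awesome_features"):
    """Call write_fn(path) inside a temporary directory and return the zipped result."""
    with TemporaryDirectory() as tmpdir:
        write_fn(str(Path(tmpdir) / f"{stem}.shp"))
        return zip_directory_to_bytes(tmpdir)
```

With a GeoDataFrame this would be called as `write_zipped(gdf.to_file)`. The temporary directory is removed when the function returns, so nothing persists on the filesystem, and the caller decides what name the returned bytes get.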

m-richards commented 3 months ago

Having dug a little more, it looks like this isn't supported directly in fiona either: write support is only available for memoryfiles and not zipped memoryfiles https://github.com/Toblerity/Fiona/blob/6534ef74d965e8f9fe760cecf8f022f518b475a7/tests/test_memoryfile.py#L223. You can try and see if you can manage this by feeding /vsi schemes directly, à la https://github.com/geopandas/geopandas/pull/2200#issuecomment-1003367965. In principle you can chain /vsimem together with /vsizip https://gdal.org/user/virtual_file_systems.html#chaining but I've not managed to get that to work in my local testing.

brendan-ward commented 3 months ago

We're tracking this functionality in pyogrio #402, though we have no immediate plans to implement it.

The reason that shapefiles are invalid when using the in-memory file (/vsimem/ via BytesIO) is that there are multiple files involved, and GDAL treats the vsimem file we create as a directory instead. We'd then return the raw bytes of multiple files as a single stream, with no way for the caller to know which bytes go to which file. This is why we specifically disabled the two well-known multi-file drivers: shapefile and file geodatabase.

As per the issue above (and the comments above), if we had a way to instruct the backend to create a zipped file instead, we could then use some of the GDAL VSI file system functions to walk the directory it creates in the vsimem file and add those files to the output zip file, so we could return a single zipped file as bytes. I'd prefer to do that via an argument (e.g., zipped=True) rather than inferring it from the filename, which isn't passed in when using BytesIO anyway; you can then give the output bytes whatever filename you want when you handle them.

My understanding is that chaining the /vsizip handler works for reading, but not for writing.
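As a rough illustration of the packing step described above, the sketch below bundles several named byte payloads into one in-memory zip using only the Python standard library. It makes no GDAL or pyogrio calls: the `parts` mapping is a placeholder for whatever walking the /vsimem/ directory would yield, and `bundle_as_zip` is a hypothetical helper name, not an existing API.

```python
import io
import zipfile


def bundle_as_zip(files):
    """Pack a {name: bytes} mapping into a single in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            # Each entry keeps its own name, so the caller can tell the parts apart.
            archive.writestr(name, payload)
    return buffer.getvalue()


# Placeholder bytes standing in for the component files of one shapefile.
parts = {"layer.shp": b"shp-bytes", "layer.shx": b"shx-bytes", "layer.dbf": b"dbf-bytes"}
zipped = bundle_as_zip(parts)

# Reading the archive back recovers every part under its original name.
with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
    recovered = {name: archive.read(name) for name in archive.namelist()}
```

Unlike a raw concatenation of the files, the zip container records which bytes belong to which file, which is the property the single-stream output is missing.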