InseeFrLab / pynsee

pynsee package contains tools to easily search and download french data from INSEE and IGN APIs
https://pynsee.readthedocs.io/en/latest/
MIT License
67 stars 8 forks source link

geodataframe parquet storage exception #202

Open tgrandje opened 1 month ago

tgrandje commented 1 month ago

Hello @hadrilec @tfardet

There seems to be a bug on the parquet storage for geodataframes: geodataframes are not rightly handled by pandas/pyarrow.

It seems a simple try/except and geopandas overlay might do the trick. I'll check and propose a PR.

Sample code:

from pynsee.geodata import get_geodata
cities = get_geodata("ADMINEXPRESS-COG-CARTO.LATEST:commune")

The parquet file will not be written in the cache dir, even if pynsee says it was saved.

The warning was also not displayed and the culprit seems to be here. Do any of you remember why this was added in the first place?

(Sorry I'm not present on pynsee anymore, I've had too much on my plate. Hopefully it will be better before the end of this year)

tgrandje commented 1 month ago

This seems to be a clear use case for adding geopandas as a dependency : https://github.com/InseeFrLab/pynsee/issues/169

tfardet commented 1 month ago

This warning filter was before my time, but indeed, I just noticed the issue just today as well ^^ I'm also obviously 100% in favor of adding geopandas as dependency!

hadrilec commented 1 month ago

I would like to avoid to have geopandas as a dependency because I struggled once to install this package on my laptop. If I remember correctly, this package comes with special external dependencies. I dont see why it is absolutely needed to include it as a dependency.

tgrandje commented 1 month ago

Was this recent? I don't recall having any trouble recently. At least not since Christoph Gohlke's alternative repo with built whl went (almost) down and the team behind Fiona was forced to compile the whl for windows directly on pypi... (And believe me, I was the first to struggle with issues on windows server at that time.)

I'd say the compromise would be to declare geopandas as an optional dependency. If geopandas is present, then we can write and read geoparquets files. If not, we resort to the "old" pickle way and set a warning to the user.

tfardet commented 1 month ago

I would like to avoid to have geopandas as a dependency because I struggled once to install this package on my laptop

As I've said before, I also don't think this is relevant anymore since geopandas has been using wheels for a long time.

I believe anyone who receives geodata in python as something that looks like a dataframe would (rightfully) expect a GeoDataFrame and handing them something that is not and does not even behave like one is not a good idea.

EDIT: if the GeoFrDataFrame is also the cause for the geoparquet issue, that's also another very good reason to get rid of it