geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

BUG: geojson is not read correctly with geopandas>=1.0.0 #445

Open veenstrajelmer opened 1 month ago

veenstrajelmer commented 1 month ago

Code Sample, a copy-pastable example

import geopandas as gpd
uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson")
time_min = uhslc_gpd["fd_span"].apply(lambda x: x["oldest"])

Problem description

The above code raises "TypeError: string indices must be integers, not 'str'" in geopandas>=1.0.0. With older versions the code runs successfully. The issue is that the column now contains strings holding serialized dicts instead of plain dicts. It seems that something goes wrong when the nested fields of the geojson are parsed.
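The failure can be reproduced without network access. The following is a minimal sketch, assuming the field arrives as a JSON string; the value is a made-up stand-in and is not taken from the actual file.

```python
import json

# Made-up stand-in for one "fd_span" value as returned by the new engine:
# a JSON string, not a dict.
value = '{"oldest": "1900-01-01"}'

try:
    value["oldest"]  # indexing a str with a str key
    raised = False
except TypeError:
    raised = True  # the kind of error reported above

# Decoding the string first gives the dict that the lambda expects.
decoded = json.loads(value)
oldest = decoded["oldest"]
```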

Expected Output

A subset of the original column.

Output of geopandas.show_versions()

SYSTEM INFO
-----------
python     : 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:29:11) [MSC v.1935 64 bit (AMD64)]
executable : C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\python.exe
machine    : Windows-10-10.0.19045-SP0

GEOS, GDAL, PROJ INFO
---------------------
GEOS          : 3.11.2
GEOS lib      : None
GDAL          : 3.8.5
GDAL data dir : C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\Lib\site-packages\pyogrio\gdal_data\
PROJ          : 9.3.0
PROJ data dir : C:\Users\veenstra\Anaconda3\envs\dfm_tools_env\Lib\site-packages\pyproj\proj_dir\share\proj

PYTHON DEPENDENCIES
-------------------
geopandas   : 1.0.0
numpy       : 1.26.4
pandas      : 2.2.2
pyproj      : 3.6.1
shapely     : 2.0.2
pyogrio     : 0.9.0
geoalchemy2 : None
geopy       : 2.4.1
matplotlib  : 3.8.4
mapclassify : None
fiona       : 1.9.5
psycopg     : None
psycopg2    : None
pyarrow     : None

martinfleis commented 1 month ago

Thanks for the report! I can confirm that with the new default IO engine pyogrio, this indeed returns a string.

A workaround is to use the old engine, which was the default before 1.0:

uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson", engine="fiona")

@brendan-ward will know better whether this is expected or something we need to process differently in pyogrio.

veenstrajelmer commented 1 month ago

@martinfleis thanks a lot for this useful suggestion, this conveniently solves the issue I had at least on my side. However, the engine string seems to be case sensitive, so it should be engine='fiona'.

martinfleis commented 1 month ago

I'll keep this open and move it to pyogrio as we may want to look into that there.

brendan-ward commented 1 month ago

It looks like there is a field subtype, OFSTJSON, that Fiona uses in this case to automatically convert values to dict on read and, on write, to automatically serialize dict / list values.

On the Pyogrio side, we need to detect this subtype and carry through that info when deserializing / serializing fields. Serializing is likely to be harder because the numpy array dtype does not give us this info - so there may be a real performance penalty there (or we leave this the responsibility of the user).

For now, you could also manually parse applicable fields to dict and still get the speedups of Pyogrio:

import json

import geopandas as gpd

uhslc_gpd = gpd.read_file("https://uhslc.soest.hawaii.edu/data/meta.geojson")
# JSON fields come back as strings; decode them into dicts
uhslc_gpd["rq_span"] = uhslc_gpd["rq_span"].apply(json.loads)

veenstrajelmer commented 1 month ago

Thanks for the suggestion. That would also work indeed, but "rq_span" is not the only field that requires conversion, so for my application I prefer the fiona approach for now.
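
When several fields need conversion, the manual approach can be applied to every affected column in one pass. The following is a rough sketch, not something pyogrio or geopandas provides: `parse_json_value` and `parse_json_columns` are hypothetical helper names, and the default column detection assumes that string fields arrive as object-dtype columns, which may not hold for every pandas version or read option.

```python
import json


def parse_json_value(value):
    """Decode a JSON object/array string; return anything else unchanged."""
    if not isinstance(value, str):
        return value
    stripped = value.strip()
    # Only strings that look like a JSON object or array are candidates.
    if not stripped.startswith(("{", "[")):
        return value
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        # Not valid JSON after all; keep the original string.
        return value


def parse_json_columns(df, columns=None):
    """Decode JSON strings in `columns` (default: all object-dtype columns).

    Hypothetical helper: works on any pandas-like frame, modifies it in
    place and returns it.
    """
    if columns is None:
        # Assumption: string fields are stored as object-dtype columns.
        columns = [c for c in df.columns if df[c].dtype == object]
    for col in columns:
        df[col] = df[col].apply(parse_json_value)
    return df
```

Calling `parse_json_columns(uhslc_gpd)` after `gpd.read_file(...)` should then leave plain strings untouched and turn the serialized fields into dicts. Passing an explicit list such as `columns=["fd_span", "rq_span"]` avoids scanning every string in the other columns.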