geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
258 stars 22 forks source link

Question: recommended way to load columns containing SRID info? #356

Open codeananda opened 5 months ago

codeananda commented 5 months ago

I've got some csv files where the geometry column is named 'geom' and looks like this

SRID=27700;POLYGON()
SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))
SRID=27700;MULTIPOLYGON()

To load it, I manually check and modify the columns but wonder if there is a way for pyogrio to handle this itself?

Some things I've tried

My working code

import pandas as pd
import geopandas as gpd
from shapely import wkt
from io import StringIO

# What I actually do
shlaa_gdf = gpd.read_file(
    "SHLAA.csv",
    engine='pyogrio',
    encoding='utf-8',
)

# Toy example
csv_content = """fid,geom
1,"SRID=27700;POLYGON()"
2,"SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))"
3,"SRID=27700;MULTIPOLYGON()" """

shlaa_gdf = pd.read_csv(StringIO(csv_content))
shlaa_gdf = gpd.GeoDataFrame(shlaa_gdf)

def load_to_geom(x):
    if x == 'POLYGON()':
        return wkt.loads('POLYGON EMPTY')
    elif x == 'MULTIPOLYGON()':
        return wkt.loads('MULTIPOLYGON EMPTY')
    else:
        return wkt.loads(x)

if 'geom' in shlaa_gdf.columns:
    split_df = shlaa_gdf['geom'].str.split(';', expand=True)
    shlaa_gdf['srid'] = split_df[0]
    shlaa_gdf['wkt_geom'] = split_df[1]

    shlaa_gdf['geometry'] = shlaa_gdf['wkt_geom'].apply(load_to_geom)
    shlaa_gdf = shlaa_gdf.set_geometry('geometry')
martinfleis commented 5 months ago

This is EWKT, or Extended Well-Known Text representation of geometries. With the exception of empty geometries, which are invalid here. I can't seem to find a reference to it in GDAL docs so I suppose that it supports only standard WKT, which is everything after ;.

We could potentially try to detect it, parse it and create geometry column, assuming all SRID's are the same, though it is not supported at the moment. Though you'd need to ensure that POLYGON() is not present as that is incorrect WKT.

My recommendation for now is to use what you have, maybe trying to use vectorized shapely.from_wkt rather than wkt.loads but that is a minor difference.

theroggy commented 5 months ago

For information, as mentioned by @martinfleis , if you would have to process larger datasets it would probably be more efficiënt to use only vectorized functions.

E.g. like this:

import pandas as pd
import geopandas as gpd
import shapely
from io import StringIO

# Toy example
csv_content = """fid,geom
1,"SRID=27700;POLYGON()"
2,"SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))"
3,"SRID=27700;MULTIPOLYGON()"
"""

shlaa_df = pd.read_csv(StringIO(csv_content))

shlaa_df[["srid", "wkt"]] = shlaa_df["geom"].str.split(";", expand=True)
shlaa_df.loc[shlaa_df["wkt"] == "POLYGON()", "wkt"] = "POLYGON EMPTY"
shlaa_df.loc[shlaa_df["wkt"] == "MULTIPOLYGON()", "wkt"] = "MULTIPOLYGON EMPTY"
crs = shlaa_df.iloc[0]["srid"].split("=")[1]

shlaa_gdf = gpd.GeoDataFrame(geometry=shapely.from_wkt(shlaa_df["wkt"]), crs=crs)

print(shlaa_gdf)
print(shlaa_gdf.crs)
codeananda commented 5 months ago

Amazing thank you!

@theroggy could you write 'python' after the triple backticks? It's so much nicer to read.