geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
486 stars 45 forks source link

ENH: Add `from_wkt` and `from_wkb` tools #293

Closed giorgiobasile closed 2 months ago

giorgiobasile commented 3 months ago

Implements dask_geopandas.from_wkt and dask_geopandas.from_wkb, taking inspiration from dask_geopandas.points_from_xy.

Closes #208

jorisvandenbossche commented 3 months ago

@giorgiobasile Thanks for the PR!

One question we have to think about is the exact API: do we want to do this at the dataframe level, or at the (geo)series level? Right now you added a function that takes a dataframe and returns a GeoSeries. But if it returns a GeoSeries, we could also let it accept a Series (and the user can do the column selection themselves). Or otherwise let it accept and return a (Geo)DataFrame.

giorgiobasile commented 3 months ago

Thanks @jorisvandenbossche for your remarks!

One question we have to think about is the exact API: do we want to do this at the dataframe level, or at the (geo)series level? Right now you added a function that takes a dataframe and returns a GeoSeries.

Yes, the reason I sticked with the dask.dataframe -> gpd.GeoSeries interface was mainly for consistency with dask_geopandas.points_from_xy which I took inspiration from, as I needed to do something like:

gdf = dask_geopandas.from_dask_dataframe(
    ddf, geometry=dask_geopandas.from_wkt(ddf, geometry_wkt='wkt', crs="EPSG:4326")
)

Just double checking I'm getting what you are suggesting:

But if it returns a GeoSeries, we could also let it accept a Series (and the user can do the column selection themselves)

So this would change to allow for something like:

gdf = dask_geopandas.from_dask_dataframe(
    ddf, geometry=dask_geopandas.from_wkt(ddf["wkt"], crs="EPSG:4326")
)

Or otherwise let it accept and return a (Geo)DataFrame.

In this case the user would have a unified API, like:

gdf = dask_geopandas.from_wkt(ddf, geometry_wkt='wkt', crs="EPSG:4326")

Does this all make sense? Also, If we go for any of the latter two cases, should the interface be applied to dask_geopandas.points_from_xy as well, for sake of consistency?

jorisvandenbossche commented 2 months ago

Apologies for the slow follow-up!

Yes, the reason I sticked with the dask.dataframe -> gpd.GeoSeries interface was mainly for consistency with dask_geopandas.points_from_xy which I took inspiration from

Yes, that's a good point. points_from_xy has the constraint that it has multiple input columns, though, which we don't have for from_wkb, so we have a bit more flexibility.

But if it returns a GeoSeries, we could also let it accept a Series (and the user can do the column selection themselves)

So this would change to allow for something like:

gdf = dask_geopandas.from_dask_dataframe(
    ddf, geometry=dask_geopandas.from_wkt(ddf["wkt"], crs="EPSG:4326")
)

I think this feels the most natural API to me / most consistent with how it works in geopandas.

giorgiobasile commented 2 months ago

Apologies for the slow follow-up!

No problem!

I think this feels the most natural API to me / most consistent with how it works in geopandas.

Ok sure, I just updated it accordingly.

jorisvandenbossche commented 2 months ago

Thanks @giorgiobasile!

giorgiobasile commented 2 months ago

Thanks @giorgiobasile!

@jorisvandenbossche thank you for your suggestions!