geopandas / community

GeoPandas Community Docs and Discussions

Dask Summit 2021 - "Scaling geospatial vector data" workshop #4

Open jorisvandenbossche opened 3 years ago

jorisvandenbossche commented 3 years ago

During the Dask Summit, we have a 2-hour workshop scheduled about scaling geospatial vector data, on Thursday, May 20th, from 11:00 to 13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).

We can use this issue to further gather ideas and discuss the exact content of the workshop.

Workshop abstract:

The geospatial Python ecosystem provides a nice set of tools for working with vector data, including Shapely for geometry operations and GeoPandas for working with tabular data (and many other packages for IO, visualization, domain-specific processing, …). One of the limitations of those core tools is suboptimal performance and limited scaling possibilities.

Over the last few years, effort has been put into improving performance through vectorized interfaces to GEOS, the underlying C library of Shapely. In turn, that enables releasing the GIL and makes the Dask-GeoPandas combination more interesting. GeoPandas is an extension of the pandas DataFrame, so the way Dask scales pandas can be applied to GeoPandas as well. An initial effort to build a bridge between Dask and GeoPandas is currently taking shape as the dask-geopandas library.

Other interesting efforts in this space are also popping up. The SpatialPandas package provides alternative pandas and Dask extensions for vectorized spatial and geometric operations. Libraries such as datashader and pydeckgl can be used to visualize larger spatial datasets.

This workshop will give a brief overview of some of the packages and ongoing efforts, and provide a place to discuss further improvements and interoperability between the libraries, with an emphasis on the conceptual design of distributed computation on inherently unpredictable vector data.
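As a rough illustration of the partition-wise idea mentioned in the abstract (a table is split into partitions and the same operation is applied to each one), here is a minimal pure-Python sketch. It does not use the dask-geopandas API: `polygon_area`, `split_into_partitions`, and `map_partitions` are hypothetical helper names, and the partitions are processed sequentially here, where a scheduler such as Dask could run them in parallel.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # ring of (x, y) vertices, first vertex not repeated


def polygon_area(ring: Polygon) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def split_into_partitions(rows: Sequence, npartitions: int) -> List[list]:
    """Split rows into npartitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), npartitions)
    parts, start = [], 0
    for i in range(npartitions):
        stop = start + size + (1 if i < extra else 0)
        parts.append(list(rows[start:stop]))
        start = stop
    return parts


def map_partitions(func: Callable[[list], list], partitions: List[list]) -> List[list]:
    """Apply func to every partition independently (sequentially in this sketch)."""
    return [func(part) for part in partitions]


polygons = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],  # unit square
    [(0, 0), (2, 0), (2, 2), (0, 2)],  # 2 x 2 square
    [(0, 0), (4, 0), (0, 3)],          # right triangle
]
partitions = split_into_partitions(polygons, npartitions=2)
areas_per_partition = map_partitions(
    lambda part: [polygon_area(p) for p in part], partitions
)
# Concatenate the per-partition results back into one list.
areas = [a for part in areas_per_partition for a in part]
```

Element-wise operations such as area fit this pattern because no partition needs data from another one; operations that relate geometries to each other are where spatial partitioning starts to matter.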

More detailed agenda:

cc @martinfleis @jsignell

jorisvandenbossche commented 3 years ago

Posting my initial brainstorm list of possible topics here:

martinfleis commented 3 years ago

Thanks for starting this!

I would leave GPU out of the discussion for now. The situation there is very different at the moment and it would probably require its own introduction and discussion topics, not necessarily linked to dask.

I would like to spend a reasonable amount of time on spatial partitioning and overlapping computations because figuring out this bit properly is key in my eyes. It is not a straightforward task at all because one approach needs to be used for postcode zones (contiguous compact polygons) and another one for, say, linestring trajectories.
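To make the contrast concrete, here is a small sketch that assumes a regular grid is used as the partitioning scheme (one possible choice among many, picked only for illustration). The helper `cells_for_bbox` and the example bounding boxes are hypothetical; the sketch only counts how many grid cells each geometry's bounding box touches.

```python
import math
from typing import Set, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def cells_for_bbox(bbox: BBox, cell_size: float) -> Set[Tuple[int, int]]:
    """Return the (column, row) grid cells that a bounding box overlaps."""
    minx, miny, maxx, maxy = bbox
    col0, col1 = math.floor(minx / cell_size), math.floor(maxx / cell_size)
    row0, row1 = math.floor(miny / cell_size), math.floor(maxy / cell_size)
    return {(c, r) for c in range(col0, col1 + 1) for r in range(row0, row1 + 1)}


# Made-up bounding boxes: a compact postcode zone and a long trajectory.
postcode_zone = (2.0, 3.0, 6.0, 8.0)
trajectory = (5.0, 5.0, 48.0, 12.0)

zone_cells = cells_for_bbox(postcode_zone, cell_size=10.0)
trajectory_cells = cells_for_bbox(trajectory, cell_size=10.0)

print(len(zone_cells), len(trajectory_cells))
```

The compact polygon falls into a single cell, while the trajectory's bounding box spans many, so it would either have to be duplicated across partitions or assigned to one partition with a much larger extent. A bounding box also overstates the footprint of a diagonal line, which is another reason the two kinds of data may call for different strategies.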

Agree on IO. I guess that PostGIS links will be more important in dask-geopandas than they are in geopandas.

We can touch visualisation while talking about spatialpandas, since that is used as a direct interface to datashader. (As a side note, it may be useful to work out dask-based conversion between dask-geopandas and spatialpandas geometries.)

jorisvandenbossche commented 3 years ago

I updated the top post with a summary of what we discussed yesterday (and to be completed if people confirm)

martinfleis commented 3 years ago

Should we maybe switch use cases and my bit on partitioning and indexing? That way I can try to summarise them and open the floor for the main discussion in which we can reflect on real-life use cases along the way.

edit: I switched it above

jorisvandenbossche commented 3 years ago

Sounds good. I am only wondering if we should then also move spatialpandas to just before your talk (since it will mainly touch on the spatial partitioning / Hilbert curve for repartitioning)? Although on the other hand it also fits after my dask-geopandas explanation.
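For readers unfamiliar with the Hilbert curve idea, here is a minimal sketch of how it can be used for repartitioning: points are mapped to cells of a 2**order x 2**order grid, sorted by their position along the curve, and cut into equal-sized chunks. This is the textbook coordinate-to-distance algorithm, not the spatialpandas implementation; `hilbert_distance`, `hilbert_partitions`, and the sample points are hypothetical names and data.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def hilbert_distance(x: int, y: int, order: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_partitions(
    points: List[Point],
    bounds: Tuple[float, float, float, float],
    npartitions: int,
    order: int = 8,
) -> List[List[Point]]:
    """Sort points along the curve and split them into at most npartitions chunks."""
    minx, miny, maxx, maxy = bounds
    n = 1 << order

    def cell(v: float, lo: float, hi: float) -> int:
        # Scale a coordinate to an integer grid index in [0, n - 1].
        return min(n - 1, int((v - lo) / (hi - lo) * n))

    ordered = sorted(
        points,
        key=lambda p: hilbert_distance(
            cell(p[0], minx, maxx), cell(p[1], miny, maxy), order
        ),
    )
    size = -(-len(ordered) // npartitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


points = [(0.1, 0.1), (0.9, 0.9), (0.2, 0.1), (0.8, 0.9), (0.1, 0.9), (0.9, 0.1)]
parts = hilbert_partitions(points, bounds=(0.0, 0.0, 1.0, 1.0), npartitions=3)
```

Because consecutive positions on the curve are always neighbouring grid cells, points that end up in the same chunk tend to be close in space. For non-point geometries one would have to pick a representative point (for example the bounding-box centre), which is again an assumption rather than something settled in this thread.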

martinfleis commented 3 years ago

I'd leave it where it is to cover the existing packages first.