geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

ENH: extending the basic spatial join #72

Open jorisvandenbossche opened 3 years ago

jorisvandenbossche commented 3 years ago

https://github.com/geopandas/dask-geopandas/pull/54 added a basic spatial join function. It's a naive cross product of all partitions of both left and right (if no spatial partitioning information is available) or all partition combinations that intersect (based on the spatial partitioning information).
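To make the pairing rule concrete, here is a minimal pure-Python sketch of how the partition combinations could be enumerated. It is not the dask-geopandas implementation: `boxes_intersect` and `partition_pairs` are hypothetical helpers, and each partition's spatial extent is assumed to be a simple `(minx, miny, maxx, maxy)` bounding box.

```python
from itertools import product


def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition_pairs(n_left, n_right, left_bounds=None, right_bounds=None):
    """Return the (left_idx, right_idx) pairs that need a per-partition join.

    Without bounds for both frames, fall back to the full cross product.
    With bounds, keep only the pairs whose bounding boxes intersect.
    """
    all_pairs = product(range(n_left), range(n_right))
    if left_bounds is None or right_bounds is None:
        return list(all_pairs)
    return [
        (i, j)
        for i, j in all_pairs
        if boxes_intersect(left_bounds[i], right_bounds[j])
    ]


# Made-up extents for 2 left partitions and 3 right partitions.
left_bounds = [(0, 0, 10, 10), (10, 0, 20, 10)]
right_bounds = [(0, 0, 4, 4), (6, 6, 12, 9), (30, 30, 40, 40)]

print(len(partition_pairs(2, 3)))                         # 6: full cross product
print(partition_pairs(2, 3, left_bounds, right_bounds))   # [(0, 0), (0, 1), (1, 1)]
```

Each returned pair corresponds to one per-partition join, and so to one output partition in the current approach.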

This works fine for an inner join and when your spatial partitions are already reasonable. There are however several aspects still to consider.

Supporting a left join

Currently we only allow how="inner", because the naive version works nicely for this case. A "left" join, however, is more complicated. Suppose a certain partition of the left frame intersects with two partitions of the right frame. We perform a normal spatial join (using geopandas.sjoin) for each of those two combinations, resulting in two output partitions. With a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match any geometry of the right dataframe. If we do this for each combination, rows of the left dataframe can end up in multiple output partitions (instead of in only one of them, as is the case for an inner join). As a result, we would end up with a dask GeoDataFrame with duplicated rows.
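The duplication can be reproduced with a toy model. The sketch below does not call geopandas: `toy_sjoin` is a hypothetical stand-in for geopandas.sjoin that joins points `(id, x, y)` against named boxes, and the data are made up.

```python
def contains(box, x, y):
    """True if point (x, y) lies inside the (minx, miny, maxx, maxy) box."""
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def toy_sjoin(left_rows, right_rows, how):
    """Toy stand-in for a spatial join of points against named boxes."""
    out = []
    for pid, x, y in left_rows:
        matches = [name for name, box in right_rows if contains(box, x, y)]
        if matches:
            out.extend((pid, name) for name in matches)
        elif how == "left":
            out.append((pid, None))  # a left join keeps unmatched left rows
    return out


# One left partition that intersects two right partitions.
left_part = [("p1", 1, 1), ("p2", 8, 8), ("p3", 5, 5)]
right_a = [("A", (0, 0, 2, 2))]
right_b = [("B", (7, 7, 9, 9))]

inner = toy_sjoin(left_part, right_a, "inner") + toy_sjoin(left_part, right_b, "inner")
left = toy_sjoin(left_part, right_a, "left") + toy_sjoin(left_part, right_b, "left")

print(inner)  # [('p1', 'A'), ('p2', 'B')]
print(left)   # 6 rows, although a correct left join has only 3
```

In the left-join result, `p3` (which matches nothing) appears twice, and `p1` and `p2` each get a spurious unmatched row from the right partition they do not match.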

Improving the output (spatial) partitions

Because the current implementation combines each partition of the left frame with each (intersecting) partition of the right dataframe, the result inherently has more and smaller partitions in the output.
Suppose again we have a certain partition of the left frame that intersects with two partitions of the right frame. Each of those two combinations is currently a task in the graph that ends up as a new partition in the output dataframe.

Overall, and depending on the exact (spatial) partitioning of the input, this can lead to a "fragmented" output dask GeoDataFrame with many smaller partitions, which could negatively impact the performance of further computations. It might be interesting to look into "merging" chunks again: for example, all output partitions coming from a certain partition of the left input could be concatenated into a single output partition.
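As a rough sketch of that merging idea, the per-pair results could be grouped by the left partition they came from. The helpers `group_by_left` and `merged_outputs` are hypothetical, and plain lists stand in for the dataframes that would be concatenated in practice.

```python
from collections import defaultdict


def group_by_left(pairs):
    """Group (left_idx, right_idx) pairs by their left partition index."""
    groups = defaultdict(list)
    for left_idx, right_idx in pairs:
        groups[left_idx].append(right_idx)
    return dict(groups)


def merged_outputs(pairs, join_pair):
    """Build one output partition per left partition.

    `join_pair(left_idx, right_idx)` is assumed to return the rows of one
    per-pair join; the rows of all pairs sharing a left partition are
    concatenated.
    """
    return {
        left_idx: [
            row
            for right_idx in right_idxs
            for row in join_pair(left_idx, right_idx)
        ]
        for left_idx, right_idxs in group_by_left(pairs).items()
    }


# Three per-pair joins collapse into two output partitions.
pairs = [(0, 0), (0, 1), (1, 1)]
merged = merged_outputs(pairs, lambda i, j: [f"L{i}xR{j}"])
print(merged)  # {0: ['L0xR0', 'L0xR1'], 1: ['L1xR1']}
```

With this grouping the output has at most as many partitions as the left input, at the cost of each output task depending on several per-pair joins.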

Repartition the input before performing a spatial join?

For pre-partitioned left and right datasets, the currently implemented approach (joining geometries partition by partition, for each pair of partitions that intersect) is probably fine. But there will also be cases where it could be interesting to repartition the input before performing the join.

One example case (copied from an email conversation with @brendan-ward): for a target dataset that has already been partitioned (assume it has the greater number of features), partition the query dataset to match the target. I guess this is the same as above, just with the query dataset starting out as a single partition. It may involve cutting the query dataset by the bounding boxes of the intersecting partitions.
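A simplified sketch of matching the query dataset to the target's partitions follows. It assumes every query feature is represented by its bounding box, and `split_query_by_target` is a hypothetical helper. Instead of actually cutting geometries, it assigns a feature that straddles a partition boundary to every partition it touches.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def split_query_by_target(query_boxes, target_bounds):
    """Map each target partition index to the query ids intersecting it."""
    return {
        idx: [qid for qid, box in query_boxes.items() if boxes_intersect(box, bounds)]
        for idx, bounds in enumerate(target_bounds)
    }


# Made-up target partition extents and query feature boxes.
target_bounds = [(0, 0, 10, 10), (10, 0, 20, 10)]
query_boxes = {"q1": (1, 1, 2, 2), "q2": (9, 4, 11, 5), "q3": (15, 5, 16, 6)}

assignment = split_query_by_target(query_boxes, target_bounds)
print(assignment)  # {0: ['q1', 'q2'], 1: ['q2', 'q3']}
```

Here `q2` straddles the boundary and lands in both query partitions, so a later join would still need to account for features that appear more than once.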

jorisvandenbossche commented 3 years ago

Supporting a left join ... when doing a left join, geopandas.sjoin preserves the rows of the left dataframe that don't match with a geometry of the right dataframe. But so if we do this multiple times, it means that rows of the left dataframe can end up in multiple output partitions

One "relatively easy" way to do this is to always do at least one "left join" for each partition of the left dataframe, and then inner joins for any additional sjoin calls for that same partition. (Unfortunately, this would still potentially give duplicated rows.)
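The caveat in parentheses can be seen in a toy model of this strategy. As before, `toy_sjoin` is a hypothetical stand-in for geopandas.sjoin operating on points `(id, x, y)` and named boxes, with made-up data.

```python
def contains(box, x, y):
    """True if point (x, y) lies inside the (minx, miny, maxx, maxy) box."""
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy


def toy_sjoin(left_rows, right_rows, how):
    """Toy stand-in for a spatial join of points against named boxes."""
    out = []
    for pid, x, y in left_rows:
        matches = [name for name, box in right_rows if contains(box, x, y)]
        if matches:
            out.extend((pid, name) for name in matches)
        elif how == "left":
            out.append((pid, None))  # a left join keeps unmatched left rows
    return out


def left_then_inner(left_rows, right_parts):
    """Left join against the first right partition, inner joins for the rest."""
    out = []
    for k, right_rows in enumerate(right_parts):
        out.extend(toy_sjoin(left_rows, right_rows, "left" if k == 0 else "inner"))
    return out


left_part = [("p1", 1, 1), ("p2", 8, 8), ("p3", 5, 5)]
right_parts = [[("A", (0, 0, 2, 2))], [("B", (7, 7, 9, 9))]]

result = left_then_inner(left_part, right_parts)
print(result)
```

The fully unmatched row `p3` now appears only once, but `p2` shows up both as unmatched (from the left join against the first right partition) and as matched with `B` (from the inner join against the second), so the result has four rows where a correct left join has three.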

alxmrs commented 8 months ago

Hello! What's the priority on this feature request? I'm interested in performing left joins (a left outer join) with this library.

martinfleis commented 8 months ago

@alxmrs we don't really have the capacity to work on dask-geopandas beyond maintenance these days. I'd be happy to review a PR if someone wants to give it a go, but it is unlikely that any of us will try to implement left joins anytime soon.

alxmrs commented 8 months ago

Hey Martin. That is totally fair. I might be interested in submitting a PR later this year. Before I do, may I reach out to someone on your team for pointers?


martinfleis commented 8 months ago

Sure, if you have anything to discuss leave it in this issue.