martinfleis opened this issue 4 years ago
One question that comes up here (and it's the same question for spatial repartitioning a dask.dataframe to conform to given regions): how to deal with possible duplicates?
If you specify the regions for the spatial partitions and use `ST_INTERSECTS`, a given geometry can intersect with multiple regions.
In @mrocklin's prototype, he had some special code to check the representative point of a geometry instead, to avoid such duplicates: https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L339-L406
I was hoping that this would be resolved in the spatial repartitioning bit and that we could just use the same principle here :).
Using an intersection with a representative point instead of the geometry itself is one way, and it would resolve a lot of cases. But not all: you can have input points fixed to some grid and use the same grid for partitioning, in which case one point could intersect the edge of 2 (or even 4) partitions. So not even that is fully robust.
I would probably try the following:

- fetch the data for each partition based on its region (that is where `ST_INTERSECTS` would be used)
- figure out which geometries ended up in more than one partition (e.g. with `query_bulk` against the partition geometries)
- keep each of those geometries in only one partition and drop it from the others (once we know where each geometry belongs, then it is easy)

I am just afraid that it can be expensive.
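For illustration, a minimal sketch of that representative-point / `query_bulk` idea (the `assign_partition` helper and the `regions` GeoSeries are placeholders, not existing API; it assumes geopandas' `sindex.query_bulk`): each geometry gets the first region its representative point intersects, so a point sitting exactly on a shared edge still ends up in a single partition.

```python
import numpy as np
import geopandas


def assign_partition(gdf, regions):
    """Assign every geometry in ``gdf`` to exactly one region.

    ``regions`` is a GeoSeries of (possibly touching) partition polygons.
    Returns an array of region indices, -1 for geometries outside all regions.
    """
    points = gdf.representative_point()
    # (input index, region index) pairs for all intersections
    inp, reg = regions.sindex.query_bulk(points, predicate="intersects")
    # a point on a shared edge hits 2 (or 4) regions; keep only the first hit
    first = np.unique(inp, return_index=True)[1]
    assignment = np.full(len(gdf), -1, dtype=int)
    assignment[inp[first]] = reg[first]
    return assignment
```

With such an assignment, the duplicates produced by per-region `ST_INTERSECTS` queries could be dropped by keeping only the rows whose assigned region matches the partition they were read into.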
When starting from an existing dask.dataframe (not necessarily in memory, it can also be backed by reading in from e.g. parquet), doing a repartition is expensive anyway (since it does a full re-shuffle of the data), so the additional checks for intersects / duplicates might not be that much of a problem. And for this case, your points above seem like a good workflow (and I think we can probably simply keep the duplicates in one of the partitions, no matter which one, without using the clever touch/shift trick).
For reading from PostGIS, it's a bit different though. Because in this case ideally the queries you do on PostGIS directly give you the correct data for the partitions of the GeoDataFrame, I think? (so we wouldn't use `query_bulk` on those data afterwards to check if the partitioning is correct) In which case the "logic" that determines which rows to select should live in the SQL query?
Are we actually able to put this logic into the SQL query? If we can get "clean" chunks from SQL which can be directly mapped to partitions, that is ideal, but that would not be a simple query. That is why I was mentioning that it can be expensive, as I imagined dumb reading from PostGIS and filtering on the dask side (which would require a unique id coming from PostGIS).
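To make the "dumb reading plus filtering" concrete, a rough sketch with made-up names (the `roads` table, its `gid` primary key, the connection URI and the 2x2 grid of chunk bounds are all hypothetical): each chunk is read with a plain bounding-box query, and rows that were read more than once are dropped afterwards on the unique id, which is indeed where the cost sits.

```python
import dask.dataframe as dd
import geopandas
from dask import delayed
from sqlalchemy import create_engine

URI = "postgresql://user:password@localhost/gisdb"  # hypothetical database


def read_chunk(bounds):
    """Read one chunk of the (hypothetical) ``roads`` table for a bounding box."""
    xmin, ymin, xmax, ymax = bounds
    sql = f"""
        SELECT gid, name, geom
        FROM roads
        WHERE geom && ST_MakeEnvelope({xmin}, {ymin}, {xmax}, {ymax}, 4326)
    """
    # create the engine inside the task so it does not need to be serialized
    return geopandas.read_postgis(sql, create_engine(URI), geom_col="geom")


# a made-up 2x2 grid of chunk bounds covering the data extent
grid = [(0, 0, 5, 5), (5, 0, 10, 5), (0, 5, 5, 10), (5, 5, 10, 10)]

# without an explicit ``meta``, dask reads the first chunk to infer the schema
ddf = dd.from_delayed([delayed(read_chunk)(b) for b in grid])

# geometries crossing a chunk boundary were read more than once;
# drop them on the unique id (this compares rows across partitions, hence expensive)
ddf = ddf.drop_duplicates(subset="gid")
```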
> Are we actually able to put this logic into the SQL query?
If we cannot (and thus doing the "dumb reading from PostGIS and filtering/repartitioning on the dask side"), I am not sure how useful the method would be.
Repartitioning will typically be something you want to do once, and then try to "persist" the result or write the result to disk using a more efficient format (and a format that preserves the spatial partitioning information). Certainly in case the PostGIS table is larger than memory, doing the full shuffle each time when reading + doing a calculation will not be very efficient, I think.
I have had good experience in letting PostGIS handle the partitioning.
In general, the partitioning should be done right at the IO level, either by having the data preprocessed in a partitioned structure, or by having an index on the data that allows efficient spatial queries (and thus spatial partitioning). The PostGIS database falls into the second category.
When using the standard GIST index on geometry, the most efficient query would be `SELECT (...) WHERE geom_col && partition_box`. So not using ST_Intersects, which might become expensive for complex geometries.
You could then add a filter that checks whether a representative point is in the bbox (half-open intervals) to make this unique in all cases. That is a rather efficient (but verbose) addition to the query. You could also build the index `ON ST_Centroid(geometry)`.
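As an illustration of that query shape (the `parcels` table and its columns are made up, and `ST_PointOnSurface` stands in for the representative point, since centroids can fall outside a polygon): the `&&` clause lets the GIST index do the work, while the half-open point-in-box filter guarantees each row comes back for exactly one partition box.

```python
def partition_query(xmin, ymin, xmax, ymax, srid=4326):
    """Build a query returning each row of a hypothetical ``parcels`` table
    for exactly one partition box: ``&&`` uses the GIST index, the half-open
    interval test on a representative point makes the result unique."""
    return f"""
        SELECT gid, geom
        FROM parcels
        WHERE geom && ST_MakeEnvelope({xmin}, {ymin}, {xmax}, {ymax}, {srid})
          AND ST_X(ST_PointOnSurface(geom)) >= {xmin}
          AND ST_X(ST_PointOnSurface(geom)) <  {xmax}
          AND ST_Y(ST_PointOnSurface(geom)) >= {ymin}
          AND ST_Y(ST_PointOnSurface(geom)) <  {ymax}
    """
```

One caveat of the half-open intervals: the boxes on the outer edge of the grid need to extend past the data extent (or use a closed upper bound there), otherwise rows whose representative point sits exactly on the global maximum edge would be dropped.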
However in my experience it is often faster to accept some duplicates and filter them out when the partitions are merged. The duplicates give some superfluous computations, but this is an edge effect. For many spatial analyses you actually need the duplicates to be there.
Another issue with an external database is: how to choose the partitions without doing a full scan through the data? I always did some square grid and supplied the extent of it myself, which mostly results in a big imbalance between partition sizes. This is certainly not a show-stopper. When comparing geometries with array tiles, this is actually a good idea.
For more balanced partitions: maybe we could leverage the internal index structure of the PostGIS R-tree?
Once there is a spatial partitioning backend in place, we could read chunks from PostGIS via SQL (`ST_Intersects` or something). We may also be able to link it to the `chunksize` parameter and read into partitions directly (I think that adapting `dask.dataframe.read_sql_table` to handle geometries would do it), but spatially constrained reading would be better.
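For what that `chunksize` / `read_sql_table` route could look like, a very rough sketch (the `roads` table, `gid` index column and connection URI are made up; the exact `read_sql_table` arguments differ between dask versions, the geometry typically arrives as hex-encoded WKB when SQLAlchemy does not know the PostGIS type, and it assumes a dask-geopandas version where `from_dask_dataframe` takes a `geometry` keyword): read plain chunks split on the integer index, then decode the geometry column per partition. As said, this is not spatially constrained, so a spatially aware reader would still be preferable.

```python
import dask.dataframe as dd
import dask_geopandas
import geopandas

URI = "postgresql://user:password@localhost/gisdb"  # hypothetical database

# split the (hypothetical) ``roads`` table into chunks on its integer index;
# plain read_sql_table knows nothing about PostGIS, so ``geom`` arrives as WKB
ddf = dd.read_sql_table("roads", URI, index_col="gid", bytes_per_chunk="64 MiB")


def decode_geom(df):
    """Replace the WKB ``geom`` column with real geometries (per partition)."""
    geoms = geopandas.GeoSeries.from_wkb(df["geom"], index=df.index)
    return df.assign(geom=geoms)


ddf = ddf.map_partitions(decode_geom)

# hand the result to dask-geopandas, telling it which column holds the geometry
gddf = dask_geopandas.from_dask_dataframe(ddf, geometry="geom")
```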