We should have an option to spatially partition data before writing to GeoParquet. At this stage, we should do it entirely in-memory, which we can relax at a later date. This would also be a good precursor to a DataFusion-based extension further down the line.
We may want to relax the input requirements of `Sort`. Right now it requires the total bounds of all input boxes before sorting, but only `HilbertSort` needs the global input bounding box. `STRSort` doesn't use it; it only needs the number of items and the node size. It would be nice to update that API so that `STRSort` doesn't need that input. (Maybe refactor to a different method, outside the trait, so that we can call `STRSort::sort()` manually without going through the trait.)
It would also be nice to split the sorting and expose a lower-level API that sorts only the raw boxes and not the higher-level nodes, because in GeoParquet we don't have higher-level nodes.
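As a rough illustration of what such a lower-level entry point could look like, here is a minimal sketch of a Sort-Tile-Recursive pass over raw boxes. It needs only the boxes and a node size, and no global bounding box. The function name `str_sort_indices` and the flat `[min_x, min_y, max_x, max_y]` layout are assumptions made for this example; this is not geo-index's actual API.

```rust
/// Hypothetical free function (not part of geo-index): returns row indices
/// ordered by a single Sort-Tile-Recursive pass over the raw boxes.
/// `boxes` is assumed to be a flat slice of [min_x, min_y, max_x, max_y] per item.
pub fn str_sort_indices(boxes: &[f64], node_size: usize) -> Vec<usize> {
    assert!(boxes.len() % 4 == 0, "boxes must hold 4 values per item");
    assert!(node_size > 0, "node_size must be positive");
    let num_items = boxes.len() / 4;

    // Box centers are the sort keys; no global bounds are needed.
    let center_x = |i: usize| (boxes[4 * i] + boxes[4 * i + 2]) / 2.0;
    let center_y = |i: usize| (boxes[4 * i + 1] + boxes[4 * i + 3]) / 2.0;

    let mut indices: Vec<usize> = (0..num_items).collect();
    if num_items == 0 {
        return indices;
    }

    // Step 1: order every item by its x center.
    indices.sort_by(|&a, &b| center_x(a).total_cmp(&center_x(b)));

    // Step 2: cut the x-ordered items into vertical slices and order each
    // slice by y center. With P leaves there are ceil(sqrt(P)) slices, and
    // each slice holds that many leaves' worth of items.
    let num_leaves = (num_items + node_size - 1) / node_size;
    let num_slices = (num_leaves as f64).sqrt().ceil() as usize;
    let slice_len = num_slices * node_size;
    for slice in indices.chunks_mut(slice_len) {
        slice.sort_by(|&a, &b| center_y(a).total_cmp(&center_y(b)));
    }
    indices
}
```

The only inputs are the item count (implied by the slice length) and the node size, which is the point of the proposed refactor. A Hilbert-based variant would additionally need the global bounds to normalize coordinates.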
Note here that we want to sort across multiple chunks; we don't want to solely sort within each input chunk. So we'll presumably want to allocate an `arange` (a global row index from 0 to the total row count) covering all rows across all chunks.
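To make the cross-chunk idea concrete, here is a minimal sketch that builds a global arange over all chunks, sorts it by a per-row key, and maps each global index back to a (chunk, row) pair. The helper names (`chunk_offsets`, `locate`, `sort_across_chunks`) are hypothetical, and a plain `f64` key stands in for whatever the spatial sort would actually compare.

```rust
/// Hypothetical helper: cumulative row offsets for the given chunk lengths.
/// For lengths [3, 2] this yields [0, 3, 5].
pub fn chunk_offsets(chunk_lens: &[usize]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(chunk_lens.len() + 1);
    let mut total = 0;
    offsets.push(total);
    for &len in chunk_lens {
        total += len;
        offsets.push(total);
    }
    offsets
}

/// Map a global row index back to (chunk index, row within that chunk).
pub fn locate(offsets: &[usize], global_row: usize) -> (usize, usize) {
    let total = *offsets.last().expect("offsets is never empty");
    assert!(global_row < total, "global row out of range");
    // Count the chunk starts that are <= global_row; the last one owns the row.
    let chunk = offsets.partition_point(|&start| start <= global_row) - 1;
    (chunk, global_row - offsets[chunk])
}

/// Sort all rows across all chunks by key, returning (chunk, row) pairs in
/// sorted order. The f64 key is a stand-in for a real spatial sort key.
pub fn sort_across_chunks(chunk_keys: &[Vec<f64>]) -> Vec<(usize, usize)> {
    let lens: Vec<usize> = chunk_keys.iter().map(Vec::len).collect();
    let offsets = chunk_offsets(&lens);
    let total = *offsets.last().expect("offsets is never empty");

    // The global arange: one index per row, spanning every chunk.
    let mut arange: Vec<usize> = (0..total).collect();
    let key = |g: usize| {
        let (chunk, row) = locate(&offsets, g);
        chunk_keys[chunk][row]
    };
    arange.sort_by(|&a, &b| key(a).total_cmp(&key(b)));
    arange.into_iter().map(|g| locate(&offsets, g)).collect()
}
```

The resulting (chunk, row) pairs are what a chunked take would consume to gather rows into the output partitions.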
[ ] Partitioning:
Then we want to handle a "chunked take"/partitioning/rechunking across all those input chunks, given some
[ ] Writing bounding box column.
We should have a way to write the bounding box column of GeoParquet 1.1. Should we only write this when we're writing WKB geometries? It's unclear, because the bounding box column can also be beneficial when the native geometries are very large: a bbox is much smaller than a big polygon.
Note also that the result of geo-index's `Sort` should give you the sorted bounding boxes, so you shouldn't need to compute the bboxes again.
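A minimal sketch of reusing already-sorted boxes for the bbox column, rather than recomputing them from the geometries. It assumes the sorted boxes arrive as a flat interleaved `[min_x, min_y, max_x, max_y]` buffer (an assumption about layout, not a statement about what geo-index returns) and splits them into the `xmin`, `ymin`, `xmax`, `ymax` fields that the GeoParquet 1.1 bbox struct column uses. `BboxColumns` and `bbox_columns_from_sorted` are hypothetical names.

```rust
/// Hypothetical container for the four fields of a GeoParquet bbox struct column.
#[derive(Debug, PartialEq)]
pub struct BboxColumns {
    pub xmin: Vec<f64>,
    pub ymin: Vec<f64>,
    pub xmax: Vec<f64>,
    pub ymax: Vec<f64>,
}

/// Split an interleaved, already-sorted box buffer into per-field columns.
/// Assumes 4 values per row in the order [min_x, min_y, max_x, max_y].
pub fn bbox_columns_from_sorted(boxes: &[f64]) -> BboxColumns {
    assert!(boxes.len() % 4 == 0, "boxes must hold 4 values per row");
    let num_rows = boxes.len() / 4;
    let mut cols = BboxColumns {
        xmin: Vec::with_capacity(num_rows),
        ymin: Vec::with_capacity(num_rows),
        xmax: Vec::with_capacity(num_rows),
        ymax: Vec::with_capacity(num_rows),
    };
    // Row order is preserved, so the columns line up with the sorted geometries.
    for b in boxes.chunks_exact(4) {
        cols.xmin.push(b[0]);
        cols.ymin.push(b[1]);
        cols.xmax.push(b[2]);
        cols.ymax.push(b[3]);
    }
    cols
}
```

Each of the four vectors could then back one child array of the struct column when writing.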
Reference: the `Sort` trait of geo-index (e.g. https://github.com/kylebarron/geo-index/blob/main/src/rtree/sort/str.rs).

cc @dahnj