Function filter fragments returns the whole dataset

nagyrobir commented 1 year ago

Hi! Thank you for the work you do for the community. I don't know how far along this project is, but i have installed the libraries. I imported a geoparquet (one generated with QGIS the other with geopandas) and it reads it in. I might have understood it wrong from the code but would the geometry input from "filter_fragments" method from the GeoDataset class be used as a filter so that you read in only the data that intersects the bounding box of the input from "filter_fragments"? If that is so, i have tried the following

`import geopandas as gpd import pyarrow.parquet as pa from pyarrow.parquet import read_table import shapely import geoarrow.pyarrow as ga

tb = read_table(r"/home/parquet/buildings.parquet") dataset = ga.dataset(tb,geometry_columns=["geometry"])

gpdf_mask = gpd.read_file("/home/shapefiles/area_1.gpkg") bnds = gpdf_mask.iloc[0].geometry.wkt x = dataset.filter_fragments(bnds)`

I have tried with datasets in both epsg:4326 and epsg:25833. When i run len(x.to_table()) the length of the result is the same as the length of the original dataset. The geometries are saved as wkb.

I have used the following files to do the testing. I dont know if it is a bug or if it is something i am doing wrong.

paleolimbot commented 1 year ago

The GeoDataset is pretty experimental (and should be marked better as such!)...it mostly works if you've taken care to write the data very carefully such that row groups are small and contain vageuly proximal features. There's no function implemented to actually do that, so it's actually not all that useful 😬 .

(I'll take a look at the next issue which is almost certainly a bug!)

nagyrobir commented 1 year ago

Oh shoot! Would be so ridiculously cool to have predicate pushdown on a geometry level that would work on an "intersects" level even if it just a bounding box. I think the people developing Apache Sedona did it, but having it implemented on function level on a PC would be wonderful. Thanks for the response!

paleolimbot commented 1 year ago

It could almost certainly be done as a Python user-defined function based on the functions that currently exist in this package. That's not currently a development priority but it's definitely a good idea!

paleolimbot commented 1 year ago

Note that there is geoarrow.pyarrow.box(), which will calculate feature-level bounding boxes.

nagyrobir commented 1 year ago

I am not inteligent enough to know how difficult would be to implement, but having a the functionality of "Select all geometries that intersect this bounding box " without reading in all the data, is something that is a game changer for Geospatial Analysts working with millions of features. So i'll be watching for that PR. :D

nagyrobir commented 1 year ago

I was just wondering @paleolimbot How could i write the data so that the function works based on the files that i have provided. Any ideas?

paleolimbot commented 1 year ago

The easiest way that I know about is a "hilbert sort" (geopandas.GeoSeries.hilbert_distance()) and then pyarrrow.parquet.write_table() with relatively small row groups.

nagyrobir commented 1 year ago

I tried so hard and got so far, but in the end...couldn't make it work... Still returns back the whole dataset. The row groups return batches of 50 which are grouped. Am i missing something? (*cries)

import pyarrow.parquet as pa
import geopandas as gpd
import geoarrow.pyarrow as ga
from geoarrow.pyarrow.io import read_pyogrio_table
from geoarrow.pyarrow.dataset import GeoDataset
from geoarrow.pyarrow import dataset
from pyarrow.dataset import dataset as pads
from shapely.geometry import box,Polygon
from shapely.wkt import loads

#set all paths so nothing needs to be changed in the code

building_file = "buildings_wgs84.parquet"
building_reordered = "buildings_reordered.parquet"
filter_area = "area.gpkg"

#reading the original dataset
pa_ds = pads(building_file )
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

#hilbert_distance + sorting by index
gpd_ds=gpd.GeoDataFrame(ga_ds.to_table().to_pandas())
gpd_ds["geometry"]=gpd.GeoSeries.from_wkb(gpd_ds["geometry"])
sorted_idx = gpd_ds["geometry"].hilbert_distance().sort_values().index
gdf_sorted = gpd_ds.reindex(sorted_idx)

#writing it back to geoparquet
##COMMENT  OUT IF YOU DO NOT WANT TO OVERWRITE
gdf_sorted.to_parquet(building_reordered,row_group_size = 100)

#import the mask and create a wkt out of it
gpdf_mask = gpd.read_file(filter_area , engine="pyogrio").explode()
gpdf_mask = gpdf_mask.set_crs(epsg=4326)
bnds = gpdf_mask['geometry'].iloc[0].wkt

#read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

filtered_ds = ga_ds.filter_fragments(bnds).to_table()
len(filtered_ds) == len(gdf_sorted)

Length of the filtered

jorisvandenbossche commented 1 year ago

Just a quick naive question: you are sure the bnds are in the same CRS as that data in ga_ds? Because this won't be checked, and eg if querying EPSG:4326 data with a box of projected coordinates can be a silent bug.

Can you provide a reproducible example? The link for the files above doesn't work anymore.

nagyrobir commented 1 year ago

Hi @jorisvandenbossche. I appologize for the unclear comment. I have updated the above code snippet, and now you can access the files again at the link provided in the beggining.

nagyrobir commented 1 year ago

I think i have sort of made it work. Instead of reading in the dataset as pyarrow dataset and converting it to a GeoDataset using "from pyarrow.dataset import dataset as pads" i used "from geoarrow.pyarrow.dataset import dataset".

#read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])
filtered_ds = ga_ds.filter_fragments(bnds).to_table()

I've used this:


from geoarrow.pyarrow.dataset import dataset
ga_ds= dataset("buildings_reordered.parquet",geometry_columns= ["geometry"], use_row_groups=True)
filtered_ds = ga_ds.filter_fragments(bnds).to_table()

Of course the dataset in filtered_ds is much larger, but that would be consinstent with the documentation, where bbox intersection would be applied.

geoarrow / geoarrow-python

Function filter fragments returns the whole dataset #16