Closed by kylebarron 2 months ago
Ah 😬. I wonder if it's reproducible when run against a pair of files, and against a pair of significantly smaller/truncated files.
This worked before the object store change, right?
Thinking about it, this dataset is big enough that any leaks in `get_opts` could cause something similar to this (the channels used to smuggle the non-Send stream might be the culprit).
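For context, the pattern in question looks roughly like this (a self-contained sketch, not the actual object store code): a local task drives the `!Send` stream and forwards its items over a channel whose receiving half is `Send`, so if that forwarding task ever leaks or stalls, the consumer on the other end waits forever.

```rust
use futures::channel::mpsc;
use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;
use futures::StreamExt;
use std::rc::Rc;

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Rc makes this stream !Send; it stands in for a browser-backed byte stream.
    let mut non_send_stream =
        futures::stream::iter(vec![Rc::new(1u8), Rc::new(2), Rc::new(3)]);

    // The channel "smuggles" plain (Send) values out of the local-only task.
    let (tx, rx) = mpsc::unbounded::<u8>();
    spawner
        .spawn_local(async move {
            while let Some(item) = non_send_stream.next().await {
                let _ = tx.unbounded_send(*item);
            }
            // Dropping tx here closes the channel; if this task were leaked or
            // never driven to completion, the receiver would wait forever and
            // the stream's buffers would stay alive.
        })
        .unwrap();

    // The receiving half is Send and can be consumed from ordinary async code.
    let collected: Vec<u8> = pool.run_until(rx.collect());
    println!("{collected:?}");
}
```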
Sorry for the confusion: this didn't work before the object store change either. I haven't tried it again with the object store change merged in, but presumably this is an issue in my code, in how I'm running the futures.
I'm trying to play around with `FuturesUnordered` and a stream, thinking that maybe the `join_all` is too much to collect at once? But then I'm lost on how to collect the futures into a single table of record batches.
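For what it's worth, a minimal sketch of that shape (not geoarrow-rs code; `fetch_batches` and the placeholder `RecordBatch` are stand-ins) drains per-file reads with `FuturesUnordered` as they complete, rather than collecting every output at once as `join_all` does:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder standing in for an Arrow record batch.
#[derive(Debug)]
struct RecordBatch(usize);

// Placeholder standing in for "read one Parquet file into record batches".
async fn fetch_batches(path: String) -> Result<Vec<RecordBatch>, String> {
    Ok(vec![RecordBatch(path.len())])
}

async fn read_all(paths: Vec<String>) -> Result<Vec<RecordBatch>, String> {
    // All futures are polled concurrently, but results are handled one at a
    // time as they complete instead of being buffered together.
    let mut futures: FuturesUnordered<_> =
        paths.into_iter().map(fetch_batches).collect();

    let mut batches = Vec::new();
    while let Some(result) = futures.next().await {
        batches.extend(result?);
    }
    Ok(batches)
}

fn main() {
    let paths = vec!["part1.parquet".to_string(), "part2.parquet".to_string()];
    let batches = futures::executor::block_on(read_all(paths)).unwrap();
    println!("collected {} batches", batches.len());
}
```

The same `while let` loop can also feed a stream or write batches out incrementally instead of holding them all in memory.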
It's possible that the underlying code is in principle ok, but that it's failing with something specific to the large overture data.
Trying the most minimal test with local files (but using the same code path) works fine for me. Creating a basic multi-file Parquet dataset:
```py
import geopandas as gpd

gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
gdf.iloc[:50].to_parquet("naturalearth_cities_part1.parquet")
gdf.iloc[50:100].to_parquet("naturalearth_cities_part2.parquet")
gdf.iloc[100:150].to_parquet("naturalearth_cities_part3.parquet")
gdf.iloc[150:200].to_parquet("naturalearth_cities_part4.parquet")
gdf.iloc[200:250].to_parquet("naturalearth_cities_part5.parquet")
```
Then reading via the geoarrow-rs Python bindings (`virtualenv env && source ./env/bin/activate && maturin develop --release` in the `python/core` dir) works fine (I've also hit the same "deadlock" issue in Python when reading a bounding box from a list of files):
```py
from geoarrow.rust.core import ParquetDataset, ObjectStore, ParquetFile

root = "/Users/kyle/data/parquet"
store = ObjectStore(root, {})

path = "naturalearth_cities_part1.parquet"
file = ParquetFile(path, fs=store)

paths = [
    "naturalearth_cities_part1.parquet",
    "naturalearth_cities_part2.parquet",
    "naturalearth_cities_part3.parquet",
    "naturalearth_cities_part4.parquet",
    "naturalearth_cities_part5.parquet",
]
dataset = ParquetDataset(paths, fs=store)
table = dataset.read()
table.to_geopandas()
```
Pasting some references for myself:
Alright, I've done a bit of digging, and it ended up being the handling of zero-length WKB geometries - using the same bbox for anything other than part-0026 results in zero rows, which in debug builds triggered panics in `parse_geometry_to_native`.
With a workaround in `src/io/parquet/reader/async`'s `read_builder` function involving the table length (full details in the incoming PR), total execution time for me on a 100Mbps connection was about 15s (instant when targeting local files, of course).
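For illustration only, the guard is roughly of this shape (a minimal sketch; the real fix is in the PR and the names here are hypothetical):

```rust
// Hypothetical stand-in for the table produced by the reader.
struct Table {
    num_rows: usize,
    // ... record batches, geometry column metadata, etc.
}

fn finish_read(table: Table) -> Result<Table, String> {
    // If the bbox filter matched nothing, return the (empty) table as-is
    // rather than handing a zero-length WKB array to the native geometry
    // parser, which is what panicked in debug builds.
    if table.num_rows == 0 {
        return Ok(table);
    }
    // ... otherwise parse WKB into native geometry arrays as usual.
    Ok(table)
}

fn main() {
    let empty = Table { num_rows: 0 };
    assert!(finish_read(empty).is_ok());
}
```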
The `ParquetDataset` structs will definitely benefit from a `read_stream` method though; I blew the wasm memory limit a bunch of times :| with bigger bboxes.
> Alright, I've done a bit of digging, and it ended up being the handling of zero-length WKB geometries - using the same bbox for anything other than part-0026 results in zero rows, which in debug builds triggered panics in `parse_geometry_to_native`.
That is amazing! Thank you so much for debugging this!
> I blew the wasm memory limit a bunch of times :| with bigger bboxes.
I think this will also improve when I re-enable https://github.com/geoarrow/geoarrow-rs/blob/cae880040f933c281f12852be3ea60bd1d6ba907/src/io/parquet/reader/options.rs#L67-L68
This was implemented here: https://github.com/geoarrow/geoarrow-rs/blob/cae880040f933c281f12852be3ea60bd1d6ba907/src/io/parquet/reader/spatial_filter.rs#L198-L225
But the issue is that the order of the columns in that struct matches the original data, not our query. So, for example, the overture data is laid out as `xmin, xmax, ymin, ymax`. So I need to add a few lines of code to ensure we're matching the correct indices into the struct based on the column names, as in the sketch below.
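Something along these lines would do it (an illustrative sketch, not the actual geoarrow-rs code): resolve the four child indices by name once, then use them when reading the struct's min/max statistics.

```rust
// Resolve bbox child-column positions by name instead of assuming a fixed
// (xmin, ymin, xmax, ymax) order. Returns None if any column is missing.
fn bbox_child_indices(child_names: &[&str]) -> Option<[usize; 4]> {
    let find = |name: &str| child_names.iter().position(|c| *c == name);
    Some([find("xmin")?, find("ymin")?, find("xmax")?, find("ymax")?])
}

fn main() {
    // Overture lays the struct out as xmin, xmax, ymin, ymax.
    let overture = ["xmin", "xmax", "ymin", "ymax"];
    assert_eq!(bbox_child_indices(&overture), Some([0, 2, 1, 3]));
}
```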
But for something like the overture data, where attributes are quite large in comparison to the bbox columns, this should significantly improve perf and memory use, because attributes won't even be decoded from Parquet for pages that are excluded from the bounding box query.
E.g. I was just testing with overture buildings data with a small bounding box over Boston and received a 900k row dataset, where the first 100k rows are nowhere near Boston.
I imagine that a small number of data pages actually intersect the query of interest.
@H-Plus-Time it turns out that what I thought was a deadlock was actually a horrible Deno bug: https://github.com/denoland/deno/issues/23385
Yeah, I feel like it's Jupyter kernel specific too: the moment I switched from `deno run` to using Jupyter and accidentally triggered a panic (e.g. a URL that pointed nowhere), it hung forever.
I got the same behavior outside of Jupyter, but it had to be async.
@H-Plus-Time in case you have any interest in reading multi-file GeoParquet datasets, I'd love it if you were able to check over my work, because I seem to have a deadlock somewhere.
For example, testing on this branch this works (in Deno)
but then this addition:
seems to deadlock: