Closed by kylebarron 2 months ago
Ah 😬. I wonder if it's reproducible when run against a pair of files, and against a pair of significantly smaller/truncated files.
This worked before the object store change, right?
Thinking about it, this dataset is big enough that any leaks in `get_opts` could cause something similar to this (the channels used to smuggle the non-Send stream might be the culprit).
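For context, the pattern in question looks roughly like this (a self-contained sketch, not the actual object store code): a local task drives the `!Send` stream and forwards its items over a channel whose receiving half is `Send`, so if that forwarding task ever leaks or stalls, the consumer on the other end waits forever.

```rust
use futures::channel::mpsc;
use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;
use futures::StreamExt;
use std::rc::Rc;

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Rc makes this stream !Send; it stands in for a browser-backed byte stream.
    let mut non_send_stream =
        futures::stream::iter(vec![Rc::new(1u8), Rc::new(2), Rc::new(3)]);

    // The channel "smuggles" plain (Send) values out of the local-only task.
    let (tx, rx) = mpsc::unbounded::<u8>();
    spawner
        .spawn_local(async move {
            while let Some(item) = non_send_stream.next().await {
                let _ = tx.unbounded_send(*item);
            }
            // Dropping tx here closes the channel; if this task were leaked or
            // never driven to completion, the receiver would wait forever and
            // the stream's buffers would stay alive.
        })
        .unwrap();

    // The receiving half is Send and can be consumed from ordinary async code.
    let collected: Vec<u8> = pool.run_until(rx.collect());
    println!("{collected:?}");
}
```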
Sorry for the confusion: this didn't work before the object store change either. I haven't tried it again with the object store change merged in, but presumably this is an issue in my code, in how I'm running the futures.
I'm trying to play around with `FuturesUnordered` and a stream, thinking that maybe the `join_all` is too much to collect at once? But then I'm lost on how to collect the futures into a single table of record batches.
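For what it's worth, a minimal sketch of that shape (not geoarrow-rs code; `fetch_batches` and the placeholder `RecordBatch` are stand-ins) drains per-file reads with `FuturesUnordered` as they complete, rather than collecting every output at once as `join_all` does:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder standing in for an Arrow record batch.
#[derive(Debug)]
struct RecordBatch(usize);

// Placeholder standing in for "read one Parquet file into record batches".
async fn fetch_batches(path: String) -> Result<Vec<RecordBatch>, String> {
    Ok(vec![RecordBatch(path.len())])
}

async fn read_all(paths: Vec<String>) -> Result<Vec<RecordBatch>, String> {
    // All futures are polled concurrently, but results are handled one at a
    // time as they complete instead of being buffered together.
    let mut futures: FuturesUnordered<_> =
        paths.into_iter().map(fetch_batches).collect();

    let mut batches = Vec::new();
    while let Some(result) = futures.next().await {
        batches.extend(result?);
    }
    Ok(batches)
}

fn main() {
    let paths = vec!["part1.parquet".to_string(), "part2.parquet".to_string()];
    let batches = futures::executor::block_on(read_all(paths)).unwrap();
    println!("collected {} batches", batches.len());
}
```

The same `while let` loop can also feed a stream or write batches out incrementally instead of holding them all in memory.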
It's possible that the underlying code is in principle ok, but that it's failing with something specific to the large overture data.
Trying the most minimal test with local files (but using the same code path) works fine for me. Creating a basic multi-file Parquet dataset:
```py
import geopandas as gpd

gdf = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
gdf.iloc[:50].to_parquet("naturalearth_cities_part1.parquet")
gdf.iloc[50:100].to_parquet("naturalearth_cities_part2.parquet")
gdf.iloc[100:150].to_parquet("naturalearth_cities_part3.parquet")
gdf.iloc[150:200].to_parquet("naturalearth_cities_part4.parquet")
gdf.iloc[200:250].to_parquet("naturalearth_cities_part5.parquet")
```
Then reading via the geoarrow-rs Python bindings (`virtualenv env && source ./env/bin/activate && maturin develop --release` in the `python/core` dir) works fine (I've also hit the same "deadlock" issue in Python when reading a bounding box from a list of files):
```py
from geoarrow.rust.core import ParquetDataset, ObjectStore, ParquetFile

root = "/Users/kyle/data/parquet"
store = ObjectStore(root, {})

path = "naturalearth_cities_part1.parquet"
file = ParquetFile(path, fs=store)

paths = [
    "naturalearth_cities_part1.parquet",
    "naturalearth_cities_part2.parquet",
    "naturalearth_cities_part3.parquet",
    "naturalearth_cities_part4.parquet",
    "naturalearth_cities_part5.parquet",
]
dataset = ParquetDataset(paths, fs=store)
table = dataset.read()
table.to_geopandas()
```
Pasting some references for myself:
Alright, I've done a bit of digging, and it ended up being the handling of zero-length WKB geometries - using the same bbox for anything other than part-0026 results in zero rows, which in debug builds triggered panics in `parse_geometry_to_native`.
With a workaround in `src/io/parquet/reader/async`'s `read_builder` function involving the table length (full details in the incoming PR), total execution time for me on a 100Mbps connection was about 15s (instant when targeting local files, of course).
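For illustration only, the guard is roughly of this shape (a minimal sketch; the real fix is in the PR and the names here are hypothetical):

```rust
// Hypothetical stand-in for the table produced by the reader.
struct Table {
    num_rows: usize,
    // ... record batches, geometry column metadata, etc.
}

fn finish_read(table: Table) -> Result<Table, String> {
    // If the bbox filter matched nothing, return the (empty) table as-is
    // rather than handing a zero-length WKB array to the native geometry
    // parser, which is what panicked in debug builds.
    if table.num_rows == 0 {
        return Ok(table);
    }
    // ... otherwise parse WKB into native geometry arrays as usual.
    Ok(table)
}

fn main() {
    let empty = Table { num_rows: 0 };
    assert!(finish_read(empty).is_ok());
}
```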
The `ParquetDataset` structs will definitely benefit from a `read_stream` method though; I blew the wasm memory limit a bunch of times :| with bigger bboxes.
> Alright, I've done a bit of digging, and it ended up being the handling of zero-length WKB geometries - using the same bbox for anything other than part-0026 results in zero rows, which in debug builds triggered panics in `parse_geometry_to_native`.
That is amazing! Thank you so much for debugging this!
> I blew the wasm memory limit a bunch of times :| with bigger bboxes.
I think this will also improve when I re-enable https://github.com/geoarrow/geoarrow-rs/blob/cae880040f933c281f12852be3ea60bd1d6ba907/src/io/parquet/reader/options.rs#L67-L68
This was implemented here: https://github.com/geoarrow/geoarrow-rs/blob/cae880040f933c281f12852be3ea60bd1d6ba907/src/io/parquet/reader/spatial_filter.rs#L198-L225
But the issue is that the order of the columns in that struct matches the original data, not our query. So, for example, the overture data is laid out as `xmin, xmax, ymin, ymax`. So I need to add a few lines of code to ensure we're matching the correct indices into the struct based on the column names, as in the sketch below.
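Something along these lines would do it (an illustrative sketch, not the actual geoarrow-rs code): resolve the four child indices by name once, then use them when reading the struct's min/max statistics.

```rust
// Resolve bbox child-column positions by name instead of assuming a fixed
// (xmin, ymin, xmax, ymax) order. Returns None if any column is missing.
fn bbox_child_indices(child_names: &[&str]) -> Option<[usize; 4]> {
    let find = |name: &str| child_names.iter().position(|c| *c == name);
    Some([find("xmin")?, find("ymin")?, find("xmax")?, find("ymax")?])
}

fn main() {
    // Overture lays the struct out as xmin, xmax, ymin, ymax.
    let overture = ["xmin", "xmax", "ymin", "ymax"];
    assert_eq!(bbox_child_indices(&overture), Some([0, 2, 1, 3]));
}
```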
But for something like the overture data, where attributes are quite large in comparison to the bbox columns, this should significantly improve perf and memory use, because attributes won't even be decoded from Parquet for pages that are excluded from the bounding box query.
E.g. I was just testing with overture buildings data with a small bounding box over Boston and received a 900k row dataset, where the first 100k rows are nowhere near Boston.
I imagine that a small number of data pages actually intersect the query of interest.
@H-Plus-Time it turns out that what I thought was a deadlock was actually a horrible Deno bug: https://github.com/denoland/deno/issues/23385
Yeah, I feel like it's Jupyter kernel specific too: the moment I switched from `deno run` to using Jupyter and accidentally triggered a panic (e.g. a URL that pointed nowhere), it hung forever.
I got the same behavior outside of Jupyter, but it had to be async.
@H-Plus-Time in case you have any interest in reading multi-file GeoParquet datasets, I'd love it if you were able to check over my work, because I seem to have a deadlock somewhere.
For example, testing on this branch this works (in Deno)
but then this addition:
seems to deadlock: