geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Incorrect `geo` metadata bbox for partitioned datasets when reading / writing to parquet #94

Open · TomAugspurger opened this issue 3 years ago

TomAugspurger commented 3 years ago

This might be a known issue, but currently the geo metadata is incorrect when round-tripping a partitioned DaskGeoDataFrame.

Some setup:

import json
import geopandas
import dask_geopandas
import pyarrow.parquet

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(df, npartitions=2)

df.to_parquet("single.parquet")
ddf.to_parquet("partitioned.parquet")

With a single dataset (just geopandas), here's the bbox:

>>> json.loads(pyarrow.parquet.ParquetDataset("single.parquet", use_legacy_dataset=False).schema.metadata[b"geo"])["columns"]["geometry"]["bbox"]
[-180.0, -90.0, 180.00000000000006, 83.64513000000001]

For the partitioned dataset, the bbox is:

>>> json.loads(pyarrow.parquet.ParquetDataset("partitioned.parquet", use_legacy_dataset=False).schema.metadata[b"geo"])["columns"]["geometry"]["bbox"]
[-180.0, -55.61183, 180.00000000000006, 83.64513000000001]

I need to confirm what's going on, but I think the _common_metadata is using just the data / metadata from a single partition (not sure if it's the first or the last), which has the smaller bounding box. We might need to rewrite the _common_metadata with a union of the bounding boxes.
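
For example, the union could be computed from the per-file geo metadata. A rough sketch (assuming dask's default part.*.parquet file naming inside the dataset directory, and that each part file carries its own geo metadata):

import glob
import json
import pyarrow.parquet as pq

bounds = None
for path in sorted(glob.glob("partitioned.parquet/*.parquet")):
    # each part file written via geopandas carries its own "geo" metadata
    geo = json.loads(pq.read_schema(path).metadata[b"geo"])
    xmin, ymin, xmax, ymax = geo["columns"]["geometry"]["bbox"]
    print(path, [xmin, ymin, xmax, ymax])
    if bounds is None:
        bounds = [xmin, ymin, xmax, ymax]
    else:
        bounds = [min(bounds[0], xmin), min(bounds[1], ymin),
                  max(bounds[2], xmax), max(bounds[3], ymax)]

print("union of per-partition bboxes:", bounds)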

jorisvandenbossche commented 3 years ago

The ParquetDataset.schema attribute is generally inferred (first checking _metadata, otherwise _common_metadata, and otherwise the first file of the dataset), so in general the schema's "metadata" will not reflect the full dataset. For example, we also save a RangeIndex as metadata (storing start/stop/step), which will typically not reflect the full dataset either.

So you're probably correct that _metadata/_common_metadata being created from a single partition (at least the schema; additional row group metadata can be appended) is the cause here. That's something we could in theory fix. But then you would still run into the same issue when no _metadata/_common_metadata file is present in your partitioned dataset and the schema is inferred from the first file. Thus, checking ParquetDataset.schema.metadata["geo"] will probably never be a robust way to get the "full bbox".
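
To see where that partial bbox comes from, you can inspect the _common_metadata file (or any single part file) directly. A quick sketch, assuming dask wrote the usual _metadata/_common_metadata files next to the part files:

import json
import pyarrow.parquet as pq

# the schema stored in _common_metadata comes from a single partition,
# so this should show that partition's bbox rather than the full one
meta = pq.read_schema("partitioned.parquet/_common_metadata").metadata
print(json.loads(meta[b"geo"])["columns"]["geometry"]["bbox"])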

A way that you can currently check the global bbox is with:

ddf = dask_geopandas.read_parquet("partitioned.parquet")
if ddf.spatial_partitions is not None:
    bbox = ddf.spatial_partitions.total_bounds

That is without reading any actual data (although read_parquet does touch all files to read the parquet FileMetaData and parse the geo metadata for the bbox).

TomAugspurger commented 3 years ago

Thanks for the explanation.

For reference, I'm looking to catalog some parquet datasets with STAC, and am automating item generation in https://github.com/TomAugspurger/stac-table/blob/79b94eb9d3eb95aac4bcf3b07538bbcbcfbfd837/stac_table.py#L55-L63. In theory, that could be used on datasets without a _metadata / _common_metadata file, so I would want to check each partition's bounds even if dask-geopandas did "fix" the metadata files it produces.

jorisvandenbossche commented 3 years ago

Yes, then it's best to check each partition's bounds. That can still be done with my snippet above if you want (or by parsing each file's metadata yourself), although you will need a fallback to calculate the bounds when that metadata is missing, I suppose.
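
For the fallback, something along these lines could work (a sketch, assuming dask-geopandas' lazy total_bounds, which does read the geometry data when computed):

import dask_geopandas

ddf = dask_geopandas.read_parquet("partitioned.parquet")
if ddf.spatial_partitions is not None:
    # cheap path: bounds derived from the parquet geo metadata
    bbox = ddf.spatial_partitions.total_bounds
else:
    # fallback: compute the bounds from the geometries themselves (reads the data)
    bbox = ddf.geometry.total_bounds.compute()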

In the context of STAC, I suppose the STAC item's own metadata will store the bbox information as well? (Once generated, that is; you still need to compute it when generating the item, of course.)

TomAugspurger commented 3 years ago

> In the context of STAC, I suppose the STAC item's own metadata will store the bbox information as well? (Once generated, that is; you still need to compute it when generating the item, of course.)

Yep, the item has the spatial extent, which would be the union of all the bounding boxes. The spec also supports multiple bounding boxes for disjoint regions (see https://planetarycomputer.microsoft.com/dataset/hrea), though I'm not sure whether it makes sense to do that for parquet datasets where there are potentially thousands of partitions / bounding boxes.