TomAugspurger opened 3 years ago
The `ParquetDataset.schema` attribute is generally inferred (first checking `_metadata`, otherwise `_common_metadata`, and otherwise the first file of the dataset), so in general the schema's "metadata" will not reflect the full dataset. For example, we also save a RangeIndex as metadata (storing start/stop/step), which will typically also not reflect the full dataset.
So you're probably correct that `_metadata`/`_common_metadata` being created from a single partition (at least the schema; further row group metadata can be appended) is the cause here. And that's something that we could in theory fix. But then you would still run into the same issue when no `_metadata`/`_common_metadata` file is present in your partitioned dataset and the schema is inferred from the first file.
Thus, checking `ParquetDataset.schema.metadata["geo"]` will probably never be a robust way to get the "full bbox".
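For reference, a minimal sketch of what that check looks like (assuming pyarrow, a hypothetical "partitioned.parquet" path containing a `_metadata` file, and a "geometry" column):

```python
import json

import pyarrow.parquet as pq

# Read the schema stored in the _metadata file; this is the same (possibly
# partition-local) "geo" metadata that ParquetDataset.schema would expose.
schema = pq.read_schema("partitioned.parquet/_metadata")
geo = json.loads(schema.metadata[b"geo"])
print(geo["columns"]["geometry"]["bbox"])  # may reflect only one partition
```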
A way that you can currently check the global bbox is with:

```python
import dask_geopandas

ddf = dask_geopandas.read_parquet("partitioned.parquet")
if ddf.spatial_partitions is not None:
    bbox = ddf.spatial_partitions.total_bounds
```
That is without reading any actual data (although `read_parquet` does touch all files to read the parquet `FileMetaData` and parse the geo metadata to get the bbox).
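For illustration, a hedged sketch of doing that per-file parsing yourself (assumes GeoParquet-style "geo" metadata, a "geometry" column, and a hypothetical file glob):

```python
import glob
import json

import pyarrow.parquet as pq

bboxes = []
for path in sorted(glob.glob("partitioned.parquet/part.*.parquet")):  # hypothetical pattern
    geo = json.loads(pq.read_schema(path).metadata[b"geo"])
    bboxes.append(geo["columns"]["geometry"]["bbox"])  # [minx, miny, maxx, maxy]

# The union of the per-file bboxes gives the dataset-level bbox.
bbox = [
    min(b[0] for b in bboxes),
    min(b[1] for b in bboxes),
    max(b[2] for b in bboxes),
    max(b[3] for b in bboxes),
]
```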
Thanks for the explanation.
For reference, I'm looking to catalog some parquet datasets with STAC, and am automating item generation in https://github.com/TomAugspurger/stac-table/blob/79b94eb9d3eb95aac4bcf3b07538bbcbcfbfd837/stac_table.py#L55-L63. In theory, that could be used on datasets without a _metadata
/ _common_metadata
, so I would want to check each partition's bounds, even if dask-geopandas did "fix" the metadata files it produces.
Yes, then it's best to check each partition's bounds. That can still be done with my snippet above if you want, or by parsing each file's metadata yourself, although you will need a fallback to calculate the bounds when they aren't stored, I suppose.
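A hedged sketch of that fallback, using `calculate_spatial_partitions` (which does read the geometry data) when nothing usable comes from the metadata:

```python
import dask_geopandas

ddf = dask_geopandas.read_parquet("partitioned.parquet")
if ddf.spatial_partitions is None:
    # No usable "geo" metadata: compute the partition bounds from the data.
    ddf.calculate_spatial_partitions()
bbox = ddf.spatial_partitions.total_bounds
```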
In the context of STAC, I suppose the STAC item's own metadata will store the bbox information as well? (Once the item is generated, that is; you still need to compute the bbox when generating it, of course.)
Yep, the item has the spatial extent, which would be the union of all the bounding boxes. (The spec also supports multiple bounding boxes for disjoint regions; see https://planetarycomputer.microsoft.com/dataset/hrea. I'm not sure whether it makes sense to do that for parquet datasets where there are potentially thousands of partitions / bounding boxes.)
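As a hedged sketch (assuming pystac and placeholder per-partition bboxes), the item-level extent would then look something like:

```python
import pystac

# Placeholder per-partition bboxes, e.g. collected as in the sketches above.
bboxes = [[0.0, 0.0, 10.0, 10.0], [20.0, 0.0, 30.0, 10.0]]

# The overall extent is the union of the partition bboxes; the spec allows
# listing the disjoint regions after it, though with thousands of partitions
# that may not be worthwhile.
overall = [
    min(b[0] for b in bboxes),
    min(b[1] for b in bboxes),
    max(b[2] for b in bboxes),
    max(b[3] for b in bboxes),
]
extent = pystac.SpatialExtent(bboxes=[overall])
```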
This might be a known issue, but currently the `geo` metadata is incorrect when round-tripping a partitioned DaskGeoDataFrame. Some setup:
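Roughly, a hedged reconstruction of the round-trip (hypothetical paths and example data; the actual setup may differ):

```python
import geopandas
import dask_geopandas

# Example data; the original setup may differ.
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

# Single-file write with plain geopandas.
gdf.to_parquet("single.parquet")

# Partitioned write with dask-geopandas (this also writes the
# _metadata / _common_metadata files).
ddf = dask_geopandas.from_geopandas(gdf, npartitions=4)
ddf.to_parquet("partitioned.parquet")
```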
With a single dataset (just `geopandas`), here's the bbox:

For the partitioned dataset, the bbox is:
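A hedged sketch of how those two bboxes can be read for comparison, from the "geo" metadata of each (hypothetical paths, "geometry" column assumed):

```python
import json

import pyarrow.parquet as pq

# Single-file dataset: its "geo" bbox covers the full data.
single = json.loads(pq.read_schema("single.parquet").metadata[b"geo"])
print(single["columns"]["geometry"]["bbox"])

# Partitioned dataset: the bbox in _common_metadata appears to cover only
# a single partition, hence the smaller values.
common = json.loads(
    pq.read_schema("partitioned.parquet/_common_metadata").metadata[b"geo"]
)
print(common["columns"]["geometry"]["bbox"])
```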
I need to confirm what's going on, but I think that the `_common_metadata` is using just the data / metadata from a single partition (not sure if it's the first or last), which has the smaller bounding box. We might need to rewrite the `_common_metadata` with a union of the bounding boxes.
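A hedged sketch of that fix (union the per-file bboxes and rewrite `_common_metadata`; the paths, file pattern, and "geometry" column name are assumptions):

```python
import glob
import json

import pyarrow.parquet as pq

paths = sorted(glob.glob("partitioned.parquet/part.*.parquet"))
schemas = [pq.read_schema(p) for p in paths]
geos = [json.loads(s.metadata[b"geo"]) for s in schemas]
bboxes = [g["columns"]["geometry"]["bbox"] for g in geos]

# Union of the per-partition bounding boxes.
union = [
    min(b[0] for b in bboxes),
    min(b[1] for b in bboxes),
    max(b[2] for b in bboxes),
    max(b[3] for b in bboxes),
]

# Rewrite _common_metadata with the unioned bbox in its "geo" entry.
geo = geos[0]
geo["columns"]["geometry"]["bbox"] = union
schema = schemas[0].with_metadata(
    {**schemas[0].metadata, b"geo": json.dumps(geo).encode()}
)
pq.write_metadata(schema, "partitioned.parquet/_common_metadata")
```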