geopandas / geopandas

Python tools for geographic data
http://geopandas.org/
BSD 3-Clause "New" or "Revised" License
4.34k stars 907 forks source link

TST/ENH: more robust handling of column names in GeoParquet bbox covering support #3318

Closed nicholas-ys-tan closed 4 weeks ago

nicholas-ys-tan commented 1 month ago

Resolves #3308

This is a continuation of #3282 to address 3 items:

- ensure this is robust for the geometry column name (no hardcoded "geometry") The was an implicit hardcoding that occurred when checking of the covering encoding was in the metadata, and when seeking the bbox encoding field name. My approach assumes that there will only be one entry for the geo_metadata columns, which I assume to be a valid assumption as there can only be one active geometry.

- test reading a file that uses a different column name as "bbox" The code as it is addresses this by looking at the metadata to discover the field name. To facilitate testing of this, some refactoring was done to add the field name as a kwarg in the private functions. This still maintains "bbox" as the only allowable fieldname when writing, but this refactor facilitated writing the tests. Hope that is acceptable.

- test writing in case your geodataframe already has a bbox column When testing, no error showed when writing the parquet file, but an error was raised when reading - it was not a very descriptive error but seems to be attributed to two fields having identical names. A ValueError has been added if the dataframe already has a column with the name "bbox" and the user has write_bbox_covering=True.

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason - in which case I should be parsing their bbox column to ensure formatting is appropriate and pushing that in, instead of calculating it for the user?

jorisvandenbossche commented 1 month ago

One thing that I wanted to confirm is that we don't want the user to be able to specify their own bbox column? So they can put in their own custom bounds that they calculate or modify themselves for whatever reason

It's true that you might already have this data available, but personally I would leave that until later in case there is actually user request for this. In general calculating bounds is not that expensive.

jorisvandenbossche commented 1 month ago

In general calculating bounds is not that expensive.

I did a quick test using the nz-building-outlines.gpkg file (30 million polygons, 1.2 GB gpkg file), and writing that file with write_covering_bbox=True enabled. For this case, calculating the bounds for the bbox column takes around 2% of the to_parquet time.

jorisvandenbossche commented 4 weeks ago

Thanks @nicholas-ys-tan!