Closed: @ptomasula closed this issue 2 months ago
@ptomasula, thanks for creating this issue and assigning it to me. Indeed, I've already started working on a similar suite of benchmarks for writing the datasets to GeoParquet, and my very next TODO for #1 was to complete that work and report out on recommendations. I very much want to get a few test files to @rajadain by the end of the week so he can try working with them before our next meeting.

I'll start looking at your notebook.
@ptomasula, @kieranbartels, @rajadain, I've expanded the `sandbox/geoarrow_parquet.ipynb` notebook from benchmarking Read Method Performance (shared in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779) to now also benchmark write methods. Here are my first results:

Benchmarks are based on reading all fields in the 702.3 MB `TDX_streamnet_7020038340_01.gpkg` file and converting to a GeoPandas GeoDataFrame with Shapely geometry using:

```python
gdf = gpd.read_file(tdx_stream_7020038340_fp, engine='pyogrio', use_arrow=True)
```
| Compression | `gdf.to_parquet()` time | relative write speed | size in bytes | relative size | `gpd.read_parquet()` time | relative read speed |
|---|---|---|---|---|---|---|
| snappy | 3.52 s ± 54.8 ms | 1.0 | 300,048,527 | 2.7 | 2.22 s ± 43.2 ms | 1.1 |
| brotli | 53.8 s ± 1.06 s | 15.3 | 109,245,316 | 1.0 | 3.26 s ± 40.1 ms | 1.6 |
| lz4 | 3.75 s ± 39.5 ms | 1.1 | 314,102,847 | 2.9 | 2.10 s ± 65 ms | 1.0 |
| zstd | 4.15 s ± 47.8 ms | 1.2 | 186,790,117 | 1.7 | 2.37 s ± 16.2 ms | 1.1 |
| none | 2.72 s ± 37.6 ms | 0.8 | 593,366,150 | 5.4 | 2.02 s ± 64.2 ms | 1.0 |
NOTE 1: Writing the GeoDataFrame back to GeoPackage takes 11.4 s ± 823 ms using `gdf.to_file(path, driver='GPKG')`, producing a file size of 702.1 MB. This should have automatically used SOZip compression (https://gdal.org/drivers/vector/gpkg.html#compressed-files). So writing to GeoPackage is both much slower and uses more storage than GeoParquet.

NOTE 2: Re-reading the GeoParquet takes only 11.3 ms ± 310 µs if we read only the non-geometry columns!
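For context, here's a minimal sketch of the round trip behind these numbers (file paths and the column list are placeholders):

```python
# Illustrative sketch of the benchmark round trip (paths are placeholders).
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("TDX_streamnet_7020038340_01.gpkg",
                    engine="pyogrio", use_arrow=True)

# Write once per codec; compression=None produces the uncompressed file.
for codec in ["snappy", "brotli", "lz4", "zstd", None]:
    gdf.to_parquet(f"streamnet_{codec or 'none'}.parquet", compression=codec)

# Full re-read, geometry included.
gdf2 = gpd.read_parquet("streamnet_zstd.parquet")

# NOTE 2's fast path: plain pandas can read only non-geometry columns,
# skipping geometry decoding entirely (column list is illustrative).
attrs = pd.read_parquet("streamnet_zstd.parquet", columns=["LINKNO"])
```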
Recommendation: use `zstd` compression, because it provides the best balance: strong compression (31% of the uncompressed size) with read speeds that are only 17% slower. (`brotli` compresses further, but writes ~15x slower.)
In `sandbox/geoarrow_parquet.ipynb` I also explored other write options, asking two questions:

1. Does the choice of geometry encoding change the final Parquet file on disk?
2. Does using the `geoarrow` geometry type (vs. the default Shapely geometry type) improve write and re-read performance?

My conclusion to question 1 is that the final Parquet files are nearly identical (all using Arrow-like structures) and that the differences are in the in-memory representations of the geometries.
I've UPDATED my answer to question 2 after further analysis: YES, GeoArrow can further improve performance, but primarily if you stay with PyArrow Tables rather than converting to Pandas/GeoPandas dataframes. The following results use `geoarrow.pyarrow.io` methods, imported as `ga.io`.
| Compression | `ga.io.write_geoparquet_table()` time | relative write speed | size | relative size | `ga.io.read_geoparquet_table()` time | relative read speed |
|---|---|---|---|---|---|---|
| zstd | 2.03 s ± 24.2 ms | 0.6 | 186,793,666 | 1.7 | 1.48 s ± 11.1 ms | 0.7 |
Relative values use the same baselines as the table in the previous comment.
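For reference, a minimal sketch of that PyArrow-table round trip (paths are placeholders, and I'm assuming the `compression` keyword passes through to the underlying `pyarrow.parquet` writer):

```python
# Sketch of the GeoArrow round trip that keeps data in PyArrow Tables
# (paths are placeholders; compression kwarg assumed to pass through).
import geoarrow.pyarrow.io as ga_io  # aliased as ga.io in the comment above

table = ga_io.read_geoparquet_table("streamnet_zstd.parquet")
ga_io.write_geoparquet_table(table, "streamnet_geoarrow_zstd.parquet",
                             compression="zstd")
```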
@rajadain, please use this GeoParquet file for your first round of tests: TDX_streamnet_7020038340_01.parquet.
This file does not contain any of the new "Nested Set" fields that we'll add for watershed delineation. We'll get that to you shortly.
Our documentation in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779 provides general guidance on the read methods. However, we will further explore the use of `geoarrow` specifically for the use case of performing geographic calculations.
Another finding from `sandbox/geoarrow_parquet.ipynb`: although downcasting fields from `int32` to `int8` (etc.) where possible does reduce memory usage, it does not improve write or re-read speeds and barely improves storage size. This is true whether using `dtype_backend='pyarrow'` or the default `'numpy_nullable'`.
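A minimal sketch of that experiment (paths and column names are placeholders):

```python
# Illustrative downcast experiment ("order" is a placeholder for any
# column whose values fit in int8).
import pandas as pd

df = pd.read_parquet("streamnet_zstd.parquet",
                     columns=["LINKNO", "order"],
                     dtype_backend="pyarrow")

# Downcasting shrinks the in-memory footprint...
df["order"] = df["order"].astype("int8[pyarrow]")

# ...but the file on disk barely shrinks, because Parquet already
# bit-packs and dictionary-encodes integer columns.
df.to_parquet("streamnet_small_ints.parquet", compression="zstd")
```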
With the addition of `examples/4_ProcessBasinToParquet.ipynb` in commit cb245a08bef4ac5f07137d5157619dfe7eb06185, we now have functions for processing our original TDX Hydro GeoPackage files and saving them to compressed GeoParquet files. Specifically:
- `create_tdx_mnsi()` in `examples/3_GenerateModifiedNestedSetIndex.ipynb` saves a streamnet GeoParquet file with Modified Nested Set Index fields, LINKNO fields converted to globally unique values, and useless fields dropped.
- `process_tdx_basins()` in `examples/4_ProcessBasinToParquet.ipynb` saves a basins GeoParquet file with 'streamID' renamed to 'LINKNO' and then converted to globally unique values.

@ptomasula & @kieranbartels, let's turn these examples into a production pipeline to create the datasets we'll share with @rajadain.
@rajadain, here are examples of the files we plan to deliver to you:
For now, read these files using the `gpd.read_parquet(geoparquet_path)` method, but we can speed up reading the geometry fields 2x by reading them as `pyarrow.Table` objects, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need to have GDAL 3.9 installed, as described in that comment.
@aufdenkampe, thanks for the excellent performance research under #1. For the modified nested set index work under #3, I need a workflow where I write the resultant GeoDataFrame as output. I'm wondering if, in your research, you developed any examples of outputting the data using these various libraries.

When using the `pyogrio.write_dataframe` method, I encountered the error below. I'm not certain how to go about setting this CRS information beyond the `GeoDataFrame.set_crs` method. If I call the `GeoDataFrame.crs` property, the result is indeed the expected CRS, so it seemingly read in the CRS information OK. Would you be able to take a look and advise how to best export the resultant dataframe from the example_3 notebook?
Error Message

```
~Lib/site-packages/pyogrio/raw.py:842: UserWarning: 'crs' was not provided. The output dataset will not have projection information defined and may not be usable in other systems.
  warnings.warn(
```
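For reference, here's a minimal sketch of the export I'm attempting (paths and the EPSG code are placeholders):

```python
# Minimal sketch of the export attempt (paths and EPSG code are
# placeholders for whatever the example_3 notebook actually produces).
import geopandas as gpd
import pyogrio

gdf = gpd.read_parquet("streamnet_mnsi.parquet")

# set_crs assigns (does not reproject) a CRS when none is attached,
# which should keep pyogrio from emitting the 'crs' warning.
if gdf.crs is None:
    gdf = gdf.set_crs(epsg=4326)

pyogrio.write_dataframe(gdf, "streamnet_mnsi.gpkg", driver="GPKG")
```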