Develop Write Example for pyogroio

ptomasula commented 3 months ago

@aufdenkampe thanks for the excellent performance research under #1. For the modified nested set index work under #3, I need to have a workflow where I write the resultant GeoDataFrame as output. I'm wonder if in your research you developed any example of outputting the data using these various libraries.

In using the pyogroio.write_dataframe method, I encountered the error below. I'm not certain how to go about setting this CRS information beyond the GeoDataFrame.set_crs method. If I call the GeoDataFrame.crs property, the results is indeed the expected CRS, so it seeming read in the CRS information ok. Would you be able to take a look and advise how to best export the resultant dataframe from the example_3 notebook?

Error Message ~Lib/site-packages/pyogrio/raw.py:842): UserWarning: 'crs' was not provided. The output dataset will not have projection information defined and may not be usable in other systems. warnings.warn(

aufdenkampe commented 3 months ago

@ptomasula, Thanks for creating this issue and assigning it to me. Indeed, I've already started working on a similar suite of benchmarks for writing the datasets GeoParquet, and my very next TODO for #1 was to complete that work and report out on recommendations. I very much want to get a few test files to @rajadain by the end of the week so he can try working with them before our next meeting.

I'll start on looking at your notebook.

aufdenkampe commented 2 months ago

@ptomasula, @kieranbartels, @rajadain, I've expanded the sandbox/geoarrow_parquet.ipynb notebook from benchmarking Read Method Performance (shared in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779) to now include benchmarking write methods. Here are my first results:

Write Compression Performance

Benchmarks based on reading all fields in the 702.3 MB TDX_streamnet_7020038340_01.gpkg file, and converting to a GeoPandas Dataframe with Shapely geometry using: gpd.read_file(tdx_stream_7020038340_fp, engine='pyogrio', use_arrow=True)

Compression	`gdf.to_parquet()` time	relative write speed	size in bytes	relative size	`gpd.read_parquet()` time	relative read speed
snappy	3.52 s ± 54.8 ms	1.0	300,048,527	2.7	2.22 s ± 43.2 ms	1.1
brotli	53.8 s ± 1.06 s	15.3	109,245,316	1.0	3.26 s ± 40.1 ms	1.6
lz4	3.75 s ± 39.5 ms	1.1	314,102,847	2.9	2.10 s ± 65 ms	1.0
zstd	4.15 s ± 47.8 ms	1.2	186,790,117	1.7	2.37 s ± 16.2	1.1
none	2.72 s ± 37.6 ms	0.8	593,366,150	5.4	2.02 s ± 64.2 ms	1.0

NOTE 1: Writing the GeoDataframe back to GeoPackage takes 11.4 s ± 823 ms usinggdf.to_file(path, driver='GPKG'), producing a file size of 702.1 MB. This should have automatically used the SOZip compression. https://gdal.org/drivers/vector/gpkg.html#compressed-files. So writing to GeoPackage is both much slower and uses more storage than GeoParquet

NOTE 2: Re-reading the GeoParquet only takes 11.3 ms ± 310 µs if we only read the non-geometry columns!

Conclusions

Use zstd compression, because it provides the highest levels of compression (31% the size of non-compressed) with read speeds that are are only 17% slower.

aufdenkampe commented 2 months ago

In sandbox/geoarrow_parquet.ipynb I also explored other write options, such as:

Does using a geoarrow geometry type (vs the default Shapely geometry type) improve write and re-read performance?
Does using geoarrow's methods improve write and re-read performance?

My conclusion to question 1 is that final Parquet files are nearly identical (all using Arrow-like structures) and that the differences are with the in-memory representations of the Geometries.

I've UPDATED my answer to question 2 after further analysis. YES GeoArrow can further improve performance, but primarily if you stay with PyArrow Tables rather than converting to Pandas/GeoPandas data frames. The following use geoarrow.pyarrow.io methods, imported as ga.io.

Compression	`ga.io.write_geoparquet_table()`	relative write speed	size	relative size	`ga.io.read_geoparquet_table()`	relative read speed
zstd	2.03 s ± 24.2 ms	0.6	186,793,666	1.7	1.48 s ± 11.1 ms	0.7

Relative values are relative to the table in the previous comment.

aufdenkampe commented 2 months ago

@rajadain, please use this GeoParquet file for your first round of tests: TDX_streamnet_7020038340_01.parquet.

This file does not contain any of the new "Nested Set" fields that we'll add for watershed delineation. We'll get that to you shortly.

Our documentation in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779 provides general guidance on the read methods. However, we will be further exploring use of geoarrow specifically for the use case for doing geographical calculations.

aufdenkampe commented 2 months ago

Another finding from sandbox/geoarrow_parquet.ipynb:

Downcasting numerical dtypes does not improve performance. Although going from int32 to int8 (etc.) where possible does reduce memory usage, it does not improve write or re-read speeds and barely improves storage size. This is true whether using dtype_backend='pyarrow' or the default 'numpy_nullable'.
Sorting the index (set to LINKNO) increases storage substantially (50%). This was surprising and hard to explain.

aufdenkampe commented 2 months ago

With the addition of examples/4_ProcessBasinToParquet.ipynb in commit cb245a08bef4ac5f07137d5157619dfe7eb06185, we now have functions for processing our original TDX Hydro GeoPackage files and saving them to compressed GeoParquet files. Specifically:

create_tdx_mnsi() in examples/3_GenerateModifiedNestedSetIndex.ipynb saves a streamnet GeoParquet file with Modified Nested Set Index fields, LINKNO fields converted to globally unique values, and useless fields dropped.
process_tdx_basins() in examples/4_ProcessBasinToParquet.ipynb saves a basins GeoParquet file with 'streamID' renamed to 'LINKNO' and then converted to globally unique values.

@ptomasula & @kieranbartels, let's turn these examples into a production pipeline to create the datasets we'll share with @rajadain.

aufdenkampe commented 2 months ago

@rajadain, here are examples of the files we plan to deliver to you:

TDX_streamnet_7020038340_01_mnsi.parquet (178 MB), stream network file with Modified Nested Set Index (MNSI) fields.
- For ingestion into vector tiles
- For calculating upstream LINKs (reaches) using MNSI, by only reading those relevant fields, speeding reads by 200x when omitting the geometry field.
TDX_streamreach_basins_7020038340_01.parquet (645 MB), basins geometry files, at full resolution.
- For identifying the LINKNO selected by a user.

For now, read these files using the gpd.read_parquet(geoparquet_path) method , but we can speed up reading the geometry fields 2x by reading as pyarrow.Tables, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need to have GDAL 3.9 installed, as described in that comment.

WikiWatershed / global-hydrography

Develop Write Example for pyogroio #4

Write Compression Performance

Conclusions