WikiWatershed / global-hydrography

Scripts to explore and process global hydrography (stream lines and basin boundaries) for Model My Watershed
MIT License

Develop Write Example for pyogrio #4

Closed ptomasula closed 2 months ago

ptomasula commented 3 months ago

@aufdenkampe, thanks for the excellent performance research under #1. For the modified nested set index work under #3, I need a workflow that writes the resultant GeoDataFrame as output. I'm wondering whether, in your research, you developed any examples of outputting the data using these various libraries.

When using the pyogrio.write_dataframe method, I encountered the error below. I'm not certain how to set this CRS information beyond the GeoDataFrame.set_crs method. If I call the GeoDataFrame.crs property, the result is indeed the expected CRS, so the CRS information seems to have been read in correctly. Would you be able to take a look and advise how best to export the resultant dataframe from the example_3 notebook?

Error message:

```
~Lib/site-packages/pyogrio/raw.py:842: UserWarning: 'crs' was not provided. The output dataset will not have projection information defined and may not be usable in other systems.
  warnings.warn(
```
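For context, here is a minimal sketch of the write path in question (the file paths and the EPSG code are placeholders, and setting the CRS explicitly before writing is a suggested workaround, not the notebook's actual code):

```python
# Minimal sketch of writing a GeoDataFrame with pyogrio; paths and the
# EPSG code are placeholders, not the notebook's actual values.
import geopandas as gpd
import pyogrio

gdf = gpd.read_file(
    "TDX_streamnet_7020038340_01.gpkg", engine="pyogrio", use_arrow=True
)

# If the CRS was lost somewhere in the workflow, set it explicitly before
# writing so pyogrio can embed projection information in the output file.
if gdf.crs is None:
    gdf = gdf.set_crs("EPSG:4326")  # assumed CRS; use the dataset's actual CRS

pyogrio.write_dataframe(gdf, "streams_out.gpkg", driver="GPKG")
```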

aufdenkampe commented 3 months ago

@ptomasula, thanks for creating this issue and assigning it to me. Indeed, I've already started working on a similar suite of benchmarks for writing the datasets to GeoParquet, and my very next TODO for #1 was to complete that work and report recommendations. I very much want to get a few test files to @rajadain by the end of the week so he can try working with them before our next meeting.

I'll start by looking at your notebook.

aufdenkampe commented 2 months ago

@ptomasula, @kieranbartels, @rajadain, I've expanded the sandbox/geoarrow_parquet.ipynb notebook from benchmarking Read Method Performance (shared in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779) to now include benchmarking write methods. Here are my first results:

Write Compression Performance

Benchmarks are based on reading all fields in the 702.3 MB TDX_streamnet_7020038340_01.gpkg file and converting to a GeoPandas GeoDataFrame with Shapely geometries using: gpd.read_file(tdx_stream_7020038340_fp, engine='pyogrio', use_arrow=True)

| Compression | gdf.to_parquet() time | relative write speed | size in bytes | relative size | gpd.read_parquet() time | relative read speed |
|---|---|---|---|---|---|---|
| snappy | 3.52 s ± 54.8 ms | 1.0 | 300,048,527 | 2.7 | 2.22 s ± 43.2 ms | 1.1 |
| brotli | 53.8 s ± 1.06 s | 15.3 | 109,245,316 | 1.0 | 3.26 s ± 40.1 ms | 1.6 |
| lz4 | 3.75 s ± 39.5 ms | 1.1 | 314,102,847 | 2.9 | 2.10 s ± 65 ms | 1.0 |
| zstd | 4.15 s ± 47.8 ms | 1.2 | 186,790,117 | 1.7 | 2.37 s ± 16.2 ms | 1.1 |
| none | 2.72 s ± 37.6 ms | 0.8 | 593,366,150 | 5.4 | 2.02 s ± 64.2 ms | 1.0 |
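A minimal sketch of how these timings can be reproduced (output file names are illustrative; the GeoDataFrame is loaded exactly as described above):

```python
# Sketch of the compression benchmark; output file names are illustrative.
import geopandas as gpd

# Load once, as described above.
gdf = gpd.read_file(
    "TDX_streamnet_7020038340_01.gpkg", engine="pyogrio", use_arrow=True
)

for compression in ["snappy", "brotli", "lz4", "zstd", None]:
    out_path = f"TDX_streamnet_7020038340_01_{compression}.parquet"
    gdf.to_parquet(out_path, compression=compression)  # timed for the write columns
    gdf_back = gpd.read_parquet(out_path)  # timed for the read columns
```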

NOTE 1: Writing the GeoDataFrame back to GeoPackage takes 11.4 s ± 823 ms using gdf.to_file(path, driver='GPKG'), producing a file size of 702.1 MB. This should have automatically used SOZip compression (https://gdal.org/drivers/vector/gpkg.html#compressed-files). So writing to GeoPackage is both much slower and uses more storage than GeoParquet.

NOTE 2: Re-reading the GeoParquet only takes 11.3 ms ± 310 µs if we only read the non-geometry columns!
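A sketch of that attribute-only read (the column names are assumptions, not the confirmed TDX Hydro schema):

```python
# Read only non-geometry columns via pandas, which skips decoding the large
# geometry column entirely. Column names here are assumptions about the schema.
import pandas as pd

attrs = pd.read_parquet(
    "TDX_streamnet_7020038340_01_zstd.parquet",
    columns=["LINKNO", "DSLINKNO"],
)
```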

Conclusions

Use zstd compression, because it provides a high level of compression (31% of the uncompressed size) with read speeds that are only about 17% slower.

aufdenkampe commented 2 months ago

In sandbox/geoarrow_parquet.ipynb I also explored other write options, such as:

  1. Does using a geoarrow geometry type (vs the default Shapely geometry type) improve write and re-read performance?
  2. Does using geoarrow's methods improve write and re-read performance?

My conclusion to question 1 is that the final Parquet files are nearly identical (all use Arrow-like structures) and that the differences lie in the in-memory representations of the geometries.

I've UPDATED my answer to question 2 after further analysis: YES, GeoArrow can further improve performance, but primarily if you stay with PyArrow Tables rather than converting to Pandas/GeoPandas data frames. The following uses the geoarrow.pyarrow.io methods, imported as ga.io.

| Compression | ga.io.write_geoparquet_table() time | relative write speed | size | relative size | ga.io.read_geoparquet_table() time | relative read speed |
|---|---|---|---|---|---|---|
| zstd | 2.03 s ± 24.2 ms | 0.6 | 186,793,666 | 1.7 | 1.48 s ± 11.1 ms | 0.7 |

Relative values are measured against the corresponding columns of the table in the previous comment.
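A minimal sketch of that PyArrow-Table workflow, using the geoarrow.pyarrow.io functions named above (the compression keyword is an assumption about their signature, not a confirmed option):

```python
# Stay in pyarrow.Table form end-to-end; ga_io corresponds to the ga.io
# alias used in the notebook.
import geoarrow.pyarrow.io as ga_io

table = ga_io.read_geoparquet_table("TDX_streamnet_7020038340_01.parquet")

# The compression keyword below is an assumed option, not a confirmed signature.
ga_io.write_geoparquet_table(
    table, "TDX_streamnet_7020038340_01_zstd.parquet", compression="zstd"
)
```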

aufdenkampe commented 2 months ago

@rajadain, please use this GeoParquet file for your first round of tests: TDX_streamnet_7020038340_01.parquet.

This file does not contain any of the new "Nested Set" fields that we'll add for watershed delineation. We'll get that to you shortly.

Our documentation in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779 provides general guidance on the read methods. However, we will be further exploring geoarrow specifically for the use case of performing geographic calculations.

aufdenkampe commented 2 months ago

Another finding from sandbox/geoarrow_parquet.ipynb:

aufdenkampe commented 2 months ago

With the addition of examples/4_ProcessBasinToParquet.ipynb in commit cb245a08bef4ac5f07137d5157619dfe7eb06185, we now have functions for processing our original TDX Hydro GeoPackage files and saving them to compressed GeoParquet files. Specifically:

@ptomasula & @kieranbartels, let's turn these examples into a production pipeline to create the datasets we'll share with @rajadain.

aufdenkampe commented 2 months ago

@rajadain, here are examples of the files we plan to deliver to you:

For now, read these files using the gpd.read_parquet(geoparquet_path) method, but we can speed up reading the geometry fields 2x by reading them as pyarrow.Tables, as described in https://github.com/WikiWatershed/global-hydrography/issues/1#issuecomment-2184108779. Note that either way, you need GDAL 3.9 installed, as described in that comment.
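A minimal sketch of both read paths (the file name is the example file from this thread):

```python
# Option 1: read directly into a GeoDataFrame.
import geopandas as gpd

gdf = gpd.read_parquet("TDX_streamnet_7020038340_01.parquet")

# Option 2: read into a pyarrow.Table, roughly 2x faster for the geometry
# fields per the benchmarks linked above.
import pyarrow.parquet as pq

table = pq.read_table("TDX_streamnet_7020038340_01.parquet")
```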