WikiWatershed / global-hydrography

Scripts to explore and process global hydrography (stream lines and basin boundaries) for Model My Watershed
MIT License

Leverage GeoArrow to speed up vector data processing & viz #1

Open aufdenkampe opened 1 month ago

aufdenkampe commented 1 month ago

The Apache Arrow language-independent columnar memory format improves computational and I/O performance so substantially that Pandas 2.0 (and many other libraries) are adopting Arrow as the data backend. PyArrow will be a required dependency of Pandas 3.0, which will use Arrow rather than NumPy by default.

This Arrow revolution is now extending to geospatial encodings with the creation of GeoArrow as an alternative to WKT and WKB (Well Known Text/Binary) for geospatial vector data. The new geospatial visualization library, lonboard, leverages GeoArrow and GeoParquet to substantially speed up and scale up vector data visualization.

GDAL v3.9 released on May 10, 2024 added "read/write support for GeoArrow (struct-based) encoding (GeoParquet 1.1)". pyogrio v0.8.0 released May 6, 2024 introduced "writing based on Arrow as the transfer mechanism of the data from Python to GDAL" and "read_arrow and open_arrow now provide GeoArrow-compliant extension metadata, including the CRS."

The fact that pyogrio only recently enabled full Arrow support explains why my tests using the arrow engine in April didn't yet show the expected speedups.

The full benefits of GeoArrow are not quite yet available to the entire geospatial Python ecosystem. geoarrow-python does not yet have a release that takes advantage of the latest GDAL and pyogrio releases (although the main branch does, with PR#49), and the latest rasterio (1.3.10) is pinned in a way that prevents conda from installing GDAL 3.9 into the environment.

So we either need to wait a few weeks or months to fully benefit from GeoArrow, or we need to install pre-release versions of geoarrow-python and rasterio (as shown here: https://github.com/geoarrow/geoarrow-python?tab=readme-ov-file#installation).

Either way, we want to move toward using GeoArrow as soon as possible, as I suspect it will substantially improve our ability to simplify, dissolve, and visualize geometries for our global rivers and their basins.

cc: @kieranbartels, @ptomasula, @rajadain

rajadain commented 1 month ago

Thanks a lot @aufdenkampe for this write up! It has really helped me understand how these technologies fit together.

I read up on how Lonboard works; it uses Deck.GL at the browser level.

ModelMW uses Leaflet, and it is quite heavily entrenched in the project. Upgrading that to MapLibre or Deck.GL would be a significant effort.

Given that, I don't think we can quite use the GeoArrow + GeoParquet solution for visualizations in ModelMW. However, it is still a good candidate for analysis.

For visualization, our best bet will likely be to generate Vector Tiles and serve them via S3. Leaflet has support for them via the VectorGrid plugin, and can visualize large amounts of data in a much more efficient manner than GeoJSON.

I'll try to create a demo using some of the streams / basins to show what it looks like; we'll have a better sense then.


One of the drawbacks of this approach is storage: we end up with two copies of the dataset, one in Vector Tiles, another in GeoParquet.

Apparently it is technically possible to read data from Vector Tiles for analysis, using libraries such as mapbox-vector-tile and shapely. That would allow us to use one dataset instead of two, but I'm not sure what the runtime performance would be like. I'll have to run a couple of experiments to see.

aufdenkampe commented 2 weeks ago

Conclusions on Read Method Performance

These results depend on using GDAL v3.9 and a dev version of geoarrow-python. For the tests below, our updated environment (6552d34) installed:

  - gdal  =3.9.0
  - pyogrio =0.8.0
  - pyarrow =16.1.0
  - geopandas =0.14.4
  - geoarrow-pyarrow =0.1.3.dev4 # installed from head of dev branch via github

Our previous environment, using gdal 3.8.5 and earlier versions of the packages above, was 2-3x slower!

Benchmarks from commit 06e2d6d are based on reading all fields in the 702.3 MB 'TDX_streamnet_7020038340_01.gpkg' file.

| Function | engine | use_arrow | geoarrow imported | time | relative |
| --- | --- | --- | --- | --- | --- |
| `pyogrio.read_arrow()` | NA | NA | No | 535 ms ± 12.9 ms | 1.0 |
| `pyogrio.read_dataframe()` | NA | True | No | 2.24 s ± 18.8 ms | 4.2 |
| `pyogrio.read_dataframe()` | NA | False | No | 8.54 s ± 4.94 s | 16.0 |
| `gpd.read_file()` | pyogrio | True | No | 1.97 s ± 10.5 ms | 3.7 |
| `gpd.read_file()` | pyogrio | False | No | 3.06 s ± 29.1 ms | 5.7 |
| `gpd.read_file()` | fiona | False | No | 50.3 s ± 1.34 s | 94.0 |
| `pyogrio.read_arrow()` | NA | NA | Yes | 552 ms ± 18.2 ms | 1.0 |
| `pyogrio.read_dataframe()` | NA | True | Yes | 13.7 s ± 391 ms | 25.6 |
| `gpd.read_file()` | pyogrio | True | Yes | 14.1 s ± 1.11 s | 26.4 |

pyogrio.read_arrow() is ~4x faster than the fastest alternative method.

gpd.read_file(fp, engine='pyogrio', use_arrow=True) is the 2nd fastest method, but only before importing GeoArrow.

gpd.read_file() with the default fiona engine is the slowest method, 94x slower than pyogrio.read_arrow() and 16x slower than adding just the engine='pyogrio' argument!

Importing GeoArrow massively slows down pyogrio.read_dataframe() and gpd.read_file(). Read speeds for pyogrio.read_arrow() do not change.

NOTE: