WikiWatershed / global-hydrography

Scripts to explore and process global hydrography (stream lines and basin boundaries) for Model My Watershed
MIT License

Leverage GeoArrow to speed up vector data processing & viz #1

Open aufdenkampe opened 1 month ago

aufdenkampe commented 1 month ago

The Apache Arrow language-independent columnar memory format improves computational and I/O performance so substantially that Pandas 2.0 (and many other libraries) are adopting Arrow as the data backend. PyArrow will be a required dependency of Pandas 3.0, which will use Arrow rather than NumPy by default.

This Arrow revolution is now extending to geospatial encodings with the creation of GeoArrow as an alternative to WKT and WKB (Well Known Text/Binary) for geospatial vector data. The new geospatial visualization library, lonboard, leverages GeoArrow and GeoParquet to substantially speed up and scale up vector data visualization.

GDAL v3.9 released on May 10, 2024 added "read/write support for GeoArrow (struct-based) encoding (GeoParquet 1.1)". pyogrio v0.8.0 released May 6, 2024 introduced "writing based on Arrow as the transfer mechanism of the data from Python to GDAL" and "read_arrow and open_arrow now provide GeoArrow-compliant extension metadata, including the CRS."

The fact that pyogrio only recently enabled full Arrow support explains why my tests using the arrow engine in April didn't yet show the expected speedups.

The full benefits of GeoArrow are not quite yet available to the entire geospatial Python ecosystem. geoarrow-python does not yet have a release that takes advantage of the latest GDAL and pyogrio releases (although the main branch does, with PR#49), and the latest rasterio (1.3.10) is pinned in a way that prevents conda from installing GDAL 3.9 into the environment.

So we either need to wait a few weeks or months to fully benefit from GeoArrow, or we need to install pre-release versions of geoarrow-python and rasterio (as shown here: https://github.com/geoarrow/geoarrow-python?tab=readme-ov-file#installation).

Either way, we want to move toward using GeoArrow as soon as possible, as I suspect it will substantially improve our ability to simplify, dissolve, and visualize geometries for our global rivers and their basins.

cc: @kieranbartels, @ptomasula, @rajadain

rajadain commented 1 month ago

Thanks a lot @aufdenkampe for this write up! It has really helped me understand how these technologies fit together.

I read up on how Lonboard works; it uses Deck.GL at the browser level.

ModelMW uses Leaflet, and it is quite heavily entrenched in the project. Upgrading that to MapLibre or Deck.GL would be a significant effort.

Given that, I don't think we can quite use the GeoArrow + GeoParquet solution for visualizations in ModelMW. However, it is still a good candidate for analysis.

For visualization, our best bet will likely be to generate Vector Tiles and serve them via S3. Leaflet has support for them via the VectorGrid plugin, and can visualize large amounts of data in a much more efficient manner than GeoJSON.

I'll try to create a demo using some of the streams / basins to show what it looks like; we'll have a better sense then.


One of the drawbacks of this approach is storage: we end up with two copies of the dataset, one in Vector Tiles, another in GeoParquet.

Apparently it is technically possible to read data from Vector Tiles for analysis, using libraries such as mapbox-vector-tile and shapely. That would allow us to use one dataset instead of two, but I'm not sure what the runtime performance would be like. I'll have to run a couple of experiments to see.

aufdenkampe commented 2 weeks ago

Conclusions on Read Method Performance

These results depend on using GDAL v3.9 and a dev version of geoarrow-python. For the tests below, our updated environment (6552d34) installed:

  - gdal  =3.9.0
  - pyogrio =0.8.0
  - pyarrow =16.1.0
  - geopandas =0.14.4
  - geoarrow-pyarrow =0.1.3.dev4 # installed from head of dev branch via github

Our previous environment, using gdal 3.8.5 and earlier versions of the packages above, was 2-3x slower!

Benchmarks from commit 06e2d6d are based on reading all fields in the 702.3 MB 'TDX_streamnet_7020038340_01.gpkg' file.

| Function | engine | use_arrow | geoarrow imported | time | relative |
| --- | --- | --- | --- | --- | --- |
| `pyogrio.read_arrow()` | NA | NA | No | 535 ms ± 12.9 ms | 1.0 |
| `pyogrio.read_dataframe()` | NA | True | No | 2.24 s ± 18.8 ms | 4.2 |
| `pyogrio.read_dataframe()` | NA | False | No | 8.54 s ± 4.94 s | 16.0 |
| `gpd.read_file()` | pyogrio | True | No | 1.97 s ± 10.5 ms | 3.7 |
| `gpd.read_file()` | pyogrio | False | No | 3.06 s ± 29.1 ms | 5.7 |
| `gpd.read_file()` | fiona | False | No | 50.3 s ± 1.34 s | 94.0 |
| `pyogrio.read_arrow()` | NA | NA | Yes | 552 ms ± 18.2 ms | 1.0 |
| `pyogrio.read_dataframe()` | NA | True | Yes | 13.7 s ± 391 ms | 25.6 |
| `gpd.read_file()` | pyogrio | True | Yes | 14.1 s ± 1.11 s | 26.4 |

pyogrio.read_arrow() is ~4x faster than the fastest alternative method.

gpd.read_file(fp, engine='pyogrio', use_arrow=True) is the 2nd fastest method, but only before importing GeoArrow.

gpd.read_file() with the default fiona engine is the slowest method, 94x slower than pyogrio.read_arrow() and 16x slower than adding just the engine='pyogrio' argument!

Importing GeoArrow massively slows down pyogrio.read_dataframe() and gpd.read_file(). Read speeds for pyogrio.read_arrow() do not change.

NOTE: