geoarrow / geoarrow-python

Python implementation of the GeoArrow specification
http://geoarrow.org/geoarrow-python/
Apache License 2.0

feat(geoarrow-pyarrow): Add GeoParquet io #34

Closed paleolimbot closed 11 months ago

paleolimbot commented 11 months ago

Adds direct GeoParquet reading and writing to/from GeoArrow extension types (only at the Table level, for now):

import geoarrow.pyarrow.io as io

tab = io.read_pyogrio_table("https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point.fgb.zip")
io.write_geoparquet_table(tab, "test.parquet")
io.read_geoparquet_table("test.parquet")
pyarrow.Table
OBJECTID: int64
FEAT_CODE: string
BASIN_NAME: string
RIVER: string
HID: string
wkb_geometry: extension<geoarrow.wkb<WkbType>>
----
OBJECTID: [[1,2,3,4,5,...,42,43,44,45,46]]
FEAT_CODE: [["WABA30","WABA30","WABA30","WABA30","WABA30",...,"WABA30","WABA30","WABA30","WABA30","WABA30"]]
BASIN_NAME: [["01EB000","01EC000","01EA000","01DA000","01ED000",...,"01FE000","01FB000","01FC000","01FD000","01EQ000"]]
RIVER: [["BARRINGTON/CLYDE","ROSEWAY/SABLE/JORDAN","TUSKET RIVER","METEGHAN","MERSEY",...,"INDIAN","MARGAREE","CHETICAMP RIVER","WRECK COVE","NEW HBR/SALMON"]]
HID: [["..."]]
wkb_geometry: [[...]]
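
For context on what a GeoParquet file carries beyond plain Parquet: the GeoParquet specification stores a JSON document under the `geo` key of the Parquet file metadata, naming the primary geometry column and the encoding of each geometry column. The following is a minimal standard-library sketch of such a document for the table above. The helper name `make_geo_metadata` is hypothetical, the version string is an assumption, and this is not the code added in this PR.

```python
import json


def make_geo_metadata(primary_column, geometry_columns):
    """Build a minimal GeoParquet-style 'geo' metadata document (hypothetical helper)."""
    return {
        # Assumed spec version; the version actually written by this PR is not shown.
        "version": "1.0.0",
        "primary_column": primary_column,
        "columns": {
            name: {
                "encoding": "WKB",
                # An empty list means the geometry types are not known.
                "geometry_types": [],
            }
            for name in geometry_columns
        },
    }


# Column name taken from the schema printed above.
geo = make_geo_metadata("wkb_geometry", ["wkb_geometry"])
# Parquet key/value metadata holds strings, so the document is serialized as JSON.
geo_json = json.dumps(geo)
print(geo_json)
```

A reader that understands this metadata can map the WKB column back to a geometry type, which is roughly what the round trip above relies on.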

codecov[bot] commented 11 months ago

Codecov Report

Merging #34 (649d7a8) into main (ff9771e) will increase coverage by 0.48%. The diff coverage is 99.34%.

@@            Coverage Diff             @@
##             main      #34      +/-   ##
==========================================
+ Coverage   94.59%   95.07%   +0.48%     
==========================================
  Files          10       10              
  Lines        1257     1401     +144     
==========================================
+ Hits         1189     1332     +143     
- Misses         68       69       +1     

Files                                              Coverage Δ
geoarrow-pyarrow/src/geoarrow/pyarrow/_type.py     95.66% <100.00%> (+0.17%) ↑
geoarrow-pyarrow/src/geoarrow/pyarrow/io.py        99.32% <99.25%> (-0.68%) ↓

paleolimbot commented 11 months ago

Substantially faster than geopandas IO (about 3.9x on write and 4.6x on read in the timings below), just because it avoids converting to/from np.array(<shapely>):

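from pyarrow import feather
import geopandas
import geoarrow.pandas as _  # registers the .geoarrow accessor on pandas Series
import geoarrow.pyarrow.io as io

# curl -L "https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.arrow" -o ns-water-water_line.arrow
# Read only the geometry column as a pyarrow Table
tab = feather.read_table("ns-water-water_line.arrow", columns=["geometry"])
# Build the equivalent GeoDataFrame for the geopandas side of the comparison
df = tab.to_pandas()
df.geometry = df.geometry.geoarrow.to_geopandas()
df = geopandas.GeoDataFrame(df)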
%timeit io.write_geoparquet_table(tab, "test.parquet")
365 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit io.read_geoparquet_table("test.parquet")
205 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.to_parquet("test.parquet")
1.42 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit geopandas.read_parquet("test.parquet")
941 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
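
The `%timeit` lines above only work in IPython. To repeat the comparison in a plain script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `time_call` is hypothetical, and the workload shown is a stand-in so the snippet runs without the data file; the defaults mirror the "7 runs, 1 loop each" setting reported above.

```python
import statistics
import timeit


def time_call(fn, repeat=7, number=1):
    """Return (mean, population std. dev.) in seconds per call.

    Runs `fn` `number` times per run, for `repeat` runs.
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Stand-in workload. In the session above one would instead pass, e.g.:
#   lambda: io.write_geoparquet_table(tab, "test.parquet")
mean, std = time_call(lambda: sum(range(100_000)))
print(f"{mean * 1e3:.2f} ms ± {std * 1e3:.2f} ms per loop")
```

Passing a zero-argument callable keeps the helper independent of what is being measured, so the same function can time both the geoarrow and geopandas calls.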