apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
382 stars 95 forks source link

GeoArrow Support? #2098

Open techmavengeospatial opened 2 months ago

techmavengeospatial commented 2 months ago

What feature or improvement would you like to see?

Is GeoArrow Supported?

lidavidm commented 2 months ago

Which driver specifically? I think it would be reasonable to map the PostGIS types to GeoArrow types in the PostgreSQL driver, at least.

paleolimbot commented 1 month ago

For PostGIS, we are reasonably close to being able to export geometry column with the "geoarrow.wkb" extension type. The main barrier that we have is that Postgres embeds the spatial reference identifier as an integer key at the row level, and GeoArrow embeds the coordinate reference system as a string at the type level.

Various tools work around this differently...GeoPandas, for example, pulls the SRID out of the first element, assumes that is an EPSG code, and assumes that all following values have the same SRID. This is frequently the case; however, I would like to do something safer by default at the driver level. There are a few options:

In PostGIS, the values themselves are emitted as EKWB, which is currently "valid" for the geoarrow.wkb extension type (although we prefer ISO WKB). Reencoding EWKB as ISO WKB is probably not worth it since most readers can handle both.

For SQLite, the type we're dealing with is probably GeoPackage's GEOMETRY (as reported by the decltype, which I don't think we inspect at all at the moment). This has a similar setup to PostGIS: the GEOMETRY column has an embeded item-level SRID but all elements in a single column have the same value whose textual representation is embedded in a dedicated table. The BLOB value that comes through from GeoPackage is prefixed WKB ( https://www.geopackage.org/spec/#gpb_data_blob_format ) which can be stripped before appending to the output ( https://github.com/paleolimbot/gpkg/blob/4d18da00b753f79b225d0e3ac7dac4e517733b06/src/gpkg.cpp#L196-L232 ).

For SQLite, there is also Spatialite, which has its own encoding ( https://www.gaia-gis.it/gaia-sins/BLOB-Geometry.html ). Annoyingly, I think this also has a decltype of GEOMETRY.

I'll have to double check for snowflake and bigquery, but I believe that snowflake emits GeoJSON with item-level SRID and BigQuery emits well-known text. BigQuery only deals with global "geographies", so item-level SRID is not a problem (i.e. it's always OGC:CRS84).

All of this is possible to stick in ADBC, but there's enough geometry-specific formatting that it might be worth figuring out how to make a wrapper driver (or make it easy to make a wrapper driver!) so that the geometry-specific details could live elsewhere. Also useful would be a driver embedded into GDAL (since GDAL already implements all of this logic, including export to Arrow and import from Arrow).

paleolimbot commented 1 month ago

An example of a current workaround for getting PostGIS to GeoArrow IPC in R (after a request for an example on Mastadon!):

# docker run -it --rm \
#   -e POSTGRES_PASSWORD=password \
#   -e POSTGRES_USER=postgres \
#   -p 5432:5432 \
#   postgis/postgis:latest
# ogr2ogr \
#   -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" \
#   /vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.gpkg \
#   -nln basin_point

library(adbcdrivermanager)
library(arrow, warn.conflicts = FALSE)
library(geoarrow)

uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- db |> 
  adbc_connection_init()

# Have to look up the CRS separately, unfortunately
crs_info <- con |> 
  read_adbc(
    paste(
      "SELECT geometry_columns.f_table_name AS tbl, spatial_ref_sys.srtext",
      "AS crs FROM geometry_columns INNER JOIN spatial_ref_sys ON ", 
      "geometry_columns.srid = spatial_ref_sys.srid",
      "WHERE geometry_columns.f_table_name = 'basin_point'"
    )
  ) |> 
  tibble::as_tibble() |> 
  print()
#> # A tibble: 1 × 2
#>   tbl         crs                                                               
#>   <chr>       <chr>                                                             
#> 1 basin_point "COMPD_CS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2…

# Read into an arrow::Table
tbl <- con |> 
  read_adbc("SELECT * FROM basin_point") |> 
  as_arrow_table()

# Convert the geometry column to geoarrow (probably zero copy although I forget the details)
# Explicitly attaching the a CRS here doesn't seem to work but should.
# You can also pick another geoarrow encoding here (e.g., geoarrow_point()).
vctr <- as_geoarrow_vctr(tbl$geom, schema = geoarrow_wkb())
attr(vctr, "schema") <- nanoarrow::as_nanoarrow_schema(geoarrow_wkb(crs = crs_info$crs))
tbl$geom <- vctr

# Write IPC
out <- tempfile()
arrow::write_ipc_stream(tbl, out)

# Should be able to read with crs
df <- nanoarrow::read_nanoarrow(out) |> 
  tibble::as_tibble()
df$geom <- sf::st_as_sfc(df$geom)
sf::st_as_sf(df)
#> Simple feature collection with 46 features and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 245206.7 ymin: 4820886 xmax: 739542 ymax: 5197865
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#> # A tibble: 46 × 7
#>      fid objectid feat_code basin_name river     hid                 geom
#>    <int>    <dbl> <chr>     <chr>      <chr>     <chr>        <POINT [m]>
#>  1     1        1 WABA30    01EB000    BARRINGT… 9192… (277022.6 4820886)
#>  2     2        2 WABA30    01EC000    ROSEWAY/… 5293… (315701.1 4855050)
#>  3     3        3 WABA30    01EA000    TUSKET R… A759… (255728.5 4851021)
#>  4     4        4 WABA30    01DA000    METEGHAN  47EF… (245206.7 4895608)
#>  5     5        5 WABA30    01ED000    MERSEY    425C…   (337143 4860311)
#>  6     6        6 WABA30    01EE000    HERRING … C7BF… (370526.6 4896768)
#>  7     7        7 WABA30    01EG000    GOLD      7C6E… (396155.5 4928849)
#>  8     8        8 WABA30    01EF000    LAHAVE    B3D6… (380065.5 4918188)
#>  9     9        9 WABA30    01EJ000    SACKVILLE 4FA4… (449560.5 4927417)
#> 10    10       10 WABA30    01EH000    EAST/IND… 72FE… (417856.9 4932181)