Open techmavengeospatial opened 2 months ago
Which driver specifically? I think it would be reasonable to map the PostGIS types to GeoArrow types in the PostgreSQL driver, at least.
For PostGIS, we are reasonably close to being able to export geometry column with the "geoarrow.wkb" extension type. The main barrier that we have is that Postgres embeds the spatial reference identifier as an integer key at the row level, and GeoArrow embeds the coordinate reference system as a string at the type level.
Various tools work around this differently...GeoPandas, for example, pulls the SRID out of the first element, assumes that is an EPSG code, and assumes that all following values have the same SRID. This is frequently the case; however, I would like to do something safer by default at the driver level. There are a few options:
In PostGIS, the values themselves are emitted as EKWB, which is currently "valid" for the geoarrow.wkb
extension type (although we prefer ISO WKB). Reencoding EWKB as ISO WKB is probably not worth it since most readers can handle both.
For SQLite, the type we're dealing with is probably GeoPackage's GEOMETRY (as reported by the decltype
, which I don't think we inspect at all at the moment). This has a similar setup to PostGIS: the GEOMETRY column has an embeded item-level SRID but all elements in a single column have the same value whose textual representation is embedded in a dedicated table. The BLOB value that comes through from GeoPackage is prefixed WKB ( https://www.geopackage.org/spec/#gpb_data_blob_format ) which can be stripped before appending to the output ( https://github.com/paleolimbot/gpkg/blob/4d18da00b753f79b225d0e3ac7dac4e517733b06/src/gpkg.cpp#L196-L232 ).
For SQLite, there is also Spatialite, which has its own encoding ( https://www.gaia-gis.it/gaia-sins/BLOB-Geometry.html ). Annoyingly, I think this also has a decltype of GEOMETRY.
I'll have to double check for snowflake and bigquery, but I believe that snowflake emits GeoJSON with item-level SRID and BigQuery emits well-known text. BigQuery only deals with global "geographies", so item-level SRID is not a problem (i.e. it's always OGC:CRS84).
All of this is possible to stick in ADBC, but there's enough geometry-specific formatting that it might be worth figuring out how to make a wrapper driver (or make it easy to make a wrapper driver!) so that the geometry-specific details could live elsewhere. Also useful would be a driver embedded into GDAL (since GDAL already implements all of this logic, including export to Arrow and import from Arrow).
An example of a current workaround for getting PostGIS to GeoArrow IPC in R (after a request for an example on Mastadon!):
# docker run -it --rm \
# -e POSTGRES_PASSWORD=password \
# -e POSTGRES_USER=postgres \
# -p 5432:5432 \
# postgis/postgis:latest
# ogr2ogr \
# -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" \
# /vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.gpkg \
# -nln basin_point
library(adbcdrivermanager)
library(arrow, warn.conflicts = FALSE)
library(geoarrow)
uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- db |>
adbc_connection_init()
# Have to look up the CRS separately, unfortunately
crs_info <- con |>
read_adbc(
paste(
"SELECT geometry_columns.f_table_name AS tbl, spatial_ref_sys.srtext",
"AS crs FROM geometry_columns INNER JOIN spatial_ref_sys ON ",
"geometry_columns.srid = spatial_ref_sys.srid",
"WHERE geometry_columns.f_table_name = 'basin_point'"
)
) |>
tibble::as_tibble() |>
print()
#> # A tibble: 1 × 2
#> tbl crs
#> <chr> <chr>
#> 1 basin_point "COMPD_CS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2…
# Read into an arrow::Table
tbl <- con |>
read_adbc("SELECT * FROM basin_point") |>
as_arrow_table()
# Convert the geometry column to geoarrow (probably zero copy although I forget the details)
# Explicitly attaching the a CRS here doesn't seem to work but should.
# You can also pick another geoarrow encoding here (e.g., geoarrow_point()).
vctr <- as_geoarrow_vctr(tbl$geom, schema = geoarrow_wkb())
attr(vctr, "schema") <- nanoarrow::as_nanoarrow_schema(geoarrow_wkb(crs = crs_info$crs))
tbl$geom <- vctr
# Write IPC
out <- tempfile()
arrow::write_ipc_stream(tbl, out)
# Should be able to read with crs
df <- nanoarrow::read_nanoarrow(out) |>
tibble::as_tibble()
df$geom <- sf::st_as_sfc(df$geom)
sf::st_as_sf(df)
#> Simple feature collection with 46 features and 6 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 245206.7 ymin: 4820886 xmax: 739542 ymax: 5197865
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#> # A tibble: 46 × 7
#> fid objectid feat_code basin_name river hid geom
#> <int> <dbl> <chr> <chr> <chr> <chr> <POINT [m]>
#> 1 1 1 WABA30 01EB000 BARRINGT… 9192… (277022.6 4820886)
#> 2 2 2 WABA30 01EC000 ROSEWAY/… 5293… (315701.1 4855050)
#> 3 3 3 WABA30 01EA000 TUSKET R… A759… (255728.5 4851021)
#> 4 4 4 WABA30 01DA000 METEGHAN 47EF… (245206.7 4895608)
#> 5 5 5 WABA30 01ED000 MERSEY 425C… (337143 4860311)
#> 6 6 6 WABA30 01EE000 HERRING … C7BF… (370526.6 4896768)
#> 7 7 7 WABA30 01EG000 GOLD 7C6E… (396155.5 4928849)
#> 8 8 8 WABA30 01EF000 LAHAVE B3D6… (380065.5 4918188)
#> 9 9 9 WABA30 01EJ000 SACKVILLE 4FA4… (449560.5 4927417)
#> 10 10 10 WABA30 01EH000 EAST/IND… 72FE… (417856.9 4932181)
What feature or improvement would you like to see?
Is GeoArrow Supported?