kylebarron opened this issue 1 year ago

I'm wondering if anyone has given any thought to how a user might extend ADBC's default types. In particular, I'm curious if it would be possible to add support for PostGIS's geospatial types (via the geoarrow encoding). I'm guessing these sorts of extension types wouldn't make sense to declare as part of arrow-adbc itself; would there maybe be a way to inject other type handling into the `TypeMapping`?
Do you happen to know how a PostGIS geometry arrives over the wire via `COPY`? (I suppose one could open a debugger and check... I think it might be the bytes of the GSERIALIZED version of the geometry.) I obviously have a vested interest in making this a thing (also for SQLite + GeoPackage, which has a similar issue where the way the geometry arrives by default is not quite how we'd want it represented in query output).
One approach would be to fork the Postgres and SQLite drivers in the geoarrow org and re-package them with geometry support.
> Do you happen to know how a PostGIS geometry arrives over the wire via `COPY`?
I don't. That indeed seems like the first thing to establish. I thought that it serialized to WKB by default but maybe that happens on the client and not on the wire itself.
> One approach would be to fork the Postgres and SQLite drivers in the geoarrow org and re-package them with geometry support.
Would that work well with ADBC? Forking the Postgres and SQLite drivers seems like a big undertaking. I was imagining/hoping it would be possible to have some small function that just "plugs in" and converts to geoarrow without a large fork.
I agree that would be ideal, although I suspect (but don't know) that extensibility isn't likely to be a target for a bit, and so I wonder if the first target should be a workaround of some kind.
Another workaround would be to write a driver-that-wraps-a-driver... basically make a driver wrapper that passes along all the method calls but wraps the output `ArrowArrayStream` with an implementation that parses the raw output, releases the source array, and replaces it with the parsed version. Not ideal, because there's one more copy than is theoretically necessary, but it would at least get everybody's workflow sorted (e.g., use the ADBC PostGIS driver and get geoarrow output!).
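Very roughly, the shape of that idea from R with nanoarrow might look like the sketch below (eager, so it makes at least the extra copy mentioned above; `parse_geometry_column()` is a hypothetical stand-in for the actual blob-to-geoarrow conversion):

```r
library(nanoarrow)

# Eager sketch of the stream-wrapping idea: drain the driver's stream,
# rewrite the geometry column in each batch, and re-export a new stream.
# A real wrapper driver would do this lazily, batch by batch, at the C level.
wrap_geo_stream <- function(stream, geom_name, parse_geometry_column) {
  batches <- collect_array_stream(stream)
  converted <- lapply(batches, function(batch) {
    df <- as.data.frame(batch)
    # parse_geometry_column() is hypothetical (e.g., EWKB blobs -> geoarrow)
    df[[geom_name]] <- parse_geometry_column(df[[geom_name]])
    as_nanoarrow_array(df)
  })
  basic_array_stream(converted)
}
```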
> Another workaround would be to write a driver-that-wraps-a-driver... basically make a driver wrapper that passes along all the method calls but wraps the output `ArrowArrayStream` with an implementation that parses the raw output, releases the source array, and replaces it with the parsed version
That sounds ideal to me to get started. I guess the first thing to check is what the current ADBC driver behavior is with PostGIS types. Would/could it return a `BinaryArray` with whatever blobs are returned from the PostGIS wire format, which a wrapper could convert into geoarrow? Based on the type support doc I was guessing it would fail on non-default types, but I haven't tested this yet.
An easier initial target is probably GeoPackage + SQLite... those values almost certainly come through as binary, and I've already written the bit that strips the GeoPackage header off the front of the WKB ( https://github.com/paleolimbot/gpkg/blob/master/src/gpkg.cpp#L196-L232 ). Then the same logic could be applied to the Postgres driver. I imagine that both drivers will need some updates to pass on more information about column types as schema metadata.
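Condensed, that header-stripping logic is small. A sketch of it in R, assuming well-formed version 0 GeoPackage headers (the linked C++ handles the edge cases more carefully):

```r
# Strip the GeoPackageBinary header from a blob, returning the WKB payload.
# Assumes a well-formed version 0 header; real code should also check the
# empty-geometry and extended-type flag bits.
strip_gpkg_header <- function(blob) {
  stopifnot(rawToChar(blob[1:2]) == "GP")  # magic bytes
  flags <- as.integer(blob[4])
  envelope_code <- bitwAnd(bitwShiftR(flags, 1L), 7L)
  # Envelope size in bytes for codes 0:4 (none, xy, xyz, xym, xyzm)
  envelope_bytes <- c(0L, 32L, 48L, 48L, 64L)[envelope_code + 1L]
  # Fixed header is 8 bytes: magic (2) + version (1) + flags (1) + srid (4)
  blob[-seq_len(8L + envelope_bytes)]
}
```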
Right now it should fail.
My thought is
After #636 this should work out-of-the-box. It looks like geometry columns are sent across the wire as EWKB:
```r
# docker run -it --rm \
# -e POSTGRES_PASSWORD=password \
# -e POSTGRES_USER=postgres \
# -p 5432:5432 \
# postgis/postgis:latest
# ogr2ogr \
# -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" \
# /vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.gpkg \
# -nln basin_point
# [libpq] Column #7 ("geom") has unknown type code 18011
library(adbcdrivermanager)
uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- adbc_connection_init(db)
stmt <- adbc_statement_init(con)
stream <- nanoarrow::nanoarrow_allocate_array_stream()
adbc_statement_set_sql_query(stmt, "SELECT * FROM basin_point")
adbc_statement_execute_query(stmt, stream)
#> [1] -1
(result <- tibble::as_tibble(stream))
#> # A tibble: 46 × 7
#> fid objectid feat_code basin_name river hid geom
#> <int> <dbl> <chr> <chr> <chr> <chr> <blob>
#> 1 1 1 WABA30 01EB000 BARRINGTON/CLYDE 919201D6… <raw 25 B>
#> 2 2 2 WABA30 01EC000 ROSEWAY/SABLE/JORDAN 5293753C… <raw 25 B>
#> 3 3 3 WABA30 01EA000 TUSKET RIVER A7592A93… <raw 25 B>
#> 4 4 4 WABA30 01DA000 METEGHAN 47EF929A… <raw 25 B>
#> 5 5 5 WABA30 01ED000 MERSEY 425CA3DB… <raw 25 B>
#> 6 6 6 WABA30 01EE000 HERRING COVE/MEDWAY C7BF151C… <raw 25 B>
#> 7 7 7 WABA30 01EG000 GOLD 7C6E0383… <raw 25 B>
#> 8 8 8 WABA30 01EF000 LAHAVE B3D68107… <raw 25 B>
#> 9 9 9 WABA30 01EJ000 SACKVILLE 4FA406BD… <raw 25 B>
#> 10 10 10 WABA30 01EH000 EAST/INDIAN RIVER 72FEA676… <raw 25 B>
#> # ℹ 36 more rows
wk::as_wkb(result$geom)
#> <wk_wkb[46]>
#> [1] <SRID=900914;POINT (277022.6 4820886)>
#> [2] <SRID=900914;POINT (315701.1 4855050)>
#> [3] <SRID=900914;POINT (255728.5 4851021)>
#> [4] <SRID=900914;POINT (245206.7 4895608)>
#> [5] <SRID=900914;POINT (337143 4860311)>
#> [6] <SRID=900914;POINT (370526.6 4896768)>
#> [7] <SRID=900914;POINT (396155.5 4928849)>
#> [8] <SRID=900914;POINT (380065.5 4918188)>
#> [9] <SRID=900914;POINT (449560.5 4927417)>
#> [10] <SRID=900914;POINT (417856.9 4932181)>
#> [11] <SRID=900914;POINT (266163.3 4950370)>
#> [12] <SRID=900914;POINT (512409.7 4958059)>
#> [13] <SRID=900914;POINT (411009.1 4970316)>
#> [14] <SRID=900914;POINT (486375.9 4946504)>
#> [15] <SRID=900914;POINT (578532 4975507)>
#> [16] <SRID=900914;POINT (528643.6 4989779)>
#> [17] <SRID=900914;POINT (430443.1 5003879)>
#> [18] <SRID=900914;POINT (384152.3 4987549)>
#> [19] <SRID=900914;POINT (598568 5004793)>
#> [20] <SRID=900914;POINT (468882.8 4991420)>
#> [21] <SRID=900914;POINT (664220.6 4997062)>
#> [22] <SRID=900914;POINT (316224.4 4988672)>
#> [23] <SRID=900914;POINT (566212 5021547)>
#> [24] <SRID=900914;POINT (432768.6 5030333)>
#> [25] <SRID=900914;POINT (478632.6 5029750)>
#> [26] <SRID=900914;POINT (625039.5 5038466)>
#> [27] <SRID=900914;POINT (357606 5039295)>
#> [28] <SRID=900914;POINT (526828 5044274)>
#> [29] <SRID=900914;POINT (667060.1 5053682)>
#> [30] <SRID=900914;POINT (499160.7 5068941)>
#> [31] <SRID=900914;POINT (449106.9 5068988)>
#> [32] <SRID=900914;POINT (557031.5 5066144)>
#> [33] <SRID=900914;POINT (576642 5050958)>
#> [34] <SRID=900914;POINT (406967.3 5085306)>
#> [35] <SRID=900914;POINT (606687.9 5063043)>
#> [36] <SRID=900914;POINT (418753.5 5086491)>
#> [37] <SRID=900914;POINT (653615.8 5083572)>
#> [38] <SRID=900914;POINT (395483.2 5079346)>
#> [39] <SRID=900914;POINT (668426.1 5118705)>
#> [40] <SRID=900914;POINT (614040.2 5097386)>
#> [41] <SRID=900914;POINT (720114.2 5094879)>
#> [42] <SRID=900914;POINT (697678 5137499)>
#> [43] <SRID=900914;POINT (644382.4 5130242)>
#> [44] <SRID=900914;POINT (663219.1 5197865)>
#> [45] <SRID=900914;POINT (701578 5185235)>
#> [46] <SRID=900914;POINT (739542 4873410)>
```
Created on 2023-05-03 with reprex v2.0.2
FWIW, I did learn in trying this out that PostGIS has a custom send/recv definition. Geoarrow could define a "geoarrow.ewkb" extension type to accommodate this... in that case, the driver would just have to attach an extension name, which is probably doable.
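To illustrate, "attach an extension name" only amounts to setting the standard Arrow extension metadata key on the field's schema. A sketch with nanoarrow from R (the "geoarrow.ewkb" name itself is the hypothetical part):

```r
library(nanoarrow)

# Sketch: tag a binary field with a hypothetical "geoarrow.ewkb" extension
# name via the standard Arrow extension metadata key
schema <- na_binary()
schema <- nanoarrow_schema_modify(
  schema,
  list(metadata = list("ARROW:extension:name" = "geoarrow.ewkb"))
)
schema$metadata
```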
Reading through this again... this is really cool and I want to try this first-hand soon. Maybe we should move discussion to the geoarrow repo, but I wonder here how we should handle the `SRID=900914`. Should the onus be on the user to also query the `spatial_ref_sys` table so that it knows what the `SRID` defines? Otherwise the user doesn't have access to the same SRID table.
It's a good point... going from value-level SRID to column-level SRID is hard and would require either:

1. Collecting some rows up front in `AdbcStatementExecuteQuery()` and using the first values to query the SRID table and add it to the type before populating the result `ArrowArrayStream` (see the sketch below). This is what the SQLite driver does.
2. Using the column-origin information the server provides: the `PQresult` would know what table and what column index the values came from (via `PQftable()` and `PQftablecol()`). In theory one could construct a query in `AdbcStatementExecuteQuery()` before populating the `ArrowArrayStream` result that fills in the CRS. SQLite similarly has options for this ( https://www.sqlite.org/c3ref/column_database_name.html ).

The first option could be done in a wrapper driver; the second one would have to be passed on via field metadata to be used in a wrapper driver (and wouldn't work for queries where the geometry comes from PostGIS or the various spatial SQLite function calls rather than straight from a table column). Collecting some rows from the start might be helpful when attempting to convert a geometry column to geoarrow, too (it could try to guess a single geometry type, for example).
Probably the first "geoadbc" driver should be a thin wrapper around OGR's `ArrowArrayStream` export? That may help establish some conventions (e.g., option names).
Worth chatting about in the geoarrow org somewhere too, but also worth considering which pieces of metadata we would need from the upstream driver to make geospatial wrappers since those changes would have to happen here.
It would also be cool to handle geospatial natively where possible (e.g., where "support" just means "add an extension type" and maybe querying some metadata). For Snowflake that would require parsing GeoJSON (but the Snowflake driver is written in Go which might make that easier to do). I imagine it would be useful for us to prototype the wrapper first (at the very least in Python or R).
I just revisited this in the context of https://github.com/apache/arrow/pull/41823 and thought I'd update the reprex here with some improvements to the adbcdrivermanager R package since this happened:
```r
library(adbcdrivermanager)
uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- adbc_connection_init(db)
con |>
read_adbc("SELECT geom FROM basin_point") |>
nanoarrow::infer_nanoarrow_schema()
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children :List of 1
#> ..$ geom:<nanoarrow_schema binary>
#> .. ..$ format : chr "z"
#> .. ..$ name : chr "geom"
#> .. ..$ metadata :List of 1
#> .. .. ..$ ADBC:postgresql:typname: chr "geometry"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
```
Created on 2024-05-28 with reprex v2.1.0
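With that typname metadata exposed, a wrapper no longer has to know geometry columns by name; it can find them from the schema. A sketch, reusing `con` from above:

```r
# Sketch: use the driver-provided typname metadata to find PostGIS
# geometry columns in an inferred schema
schema <- con |>
  read_adbc("SELECT * FROM basin_point") |>
  nanoarrow::infer_nanoarrow_schema()

is_geometry <- vapply(
  schema$children,
  function(child) identical(child$metadata[["ADBC:postgresql:typname"]], "geometry"),
  logical(1)
)
names(schema$children)[is_geometry]
```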