apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
384 stars 96 forks source link

Support for Postgres extension types? #546

Open kylebarron opened 1 year ago

kylebarron commented 1 year ago

I'm wondering if anyone has given any thought to how a user might extend ADBC's default types.

In particular, I'm curious if it would be possible to add support for PostGIS's geospatial types (via the geoarrow encoding). I'm guessing these sort of extension types wouldn't make sense to declare as part of arrow-adbc itself; would there maybe be a way to inject other type handling into the TypeMapping?

paleolimbot commented 1 year ago

Do you happen to know how a POSTGIS geometry arrives over the wire via COPY? (I suppose one could open a debugger and check...I think it might be the bytes of the gserialized version of the geometry). I obviously have a vested interest in making this a thing (also for SQLite + GeoPackage, which has a similar issue where the way the geometry arrives by default is not quite how we'd want it represented in query output.

One approach would be to fork the Postgres and SQLite drivers in the geoarrow org and re-package them with geometry support.

kylebarron commented 1 year ago

Do you happen to know how a POSTGIS geometry arrives over the wire via COPY?

I don't. That indeed seems like the first thing to establish. I thought that it serialized to WKB by default but maybe that happens on the client and not on the wire itself.

One approach would be to fork the Postgres and SQLite drivers in the geoarrow org and re-package them with geometry support.

Would that work well with ADBC? Forking postgres and sqlite drivers seems like a big undertaking. I was imagining/hoping it should be possible to have some small function that just "plugs in" and converts to geoarrow without a large fork

paleolimbot commented 1 year ago

I agree that would be ideal although I suspect (but don't know) that extensibility isn't likely to be a target for a bit and so I wonder if the first target should be a workaround of some kind.

Another workaround would be to write a driver-that-wraps-a-driver...basically make a driver wrapper that passes along all the method calls but wraps the output ArrowArrayStream with an implementation that parses the raw output, releases the source array, and replaces it with the parsed version. Not ideal because there's one more copy than is theoretically necessary, but would at least get everybody's workflow sorted (e.g., use the ADBC postgis driver and get geoarrow output!).

kylebarron commented 1 year ago

Another workaround would be to write a driver-that-wraps-a-driver...basically make a driver wrapper that passes along all the method calls but wraps the output ArrowArrayStream with an implementation that parses the raw output, releases the source array, and replaces it with the parsed version

That sounds ideal to me to get started. I guess the first thing to check is what the current ADBC driver behavior is with postgis types. Would/could it return a BinaryArray with whatever blobs are returned from the postgis wire format, which a wrapper could convert into geoarrow? Based on the type support doc I was guessing it would fail on non-default types, but I haven't tested this yet.

paleolimbot commented 1 year ago

An easier initial target is probably GeoPackage + SQLite...those values almost certainly come through as binary and I've already written the bit that strips the WKB header ( https://github.com/paleolimbot/gpkg/blob/master/src/gpkg.cpp#L196-L232 ). Then the same logic could be applied to the postgres driver. I imagine that both drivers will need some updates to pass on more information about column types as schema metadata.

lidavidm commented 1 year ago

Right now it should fail.

My thought is

paleolimbot commented 1 year ago

After #636 this should work out-of-the-box. It looks like geometry columns are sent across the wire as EWKB:

# docker run -it --rm \
#   -e POSTGRES_PASSWORD=password \
#   -e POSTGRES_USER=postgres \
#   -p 5432:5432 \
#   postgis/postgis:latest
# ogr2ogr \
#   -f PostgreSQL PG:"host=127.0.0.1 user=postgres password=password dbname=postgres" \
#   /vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-basin_point.gpkg \
#   -nln basin_point

# [libpq] Column #7 ("geom") has unknown type code 18011

library(adbcdrivermanager)

uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- adbc_connection_init(db)

stmt <- adbc_statement_init(con)
stream <- nanoarrow::nanoarrow_allocate_array_stream()

adbc_statement_set_sql_query(stmt, "SELECT * FROM basin_point")
adbc_statement_execute_query(stmt, stream)
#> [1] -1
(result <- tibble::as_tibble(stream))
#> # A tibble: 46 × 7
#>      fid objectid feat_code basin_name river                hid             geom
#>    <int>    <dbl> <chr>     <chr>      <chr>                <chr>         <blob>
#>  1     1        1 WABA30    01EB000    BARRINGTON/CLYDE     919201D6… <raw 25 B>
#>  2     2        2 WABA30    01EC000    ROSEWAY/SABLE/JORDAN 5293753C… <raw 25 B>
#>  3     3        3 WABA30    01EA000    TUSKET RIVER         A7592A93… <raw 25 B>
#>  4     4        4 WABA30    01DA000    METEGHAN             47EF929A… <raw 25 B>
#>  5     5        5 WABA30    01ED000    MERSEY               425CA3DB… <raw 25 B>
#>  6     6        6 WABA30    01EE000    HERRING COVE/MEDWAY  C7BF151C… <raw 25 B>
#>  7     7        7 WABA30    01EG000    GOLD                 7C6E0383… <raw 25 B>
#>  8     8        8 WABA30    01EF000    LAHAVE               B3D68107… <raw 25 B>
#>  9     9        9 WABA30    01EJ000    SACKVILLE            4FA406BD… <raw 25 B>
#> 10    10       10 WABA30    01EH000    EAST/INDIAN RIVER    72FEA676… <raw 25 B>
#> # ℹ 36 more rows
wk::as_wkb(result$geom)
#> <wk_wkb[46]>
#>  [1] <SRID=900914;POINT (277022.6 4820886)>
#>  [2] <SRID=900914;POINT (315701.1 4855050)>
#>  [3] <SRID=900914;POINT (255728.5 4851021)>
#>  [4] <SRID=900914;POINT (245206.7 4895608)>
#>  [5] <SRID=900914;POINT (337143 4860311)>  
#>  [6] <SRID=900914;POINT (370526.6 4896768)>
#>  [7] <SRID=900914;POINT (396155.5 4928849)>
#>  [8] <SRID=900914;POINT (380065.5 4918188)>
#>  [9] <SRID=900914;POINT (449560.5 4927417)>
#> [10] <SRID=900914;POINT (417856.9 4932181)>
#> [11] <SRID=900914;POINT (266163.3 4950370)>
#> [12] <SRID=900914;POINT (512409.7 4958059)>
#> [13] <SRID=900914;POINT (411009.1 4970316)>
#> [14] <SRID=900914;POINT (486375.9 4946504)>
#> [15] <SRID=900914;POINT (578532 4975507)>  
#> [16] <SRID=900914;POINT (528643.6 4989779)>
#> [17] <SRID=900914;POINT (430443.1 5003879)>
#> [18] <SRID=900914;POINT (384152.3 4987549)>
#> [19] <SRID=900914;POINT (598568 5004793)>  
#> [20] <SRID=900914;POINT (468882.8 4991420)>
#> [21] <SRID=900914;POINT (664220.6 4997062)>
#> [22] <SRID=900914;POINT (316224.4 4988672)>
#> [23] <SRID=900914;POINT (566212 5021547)>  
#> [24] <SRID=900914;POINT (432768.6 5030333)>
#> [25] <SRID=900914;POINT (478632.6 5029750)>
#> [26] <SRID=900914;POINT (625039.5 5038466)>
#> [27] <SRID=900914;POINT (357606 5039295)>  
#> [28] <SRID=900914;POINT (526828 5044274)>  
#> [29] <SRID=900914;POINT (667060.1 5053682)>
#> [30] <SRID=900914;POINT (499160.7 5068941)>
#> [31] <SRID=900914;POINT (449106.9 5068988)>
#> [32] <SRID=900914;POINT (557031.5 5066144)>
#> [33] <SRID=900914;POINT (576642 5050958)>  
#> [34] <SRID=900914;POINT (406967.3 5085306)>
#> [35] <SRID=900914;POINT (606687.9 5063043)>
#> [36] <SRID=900914;POINT (418753.5 5086491)>
#> [37] <SRID=900914;POINT (653615.8 5083572)>
#> [38] <SRID=900914;POINT (395483.2 5079346)>
#> [39] <SRID=900914;POINT (668426.1 5118705)>
#> [40] <SRID=900914;POINT (614040.2 5097386)>
#> [41] <SRID=900914;POINT (720114.2 5094879)>
#> [42] <SRID=900914;POINT (697678 5137499)>  
#> [43] <SRID=900914;POINT (644382.4 5130242)>
#> [44] <SRID=900914;POINT (663219.1 5197865)>
#> [45] <SRID=900914;POINT (701578 5185235)>  
#> [46] <SRID=900914;POINT (739542 4873410)>

Created on 2023-05-03 with reprex v2.0.2

paleolimbot commented 1 year ago

FWIW, I did learn in trying this out that PostGIS has a custom send/recv definition. Geoarrow could define a "geoarrow.ewkb" extension type to acomodate this...in that case, the driver would just have to attach an extension name which is probably doable.

kylebarron commented 1 year ago

Reading through this again... this is really cool and I want to try this first-hand soon. Maybe we should move discussion to the geoarrow repo, but I wonder here how we should handle the SRID=900914. Should the onus be on the user to also query the spatial ref sys table so that it knows what the SRID defines? Otherwise the user doesn't have access to the same SRID table

paleolimbot commented 1 year ago

It's a good point...going from value-level SRID to column-level SRID is hard and would require either:

The first option could be done in a wrapper driver...the second one would have to be passed on via field metadata to be used in a wrapper driver (and wouldn't work for queries with PostGIS or the various spatial SQLite function calls). Collecting some rows from the start might be helpful when attempting to convert a geometry column to geoarrow, too (it could try to guess a single geometry type, for example).

Probably the first "geoadbc" driver should be a thin wrapper around OGR's ArrowArrayStream export? That may help establish some conventions (e.g., option names).

Worth chatting about in the geoarrow org somewhere too, but also worth considering which pieces of metadata we would need from the upstream driver to make geospatial wrappers since those changes would have to happen here.

paleolimbot commented 1 year ago

It would also be cool to handle geospatial natively where possible (e.g., where "support" just means "add an extension type" and maybe querying some metadata). For Snowflake that would require parsing GeoJSON (but the Snowflake driver is written in Go which might make that easier to do). I imagine it would be useful for us to prototype the wrapper first (at the very least in Python or R).

paleolimbot commented 5 months ago

I just revisted this in the context of https://github.com/apache/arrow/pull/41823 and thought I'd update the reprex here with some improvements to the adbcdrivermanager R package since this happened:

library(adbcdrivermanager)

uri <- "postgresql://localhost:5432/postgres?user=postgres&password=password"
db <- adbc_database_init(adbcpostgresql::adbcpostgresql(), uri = uri)
con <- adbc_connection_init(db)
con |> 
  read_adbc("SELECT geom FROM basin_point") |> 
  nanoarrow::infer_nanoarrow_schema()
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  :List of 1
#>   ..$ geom:<nanoarrow_schema binary>
#>   .. ..$ format    : chr "z"
#>   .. ..$ name      : chr "geom"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ ADBC:postgresql:typname: chr "geometry"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

Created on 2024-05-28 with reprex v2.1.0