kylebarron opened 8 months ago
Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?
To answer your question: yes it is totally doable to expose the original type of a column in a query response.
A sidenote: for Postgres, response types are returned just as a `u32` oid, so we'd need to make additional queries to the database to figure out the actual name.
In this scenario, `connector_arrow` would use a Postgres `Client` to make a request, then convert PostGIS data into plain arrow types and pass that to geoarrow-rs, right? So POINT would probably be converted into a `List` of `Float64`? And POLYGON would probably be converted into a `List` of `List` of `Float64`?
This does add another step into the data pipeline, but it would be the easiest to implement.
The title of the post suggests something different: to dynamically specify how certain database types have to be converted into arrow types.
For the record, this is much harder to do, since it needs to plug into my machinery for mapping types and providing data converters.
> Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?
Yeah, geoarrow-rs follows the GeoArrow specification. GeoArrow follows the same "Simple Features" spec as PostGIS, so GeoArrow also has Point, LineString, Polygon, etc. types like PostGIS, but they're stored differently. PostGIS has a custom internal format IIRC, while in GeoArrow, yes, a two-dimensional `PointArray` is physically a `FixedSizeList[2, Float64]`. (A Polygon array has two levels of nesting above the coordinates, so it's `List[List[FixedSizeList[2, Float64]]]`.)
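To make the layout above concrete, here's a std-only sketch (plain slices instead of the geoarrow-rs or arrow types; the function names are illustrative, not API) of how a `FixedSizeList[2, Float64]` coordinate buffer and the two extra offset levels of a Polygon array fit together:

```rust
/// A 2-D point array's coordinates, interleaved as [x0, y0, x1, y1, ...]:
/// this is the child buffer of a FixedSizeList[2, Float64].
fn point(coords: &[f64], i: usize) -> (f64, f64) {
    (coords[2 * i], coords[2 * i + 1])
}

/// A polygon array adds two levels of offsets on top of the coords:
/// geometry offsets -> ring offsets -> FixedSizeList[2, Float64] coords.
fn ring_coords<'a>(
    geom_offsets: &[usize], // outer List: which rings belong to which polygon
    ring_offsets: &[usize], // inner List: which coords belong to which ring
    coords: &'a [f64],
    polygon: usize,
    ring: usize,
) -> &'a [f64] {
    let r = geom_offsets[polygon] + ring;
    let (start, end) = (ring_offsets[r], ring_offsets[r + 1]);
    &coords[2 * start..2 * end]
}

fn main() {
    // Two points: (1.0, 2.0) and (3.0, 4.0)
    let coords = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(point(&coords, 1), (3.0, 4.0));

    // One polygon with a single triangular ring (closed: first == last vertex)
    let poly_coords = [0.0, 0.0, 4.0, 0.0, 0.0, 4.0, 0.0, 0.0];
    let geom_offsets = [0, 1]; // polygon 0 owns ring 0
    let ring_offsets = [0, 4]; // ring 0 owns vertices 0..4
    let ring = ring_coords(&geom_offsets, &ring_offsets, &poly_coords, 0, 0);
    println!(
        "point 1 = {:?}, ring 0 has {} vertices",
        point(&coords, 1),
        ring.len() / 2
    );
}
```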
> In this scenario, `connector_arrow` would use a Postgres `Client` to make a request, then convert PostGIS data into plain arrow types and pass that to geoarrow-rs, right?
PostGIS data comes over the wire as Well-Known Binary, which packs each row into a well-specified binary format. This would come across as an Arrow `BinaryArray`. I'm not suggesting that `connector_arrow` manage conversion from the binary format to the arrow-native `PointArray` etc.; that's all done in geoarrow-rs. Rather, we just need to know the original Postgres schema so that we know a given binary column actually represents geometries.
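For a sense of what those binary blobs contain, here's a minimal std-only sketch that decodes one Well-Known Binary value (a little-endian 2-D Point). A real reader handles both byte orders, SRIDs, and every geometry type; this only illustrates why a `BinaryArray` of WKB blobs is enough for a downstream crate to reconstruct geometries:

```rust
/// Decode a WKB blob that encodes a little-endian 2-D Point:
/// byte 0 = byte order (1 = little-endian),
/// bytes 1..5 = geometry type as u32 (1 = Point),
/// bytes 5..21 = x and y as f64.
fn parse_wkb_point(wkb: &[u8]) -> Option<(f64, f64)> {
    if wkb.len() != 21 || wkb[0] != 1 {
        return None; // not a little-endian 2-D point blob
    }
    let geom_type = u32::from_le_bytes(wkb[1..5].try_into().ok()?);
    if geom_type != 1 {
        return None; // not a Point
    }
    let x = f64::from_le_bytes(wkb[5..13].try_into().ok()?);
    let y = f64::from_le_bytes(wkb[13..21].try_into().ok()?);
    Some((x, y))
}

fn main() {
    // Hand-built WKB for POINT(1.0 2.0)
    let mut wkb = vec![1u8]; // little-endian marker
    wkb.extend_from_slice(&1u32.to_le_bytes()); // geometry type = Point
    wkb.extend_from_slice(&1.0f64.to_le_bytes());
    wkb.extend_from_slice(&2.0f64.to_le_bytes());
    println!("decoded {:?}", parse_wkb_point(&wkb));
}
```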
> A sidenote: for Postgres, response types are returned just as a `u32` oid, so we'd need to make additional queries to the database to figure out the actual name
Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.
> The title of the post suggests something different: to dynamically specify how certain database types have to be converted into arrow types.
I don't think I need anything truly dynamic.
One potential way to do this, though it would have less type safety, would be to include upstream metadata on the Arrow field metadata. Then you presumably wouldn't need to change the API.
> ... include upstream metadata on the Arrow field metadata

> PostGIS data comes over the wire as Well-Known Binary
Oh this is great! So `connector_arrow` will just be converting the binary buffers from the Postgres wire protocol into a `BinaryArray` and attaching the name of the original type into the metadata.
I wouldn't say that this is less type-safe.
I also like this approach because it works with any other Postgres extension or built-in non-trivial type. It makes `connector_arrow` only deal with converting rows into RecordBatches and leaves custom type deserialization to downstream crates. Do one thing and do it well :D
> A sidenote: for Postgres, response types are returned just as a `u32` oid, so we'd need to make additional queries to the database to figure out the actual name

> Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.
Ignore this, I was wrong. This is handled by the `postgres` crate (or rather `tokio-postgres`), as it should be:
https://github.com/sfackler/rust-postgres/blob/master/tokio-postgres/src/prepare.rs#L134-L186
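In spirit, the linked code resolves an unknown response oid by querying the database's catalog once and caching the result. Here's a std-only sketch of that idea with the actual query stubbed out (the struct and method names are invented for illustration, not `tokio-postgres` API):

```rust
use std::collections::HashMap;

/// Caches oid -> type-name lookups so each unknown oid is resolved
/// against the database at most once.
struct TypeCache {
    names: HashMap<u32, String>,
}

impl TypeCache {
    fn type_name(&mut self, oid: u32) -> &str {
        self.names.entry(oid).or_insert_with(|| {
            // A real implementation would query the catalog over the same
            // connection here; stubbed out for this sketch.
            format!("unknown_oid_{oid}")
        })
    }
}

fn main() {
    // Pre-seed one well-known entry; 17 is illustrative here.
    let mut cache = TypeCache {
        names: HashMap::from([(17u32, "bytea".to_string())]),
    };
    println!("oid 17 -> {}", cache.type_name(17));
    println!("oid 999 -> {}", cache.type_name(999));
}
```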
Implemented in e4fd3ca74965c8a2f5237b8352612623a68c9454
@kylebarron, please take a look and verify that this covers your use case.
👋 I saw you were looking for feedback and I was curious whether you have support for, or are interested in exposing support for, user-defined data types? In particular, I'm working on building out support for geospatial in arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It directly connects to `sqlx`, and it would be great to use a library like yours that focuses on converting database tables -> arrow. But we need to be able to access the type name, which is "geometry" or "geography" on a PostGIS column.

I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema? Because then geoarrow could access the upstream schema and know that the `BinaryArray` column actually represents geospatial data.