aljazerzen / connector_arrow

Apache Arrow database client for many databases.
https://docs.rs/connector_arrow
MIT License

Expose upstream schema #18

Open · kylebarron opened this issue 5 months ago

kylebarron commented 5 months ago

👋 I saw you were looking for feedback, and I was curious whether you have, or are interested in adding, support for exposing user-defined data types. In particular, I'm working on building out support for geospatial data in Arrow in https://github.com/geoarrow/geoarrow-rs. We have a working but limited implementation that reads from PostGIS databases. It connects directly to sqlx, and it would be great to use a library like yours that focuses on converting database tables -> Arrow. But we need to be able to access the type name, which is "geometry" or "geography" for a PostGIS column.

I haven't looked through your API yet, but I think one way this could work is if you exposed both the Postgres schema and the inferred Arrow schema. Then geoarrow could access the upstream schema and know that a BinaryArray column actually represents geospatial data.
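
Something like this, purely as a sketch of the shape I have in mind (none of these names exist in connector_arrow; they're made up for illustration):

```rust
use arrow::datatypes::SchemaRef;

/// Hypothetical description of one upstream column.
pub struct UpstreamColumn {
    pub name: String,
    /// Database type name, e.g. "geometry" or "geography" for PostGIS columns.
    pub type_name: String,
}

/// Hypothetical pair of schemas a caller could inspect side by side.
pub struct QuerySchemas {
    /// The database's own view of the result columns.
    pub upstream: Vec<UpstreamColumn>,
    /// The Arrow schema that connector_arrow inferred for the same columns.
    pub arrow: SchemaRef,
}
```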

aljazerzen commented 5 months ago

Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?

To answer your question: yes it is totally doable to expose the original type of a column in a query response.

A sidenote: for Postgres, response types are returned just as a u32 OID, so we'd need to make additional queries to the database to figure out the actual type name.
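
Roughly, that lookup would be something like this (just an illustration using the `postgres` crate, not existing connector_arrow code):

```rust
use postgres::{Client, Error};

/// Illustrative only: resolve a type OID to its name via the pg_type catalog.
fn type_name_for_oid(client: &mut Client, oid: u32) -> Result<Option<String>, Error> {
    // In rust-postgres, `u32` maps to the Postgres `oid` type, so it can be
    // bound directly as the query parameter.
    let row = client.query_opt(
        "SELECT typname FROM pg_catalog.pg_type WHERE oid = $1",
        &[&oid],
    )?;
    Ok(row.map(|r| r.get::<_, String>(0)))
}
```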

In this scenario, connector_arrow would use a Postgres Client to make a request, then convert PostGIS data into plain arrow types and then pass that to geoarrow-rs, right? So POINT would probably be converted into a List of Float64? And POLYGON would probably be converted into a List of List of Float64?

This does add another step into the data pipeline, but it would be the easiest to implement.


The title of the post suggests something different: dynamically specifying how certain database types have to be converted into Arrow types.

For the record, this is much harder to do, since it needs to plug into my machinery of mapping types and providing data converters.

kylebarron commented 5 months ago

> Oh, interesting. So geoarrow-rs defines new types of array for each of the geometry types?

Yeah, geoarrow-rs follows the GeoArrow specification. GeoArrow follows the same "Simple Features" spec as PostGIS, so GeoArrow also has Point, LineString, Polygon, etc. types like PostGIS, but they're stored differently. PostGIS has a custom internal format IIRC, while in GeoArrow a two-dimensional PointArray is physically a FixedSizeList[2, Float64]. (A Polygon array has two levels of nesting above the coordinates, so it's List[List[FixedSizeList[2, Float64]]].)
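
For concreteness, those physical types look roughly like this with a recent version of the arrow crate (the child field names here are just illustrative):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

// PointArray: interleaved x/y coordinates as FixedSizeList[2, Float64].
fn point_type() -> DataType {
    DataType::FixedSizeList(Arc::new(Field::new("xy", DataType::Float64, false)), 2)
}

// PolygonArray: two levels of list nesting above the coordinates,
// i.e. List[List[FixedSizeList[2, Float64]]] (rings of vertices).
fn polygon_type() -> DataType {
    let vertices = DataType::List(Arc::new(Field::new("vertices", point_type(), false)));
    DataType::List(Arc::new(Field::new("rings", vertices, false)))
}
```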

> In this scenario, connector_arrow would use a Postgres Client to make a request, then convert PostGIS data into plain arrow types and then pass that to geoarrow-rs, right?

PostGIS data comes over the wire as Well-Known Binary, which encodes each geometry in a well-specified binary format. This would come across as an Arrow BinaryArray. I'm not suggesting that connector_arrow manage conversion from the binary format to the Arrow-native PointArray etc.; that's all done in geoarrow-rs. Rather, we just need to know the original Postgres schema so that we know a given binary column actually represents geometries.

> A sidenote: for Postgres, response types are returned just as a u32 OID, so we'd need to make additional queries to the database to figure out the actual type name

Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.

> The title of the post suggests something different: dynamically specifying how certain database types have to be converted into Arrow types.

I don't think I need anything truly dynamic.

kylebarron commented 5 months ago

One potential way to do this, though it would have less type safety, would be to include the upstream type name in the Arrow field metadata. Then you presumably wouldn't need to change the API.
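
As a rough sketch of what I mean (the metadata key here is made up; whatever key connector_arrow picks would work):

```rust
use std::collections::HashMap;
use arrow::datatypes::{DataType, Field};

// Illustrative only: a geometry column surfaced as Binary (WKB), with the
// upstream Postgres type name attached as field metadata under a made-up key.
fn geometry_field(column_name: &str, pg_type_name: &str) -> Field {
    Field::new(column_name, DataType::Binary, true).with_metadata(HashMap::from([(
        "connector_arrow::upstream_type".to_string(),
        pg_type_name.to_string(),
    )]))
}
```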

aljazerzen commented 5 months ago

> ... include the upstream type name in the Arrow field metadata

> PostGIS data comes over the wire as Well-Known Binary

Oh, this is great! So connector_arrow would just convert the binary buffers from the Postgres wire protocol into a BinaryArray and attach the name of the original type to the field metadata.

I wouldn't say that this is less type-safe.

I like this approach because it also works with any other Postgres extension or non-trivial built-in type. It keeps connector_arrow focused on converting rows into RecordBatches and leaves custom type deserialization to downstream crates. Do one thing and do it well :D
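
On the geoarrow-rs side, picking out geometry columns could then be as simple as this sketch (using the same hypothetical metadata key as above):

```rust
use arrow::datatypes::Field;

// Illustrative only: decide whether a Binary column holds PostGIS geometries
// by inspecting the (made-up) metadata key attached by the connector.
fn is_postgis_geometry(field: &Field) -> bool {
    matches!(
        field
            .metadata()
            .get("connector_arrow::upstream_type")
            .map(String::as_str),
        Some("geometry") | Some("geography")
    )
}
```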

aljazerzen commented 5 months ago

> A sidenote: for Postgres, response types are returned just as a u32 OID, so we'd need to make additional queries to the database to figure out the actual type name

> Yeah... I'm a little hazy on those details. I think sqlx might sometimes make those additional queries transparently to the user.

Ignore this, I was wrong. This is handled by the postgres crate (or rather tokio-postgres), as it should be:

https://github.com/sfackler/rust-postgres/blob/master/tokio-postgres/src/prepare.rs#L134-L186

aljazerzen commented 5 months ago

Implemented in e4fd3ca74965c8a2f5237b8352612623a68c9454

@kylebarron, please take a look and verify that this covers your use case.