ttomasz opened 9 months ago
Arrow doesn't have a native range type, though the driver should have fallen back to giving you the raw bytes.
What would be a useful way to encode the data?
I imagine you could model it as a structure of two values and two flags. More generally, PostgreSQL seems to have a long tail of exotic data types and it's probably helpful to have generic guidance for those.
Yes, well, the question is whether it's worth trying to define canonical extension types for each of these, or whether we're just going to do something ad hoc in this driver specifically.
What would be a useful way to encode the data?
Probably a struct with lower and upper bound timestamps and boolean flags indicating inclusivity, like Curt mentioned. It could also be dumped to a text representation that would be castable back to the range type if loaded back into Postgres.
I don't know if the latter could be done automatically if the type is not recognized.
The extension type would make it easier for us to round trip data, though, if that's expected. (Also, it's the kind of thing that's harder to change down the line.)
We can't figure out the string representation automatically. We get whatever bytes Postgres gives us and have to interpret them.
This has been discussed before, in case someone would like to advance the discussion:
https://github.com/apache/arrow/issues/28389 https://lists.apache.org/thread/ybrm5m27xc163nmjzn9zwsvbq6cfc0x0
That said, I'm sympathetic to Curt's point that there's going to be quite a few of these, and indeed I'm not sure every such type would want/need a dedicated extension type. CC @paleolimbot @zeroshade for opinions too.
I could see an argument for having a non-canonical extension type (that we just define as part of the Postgres driver), but working with extension types may be a little awkward downstream?
Though it should've fallen back to giving you the raw bytes.
The only other time I've experienced this is when I created a type in the Postgres driver using the same underlying "database" object. We cache the type table in the database object and we don't currently provide a way to invalidate that. It could also be that something about the logic that queries the type table misses the range type, or there is an error when inserting it into the "type resolver". The place where that gets built is here:
I imagine you could model it as a structure of two values and two flags.
I agree that a struct would be the most intuitive. I would have to check how this comes through in COPY output, but I imagine you would need four columns.
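For reference, here's a best-effort sketch of splitting a binary range value into those four pieces. The flag bits follow my reading of PostgreSQL's range wire format (rangetypes.c); verify against the server sources before relying on it:

```python
import struct

# Flag bits as I understand PostgreSQL's binary range format (assumption,
# check rangetypes.c): one flags byte, then a length-prefixed datum for
# each bound that is actually present.
RANGE_EMPTY, RANGE_LB_INC, RANGE_UB_INC = 0x01, 0x02, 0x04
RANGE_LB_INF, RANGE_UB_INF = 0x08, 0x10

def parse_range(buf: bytes) -> dict:
    """Split a binary range value into bounds and inclusivity flags."""
    flags = buf[0]
    pos = 1
    lower = upper = None
    if not flags & (RANGE_EMPTY | RANGE_LB_INF):
        (n,) = struct.unpack_from(">i", buf, pos)
        lower = buf[pos + 4 : pos + 4 + n]
        pos += 4 + n
    if not flags & (RANGE_EMPTY | RANGE_UB_INF):
        (n,) = struct.unpack_from(">i", buf, pos)
        upper = buf[pos + 4 : pos + 4 + n]
    return {
        "lower": lower,
        "upper": upper,
        "lower_inclusive": bool(flags & RANGE_LB_INC),
        "upper_inclusive": bool(flags & RANGE_UB_INC),
        "empty": bool(flags & RANGE_EMPTY),
    }

# Hand-built int4range '[10,20)': flags byte 0x02 (lower bound inclusive),
# then two 4-byte length prefixes each followed by a big-endian int4.
buf = bytes([RANGE_LB_INC]) + struct.pack(">ii", 4, 10) + struct.pack(">ii", 4, 20)
```

Mapping that dict onto Arrow struct fields (plus something for the empty case) lines up with the four-column guess above.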
the question is if it might be worth trying to define canonical extension types for each of these, or if we're going to just do something ad-hoc in this driver specifically
You could probably do it with one extension type (maybe arrow.adbc.postgres), whose serialization held the type OID (which is basically all we have at the point we start constructing a result). Field metadata would do about the same thing, except with an extension type you can customize how those arrays are converted to Python objects.
What we could do about this in the short term is:

1. Make sure there's an option to return the type OID and/or name as field metadata for all fields (so that consumers can implement workarounds if they need more granularity on the Postgres type)
2. Return range types as structs (and make sure they make it into the type resolver)

Hmm, I like the idea of a single generic extension type that just flags "hey, we don't know how to interpret this, but here are the raw bytes". I also like that over just field metadata, since field metadata is a bit of an ad-hoc convention vs. extension types, which are actually meant for this sort of thing (I think).

I wouldn't be opposed to just always returning the OID metadata (this came up elsewhere too).

I'm OK with just encoding ranges as structs, then. Possibly also still wrapped under said generic extension type?
What happened?
On Arrow table fetch I got an error saying that type 3910 (tstzrange) is not implemented.
How can we reproduce the bug?
Create a table with a tstzrange column and try to read it using cursor.fetch_arrow_table().
Environment/Setup
PostgreSQL driver version: 0.9.0
Python: 3.9
pyarrow: 15.0.0
OS: Windows
Install method: pip