Open ianmcook opened 2 months ago
@joellubi are you doing the equivalent of this for the Go implementation in #43679?
Should we wait until [1] is merged?
@mapleFU Yes, or else use that as the base branch
Yes, that PR adds the capability for Arrow extension types to specify their target Parquet logical type, and implements it for UUID and JSON (see the relevant test case).
The type mapping going the other way (Parquet -> Arrow) was not added for UUID; right now it uses the storage type FixedSizeBinary. I can get this added too.
Thanks! It would be great to have it implemented in both directions.
If you need a Parquet file that contains a UUID column (for testing purposes), DuckDB can write one like this:
import duckdb
con = duckdb.connect()
con.execute("CREATE TABLE t1 AS SELECT gen_random_uuid() a FROM range(0, 16);")
con.execute("copy t1 to 'uuid_test.parquet'")
Thanks @pdet for this example code.
I don't have time to work on this in the next two weeks, but I'm glad to help review it.
For reference, the C++ JSON extension type proposal already includes Parquet serialization.
https://github.com/apache/arrow/pull/37298 is merged now
Thanks @rok, let's get https://github.com/apache/arrow/pull/13901/files merged quickly. I'm focused on supporting List in Join for these two weeks, but I'll do a careful review pass on this PR.
Is this done with Arrow 18 and its UUID support?
The UUID extension type is supported in Arrow 18. I don't think it'll get serialized to the UUID logical type in Parquet the way JSON does (to the JSON logical type), but I'd expect it to round-trip OK to Parquet in some cases. What case are you looking to cover, @raphaelauv?
Describe the enhancement requested
As a follow up to issue #15058 / PR #37298:
Parquet's UUID logical type is directly equivalent to Arrow's UUID canonical extension type.
After we have native support for UUID in Arrow C++ and PyArrow, it would be lovely if:
This would improve interoperability with other components that read/write Parquet files and support Parquet's UUID type.
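As a rough illustration of why the two types line up: Parquet's UUID logical type annotates a 16-byte fixed-length binary value in big-endian order, and Arrow's UUID extension type uses 16-byte FixedSizeBinary storage. The sketch below uses only Python's standard `uuid` module to show that byte layout. The helper names `uuid_to_storage` and `storage_to_uuid` are hypothetical and are not part of any Arrow or Parquet API.

```python
import uuid


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian form of a UUID (hypothetical helper)."""
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 bytes of fixed-size binary storage (hypothetical helper)."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("12345678-1234-5678-1234-567812345678")
raw = uuid_to_storage(example)

# The storage value is always exactly 16 bytes and converts back losslessly.
assert len(raw) == 16
assert storage_to_uuid(raw) == example
print(raw.hex())  # 12345678123456781234567812345678
```

Because both formats share this 16-byte representation, mapping between them should mostly be a matter of attaching the right type annotation rather than converting the data itself.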
Component(s)
C++, Python