apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Python] (De)serialize Arrow UUID to/from Parquet UUID #43807

Open ianmcook opened 2 months ago

ianmcook commented 2 months ago

Describe the enhancement requested

As a follow up to issue #15058 / PR #37298:

Parquet's UUID logical type is directly equivalent to Arrow's UUID canonical extension type.

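After we have native support for UUID in Arrow C++ and PyArrow, it would be lovely if Arrow UUID values were serialized to, and deserialized from, Parquet's UUID logical type, as the issue title describes.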

This would improve interoperability with other components that read/write Parquet files and support Parquet's UUID type.
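
To make the equivalence concrete: Parquet's UUID logical type annotates a 16-byte fixed-length byte array holding the UUID in big-endian order, and Arrow's UUID extension type uses FixedSizeBinary(16) as storage. The sketch below uses only Python's standard `uuid` module to show that byte layout. The helper names `uuid_to_storage` and `storage_to_uuid` are hypothetical and are not part of Arrow or Parquet.

```python
import uuid

# Width shared by Parquet's FIXED_LEN_BYTE_ARRAY(16) and Arrow's FixedSizeBinary(16).
UUID_BYTE_WIDTH = 16


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes that represent the UUID (hypothetical helper)."""
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 stored bytes, rejecting any other width (hypothetical helper)."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
stored = uuid_to_storage(original)
restored = storage_to_uuid(stored)
```

Because the stored bytes are identical on both sides, a reader or writer only needs to carry the type annotation across; no value conversion should be required.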

Component(s)

C++, Python

ianmcook commented 2 months ago

@joellubi are you doing the equivalent of this for the Go implementation in #43679?

mapleFU commented 2 months ago

Should we wait until [1] is merged?

[1] https://github.com/apache/arrow/pull/37298

ianmcook commented 2 months ago

@mapleFU Yes, or else use that as the base branch

joellubi commented 2 months ago

> @joellubi are you doing the equivalent of this for the Go implementation in #43679?

Yes, that PR adds the capability for Arrow extension types to specify their target Parquet logical type, and implements it for UUID and JSON (see the relevant test case).

The type mapping going the other way (Parquet -> Arrow) was not added for UUID; right now it uses the storage type FixedSizeBinary. I can get this added too.
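
The asymmetry between the two directions can be pictured with a toy model. This is only a sketch, not the Go implementation: the two dictionaries and both functions are hypothetical, and the JSON entry in the read-side table is assumed purely for illustration. The extension names `arrow.uuid` and `arrow.json` and the logical type names `UUID` and `JSON` are the real canonical names.

```python
from typing import Optional

# Hypothetical write-side registry: Arrow extension name -> Parquet logical type.
ARROW_TO_PARQUET = {"arrow.uuid": "UUID", "arrow.json": "JSON"}

# Hypothetical read-side registry. UUID is left out on purpose to mirror the
# gap described above; the JSON entry is an assumption for illustration.
PARQUET_TO_ARROW = {"JSON": "arrow.json"}


def parquet_logical_type(extension_name: str) -> Optional[str]:
    """Return the Parquet logical type an extension type writes as, if any."""
    return ARROW_TO_PARQUET.get(extension_name)


def arrow_type_on_read(logical_type: str, storage_type: str) -> str:
    """Return the Arrow type a column reads as, falling back to the storage type."""
    return PARQUET_TO_ARROW.get(logical_type, storage_type)


# Without a read-side entry, a UUID column comes back as its storage type.
before = arrow_type_on_read("UUID", "fixed_size_binary[16]")

# Registering the reverse mapping makes the round trip symmetric.
PARQUET_TO_ARROW["UUID"] = "arrow.uuid"
after = arrow_type_on_read("UUID", "fixed_size_binary[16]")
```

In this model the data on disk is the same in both cases; only the type the reader reports changes once the reverse entry exists.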

ianmcook commented 2 months ago

> The type mapping going the other way (Parquet -> Arrow) was not added for UUID, right now it uses the storage type FixedSizeBinary. I can get this added too.

Thanks! It would be great to have it implemented in both directions.

ianmcook commented 2 months ago

If you need a Parquet file that contains a UUID column (for testing purposes), DuckDB can write one like this:

import duckdb

con = duckdb.connect()

# Create a table with 16 random UUID values in column "a".
con.execute("CREATE TABLE t1 AS SELECT gen_random_uuid() a FROM range(0, 16);")

# Write the table to a Parquet file.
con.execute("COPY t1 TO 'uuid_test.parquet'")

Thanks @pdet for this example code.

mapleFU commented 2 months ago

I have no time to work on this in the next two weeks, but I'm glad to help review it.

rok commented 2 months ago

For reference, the C++ JSON extension type proposal already includes Parquet serialization.

ianmcook commented 2 months ago

https://github.com/apache/arrow/pull/37298 is merged now

mapleFU commented 2 months ago

Thanks @rok, let's get https://github.com/apache/arrow/pull/13901/files merged quickly. I'm focused on supporting List in Join for these two weeks, but I'll take a careful review pass on this PR.

raphaelauv commented 2 weeks ago

Is this done with Arrow 18 and its UUID support?

rok commented 2 weeks ago

The UUID extension type is supported in Arrow 18. I don't think it'll get serialized to the UUID logical type in Parquet like JSON does (to the JSON logical type), but I'd expect it to round-trip OK to Parquet in some cases. What case are you looking to cover, @raphaelauv?