apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Import of Map type via the C Data interface drops child field metadata #44714

Closed (paleolimbot closed this issue 3 days ago)

paleolimbot commented 1 week ago

Describe the bug, including details regarding any error messages, version, and platform.

When Map types are received via the C Data interface, field metadata (including extension metadata) is dropped. This seems unintentional given that we maintain that metadata for a list of structs:

import duckdb

duckdb_cursor = duckdb.connect()
duckdb_cursor.execute("SET arrow_lossless_conversion = true")
arrow_table = duckdb_cursor.execute("select map {uuid(): 1::uhugeint, uuid(): 2::uhugeint} as li").arrow()
res = duckdb_cursor.execute("select typeof(li) FROM arrow_table").fetchall()
print ("map type")
print (arrow_table.schema)
print (res)
# map type
# li: map<fixed_size_binary[16], fixed_size_binary[16]>
#   child 0, entries: struct<key: fixed_size_binary[16] not null, value: fixed_size_binary[16]> not null
#       child 0, key: fixed_size_binary[16] not null
#       child 1, value: fixed_size_binary[16]
# [('MAP(BLOB, BLOB)',)]

arrow_table = duckdb_cursor.execute("select [{'keys': uuid(), 'values': uuid()}] as li").arrow()
res = duckdb_cursor.execute("select typeof(li) FROM arrow_table").fetchall()
print ("list of structs type")
print (arrow_table.schema)
print (res)
# list of structs type
# li: list<l: struct<keys: fixed_size_binary[16], values: fixed_size_binary[16]>>
#   child 0, l: struct<keys: fixed_size_binary[16], values: fixed_size_binary[16]>
#       child 0, keys: fixed_size_binary[16]
#       -- field metadata --
#       ARROW:extension:metadata: ''
#       ARROW:extension:name: 'arrow.uuid'
#       child 1, values: fixed_size_binary[16]
#       -- field metadata --
#       ARROW:extension:metadata: ''
#       ARROW:extension:name: 'arrow.uuid'
# [('STRUCT(keys UUID, "values" UUID)[]',)]

This occurs because the import code reconstructs the map's child fields to canonicalize their names, and the rebuilt fields do not carry over the original field metadata:

https://github.com/apache/arrow/blob/d7bc3788ea2773399b7ef489438c725999bfa83d/cpp/src/arrow/c/bridge.cc#L1298-L1321
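To illustrate the mechanism, here is a minimal sketch in plain Python. It is a toy model, not Arrow's API: `ToyField`, `canonicalize_by_rebuilding`, and `canonicalize_by_renaming` are hypothetical names invented for this example. It only shows why rebuilding a field from its name, type, and nullability loses metadata, while copying the field and changing only the name keeps it. It makes no claim about how the linked C++ code or any fix is actually written.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for an Arrow field (not the Arrow API)."""
    name: str
    type: str
    nullable: bool = True
    metadata: Optional[Dict[str, str]] = None


def canonicalize_by_rebuilding(f: ToyField, canonical_name: str) -> ToyField:
    # Builds a new field from name, type, and nullability only,
    # so any metadata on the incoming field is silently dropped.
    return ToyField(canonical_name, f.type, f.nullable)


def canonicalize_by_renaming(f: ToyField, canonical_name: str) -> ToyField:
    # Copies the incoming field and changes only its name,
    # so the metadata travels along with it.
    return replace(f, name=canonical_name)


uuid_meta = {
    "ARROW:extension:name": "arrow.uuid",
    "ARROW:extension:metadata": "",
}
# A producer that uses the non-canonical name "keys" for the map key field.
incoming = ToyField("keys", "fixed_size_binary[16]", nullable=False, metadata=uuid_meta)

rebuilt = canonicalize_by_rebuilding(incoming, "key")
renamed = canonicalize_by_renaming(incoming, "key")

print(rebuilt.metadata)  # metadata lost
print(renamed.metadata)  # metadata preserved
```

Both helpers produce a field named `key`. Only the second one still carries the extension metadata that a consumer needs to recognize the column as a UUID rather than a plain fixed-size binary.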

I think that we don't have that problem in the IPC type conversion:

https://github.com/apache/arrow/blob/d7bc3788ea2773399b7ef489438c725999bfa83d/cpp/src/arrow/ipc/metadata_internal.cc#L393-L395

Component(s)

C++

pitrou commented 3 days ago

Issue resolved by pull request 44715 https://github.com/apache/arrow/pull/44715