apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Can't convert from df column with type uuid.UUID to str #44224

Open davidsteinar opened 1 month ago

davidsteinar commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

See reproducible example:

import pandas as pd
import uuid
import pyarrow as pa

# Create a DataFrame with UUID objects
data = {'MUID': [uuid.uuid4() for _ in range(5)],
        'Data': range(5)}

df = pd.DataFrame(data)

# Convert the DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# Raises:
# ArrowInvalid: ("Could not convert UUID('ffb0c97b-7a25-4fce-9e4e-645715ca5ae8') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column MUID with type object')

Component(s)

Python

amoeba commented 1 month ago

Hello @davidsteinar, thanks for the report. PyArrow doesn't support this at the moment but there's already an issue to track that work: https://github.com/apache/arrow/issues/43855.

As a workaround until then, you can call .bytes on each uuid.UUID object, and PyArrow will then infer the column type as binary:

In [1]: import pandas as pd
   ...: import uuid
   ...: import pyarrow as pa
   ...:
   ...: # Create a DataFrame with UUID objects
   ...: data = {'MUID': [uuid.uuid4().bytes for _ in range(5)],  # Note: .bytes called on each UUID
   ...:         'Data': range(5)}
   ...:
   ...: df = pd.DataFrame(data)
   ...:
   ...: # Convert the DataFrame to an Arrow table
   ...: pa.Table.from_pandas(df)
Out[1]:
pyarrow.Table
MUID: binary
Data: int64
----
MUID: [[D3C9E28D1AF14833A765F3389F6E9CEF,0F75A9DAFECF438692840042DEDD4B7F,C25544BA12DD4EBC8701FD1178A502B7,CE2CDBF58BA4454CAED6BF0F54886BC4,E4866E83863240EF81B68129B5BB186D]]
Data: [[0,1,2,3,4]]