dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

PyArrow backend cannot handle MySQL set type #2089

Open karakanb opened 1 day ago

karakanb commented 1 day ago

dlt version

1.4.0

Describe the problem

When ingesting MySQL tables that contain SET columns with the PyArrow backend, dlt fails to convert the rows to Arrow. The column values arrive as Python set objects, which PyArrow cannot convert:

  File "/path/dlt/common/libs/pyarrow.py", line 685, in row_tuples_to_arrow
    return pa.Table.from_pydict(columnar_known_types, schema=arrow_schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 1920, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 6153, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 398, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 358, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'set' object

The same problem exists for lists and dicts, but the existing code handles those by casting them to strings. Set values appear to have been missed.
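To illustrate the kind of handling described above, here is a minimal sketch that treats sets the same way as lists and dicts by serializing them to strings before the values reach Arrow. The helper names stringify_complex and stringify_column are hypothetical and are not part of dlt. The choice of JSON as the string format is also an assumption; the actual fix in dlt may differ.

```python
import json
from typing import Any, List


def stringify_complex(value: Any) -> Any:
    """Hypothetical helper: serialize containers to JSON text, pass scalars through."""
    if isinstance(value, (set, frozenset)):
        # Sets are unordered and not JSON-serializable, so sort them into a list first.
        # Assumes members are mutually comparable (MySQL SET members are strings).
        return json.dumps(sorted(value))
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value


def stringify_column(values: List[Any]) -> List[Any]:
    """Apply the conversion to every value of one column."""
    return [stringify_complex(v) for v in values]


# Values shaped like those a MySQL SET('0', '1') column yields in the traceback above.
col1 = [{"0"}, {"0", "1"}, set()]
print(stringify_column(col1))  # ['["0"]', '["0", "1"]', '[]']
```

Sorting the members gives a stable string for the same set regardless of iteration order, which keeps repeated loads of the same row identical.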

Expected behavior

MySQL set fields are correctly ingested.

Steps to reproduce

Ingest a table that contains a SET column using the pyarrow backend, for example:

create table test.some_table
(
    order_id int auto_increment primary key,
    col1     set ('0', '1') default '0' not null,
    col2     set ('1', '2') default '2' not null
);

Operating system

Linux, macOS, Windows

Runtime environment

Local

Python version

3.11

dlt data source

sql_table

dlt destination

Google BigQuery, DuckDB, Filesystem & buckets, Postgres, Amazon Redshift, Snowflake

Other deployment details

No response

Additional information

The only workaround at the moment is to not use the pyarrow backend. I am currently preparing a fix for this.