delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Pyarrow ValueError: all columns in a record batch must have the same length #1981

Closed spretto closed 4 weeks ago

spretto commented 9 months ago

Environment

Delta-rs version: python-v0.14.0

Binding: Python

Environment:


Bug

What happened: A `ValueError` is raised when trying to run:

```python
dt = DeltaTable("/path/to/table")
dt.to_pyarrow_dataset()
```

Traceback:

```
Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/.local/lib/python3.9/site-packages/deltalake/table.py", line 866, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "/home/user/.local/lib/python3.9/site-packages/deltalake/table.py", line 809, in to_pyarrow_dataset
    file_sizes = self.get_add_actions().to_pydict()
  File "/home/user/.local/lib/python3.9/site-packages/deltalake/table.py", line 964, in get_add_actions
    return self._table.get_add_actions(flatten)
ValueError: all columns in a record batch must have the same length
```

What you expected to happen: Expecting a pyarrow dataset

More details: This used to work fine before updating pyspark, the Delta Spark jar, and the delta-rs package.

ion-elgreco commented 9 months ago

We need some more details to be able to help here. Can you provide a minimal reproducible example?

spretto commented 9 months ago

I want to provide a minimal reproducible example, but it seems to be specific to my delta table. After writing new rows to it with pyspark 3.5.0 and the delta jar "io.delta:delta-spark_2.12:3.0.0", I can't read it anymore with pyarrow. Other delta tables are still fine.

ion-elgreco commented 9 months ago

@spretto what kind of write action did you do?

spretto commented 9 months ago

I added new rows using the existing schema. I can see these new rows in Spark with the correct partitions, but I get this error when I try to load them as a pyarrow dataset. I get the same error when I try to load old partitions as well (any part of the delta table).
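Since the traceback points at `get_add_actions()`, one way to narrow this down is to inspect the JSON commits under `_delta_log/` directly. The sketch below (an assumption, not the delta-rs implementation; the helper name `partition_key_sets` is hypothetical) collects the set of `partitionValues` keys used by each `add` action, since files disagreeing on partition columns is one plausible way the assembled record batch could end up with columns of different lengths:

```python
# Hedged sketch: scan the _delta_log JSON commits for "add" actions and
# collect the distinct sets of partitionValues keys they use. More than
# one key set would mean the files disagree on partition columns.
import json
import pathlib

def partition_key_sets(log_dir: str) -> set:
    """Return the distinct frozensets of partitionValues keys across all add actions."""
    keysets = set()
    for commit in sorted(pathlib.Path(log_dir).glob("*.json")):
        # Each commit file is newline-delimited JSON, one action per line.
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                keysets.add(frozenset(action["add"].get("partitionValues", {})))
    return keysets
```

If this returns more than one key set for the affected table but exactly one for the tables that still read fine, that would be a strong hint about what the Spark write changed.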

ion-elgreco commented 9 months ago

Can you try to reproduce it with the smallest possible sample table, and then share the table and transaction log?
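For sharing the transaction log, a small stdlib helper like the sketch below can bundle everything under `_delta_log/` into a zip archive suitable for attaching to the issue (the function name `bundle_delta_log` and the paths are illustrative assumptions, not part of the deltalake API):

```python
# Hedged sketch: zip every file under <table>/_delta_log so the
# transaction log can be attached to a GitHub issue in one archive.
import pathlib
import zipfile

def bundle_delta_log(table_path: str, out_zip: str) -> list:
    """Archive the table's _delta_log directory; return the archived file names."""
    log_dir = pathlib.Path(table_path) / "_delta_log"
    archived = []
    with zipfile.ZipFile(out_zip, "w") as zf:
        for f in sorted(log_dir.glob("*")):
            if f.is_file():
                # Preserve the _delta_log/ prefix inside the archive.
                zf.write(f, arcname="_delta_log/" + f.name)
                archived.append(f.name)
    return archived
```

The JSON commits are small text files, so the whole log usually compresses to a few kilobytes even for tables with many data files.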