Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

faulty reading of hudi table after it has been altered #2941

Open sephib opened 2 months ago

sephib commented 2 months ago

Describe the bug

General description: while reading data from a Hudi table with `daft.read_hudi()` we get an error caused by a mismatch of the columns.

Setup

  1. Our Hudi COW table is hosted on S3.
  2. After the initial setup, we modified the table by updating the Avro schema (.avsc file) and adding two new string columns in between our existing columns schema. For new data, these columns started populating correctly, while for existing rows, these new columns are null.

When running

```python
dfd = daft.read_hudi('s3://path/to/hudi')
dfd.columns   # this is OK and returns the correct column names
dfd.schema()  # this is OK and returns the correct schema

dfd.show()    # this raises an error:
# ArrowTypeError: Expected XXXX, got a YYYY object

# Or, on a different altered Hudi table:
# ArrowInvalid: Could not convert 'Florida' with type str: tried to convert to double
```

In both cases the type it tries to use is the one from before the table was altered.

When reading a hudi table that has not been altered there is no problem.

We are using

  * `getdaft==0.3.3`
  * `hoodie.table.version=5`

Any suggestions?

jaychia commented 2 months ago

Hello! This might be a pyhudi error -- cc @xushiyan from the Hudi team for any thoughts

We are currently awaiting the Hudi team's implementation of Hudi-rs which would give us more robust support for Hudi

sephib commented 1 month ago

Just adding additional context: it seems to be an Avro vs. Arrow issue. When trying to use hudi-rs we also get an error:

ArrowInvalid: Schema at index X was different

This is an example of what we are running:

```python
from hudi import HudiTable  # pip install hudi
import pyarrow as pa

hudi_path = 's3://path/to/hudi/table'
hudi_table = HudiTable(hudi_path)
records = hudi_table.read_snapshot()
arrow_table = pa.Table.from_batches(records)  # raises: schemas differ across batches
```

The schemas of the returned record batches are not the same.

jaychia commented 1 month ago

Ok, yeah, this might be a Hudi issue in general then. Do you mind filing an issue against hudi-rs and linking it here, @sephib?