delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Creating checkpoints for tables with missing column stats results in Err #2493

Closed: shanisolomon closed this issue 2 months ago

shanisolomon commented 4 months ago

Delta-rs version: 0.16.5


Bug

When trying to create a checkpoint (using the checkpoints::create_checkpoint API) on a Delta table with more than 32 columns whose transaction log was written by Spark, which by default only includes stats for the first 32 columns rather than for all of them, we get the following Err: Failed to convert into Arrow schema: Json error: whilst decoding field 'add': whilst decoding field 'stats_parsed': whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: <child>.
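For context, the failing call is just ordinary checkpoint creation on an existing table. A minimal sketch of what that looks like, assuming the deltalake Rust crate around 0.16.x, a tokio runtime, and an illustrative table path (the exact module path of create_checkpoint may differ between versions):

```rust
use deltalake::checkpoints::create_checkpoint;
use deltalake::open_table;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a table whose latest log entries were written by Spark and
    // therefore carry stats for only the first 32 columns.
    let table = open_table("s3://bucket/path/to/wide_table").await?;

    // This is the call that returns the Arrow JSON decode error quoted above.
    create_checkpoint(&table).await?;
    Ok(())
}
```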

I suspect it's either a bug in the arrow-json package, which for some reason receives null positions for the overflowing columns when decoding the transaction log statistics, or a bug in the 'add' action JSON created by delta-rs during checkpointing, where the schema contains more than 32 columns but the 'stats_parsed' JSON does not have a corresponding value for every column.
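To illustrate the shape of the problem, here is a rough, self-contained sketch (not delta-rs code; the field names are invented, and it assumes the current ReaderBuilder-based arrow-json API) of how decoding JSON that omits a value for a non-nullable struct child can surface the same class of error, since the stats schema is derived from the full table schema while the stats themselves cover only 32 columns:

```rust
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() {
    // "minValues" declared with a non-nullable child, mimicking a stats_parsed
    // schema derived from a table whose columns are non-nullable.
    let min_values = Field::new(
        "minValues",
        DataType::Struct(Fields::from(vec![
            Field::new("col_a", DataType::Int64, false), // non-nullable child
        ])),
        true,
    );
    let schema = Arc::new(Schema::new(vec![min_values]));

    // The JSON row carries no value for col_a, like stats written for only
    // the first 32 columns of a wider table.
    let json = br#"{"minValues": {}}"#;

    let mut reader = ReaderBuilder::new(schema).build(&json[..]).unwrap();
    // Expected to fail with an error along the lines of:
    // "Encountered unmasked nulls in non-nullable StructArray child"
    println!("{:?}", reader.next());
}
```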

What you expected to happen: I expect to be able to construct the Arrow JSON schema when stats are not present for all columns, and more broadly, to be able to create a checkpoint file using the delta-rs library after Spark has optimized the table.

How to reproduce it: A table with more than 32 columns on which the Spark engine has run an OPTIMIZE transaction, which doesn't include stats for all fields. The delta log itself is enough to reproduce the issue. I can provide example files if needed.

More details: If this is intended behavior and not a bug, please let me know. Thanks in advance!

sherlockbeard commented 2 months ago

resolved in #2675 @ion-elgreco