apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Bug Fix: Position Deletes + row_filter yields less data when the DataFile is large #1141

Closed sungwy closed 4 days ago

sungwy commented 3 weeks ago

Fixes: #1132

The culprit was the awkward zero-copy round trip between RecordBatch and Table used to filter by pyarrow_filter.

https://github.com/apache/iceberg-python/blob/0dc54080aa287dc8e920da128d7f4b335965f1df/pyiceberg/io/pyarrow.py#L1246-L1254

As noted, this round trip is necessary because passing an expression filter to a RecordBatch hasn't been exposed in the PyArrow versions we support (it is available in 17.0.0, but requiring it would raise the lower bound of the PyArrow dependency considerably).

When a single RecordBatch is converted to a Table and Arrow automatically chunks that Table back into multiple RecordBatches, we would lose every RecordBatch after the first in the returned output. Essentially, there is no guarantee that a single RecordBatch remains a single RecordBatch after a round-trip conversion to an Arrow Table and back.

The new test covers a case where the amount of data in a DataFile is large enough that the Arrow Table yields multiple RecordBatches.
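
A minimal sketch of the pattern, for illustration only (the column names, data, and filter here are made up; the real code builds the filter from the scan's row_filter):

    import pyarrow as pa
    import pyarrow.compute as pc

    # Stand-ins: one RecordBatch as read from a DataFile, and a row filter
    # already translated into a pyarrow compute Expression.
    batch = pa.RecordBatch.from_pydict({"id": list(range(10)), "value": ["x"] * 10})
    pyarrow_filter = pc.field("id") >= 5

    # RecordBatch.filter() does not accept an Expression in older PyArrow
    # releases, so the batch is wrapped in a Table, filtered, and converted back.
    filtered = pa.Table.from_batches([batch]).filter(pyarrow_filter)

    # Buggy pattern: assumes the filtered Table converts back to exactly one
    # RecordBatch; anything after index 0 is silently dropped if Arrow re-chunks.
    first_only = filtered.to_batches()[0]

    # Safer pattern: keep every RecordBatch the Table produces.
    all_batches = filtered.to_batches()
    assert sum(b.num_rows for b in all_batches) == filtered.num_rows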

kevinjqliu commented 2 weeks ago

When a single RecordBatch is converted to a Table and Arrow automatically chunks that Table back into multiple RecordBatches, we would lose every RecordBatch after the first in the returned output. Essentially, there is no guarantee that a single RecordBatch remains a single RecordBatch after a round-trip conversion to an Arrow Table and back.

This is specifically about this piece of code?

     batch = arrow_table.to_batches()[0] 
sungwy commented 2 weeks ago

When a single RecordBatch is converted to a Table and Arrow automatically chunks that Table back into multiple RecordBatches, we would lose every RecordBatch after the first in the returned output. Essentially, there is no guarantee that a single RecordBatch remains a single RecordBatch after a round-trip conversion to an Arrow Table and back.

This is specifically about this piece of code?

     batch = arrow_table.to_batches()[0] 

Yes, that's right. This behavior was hard to understand and test until we found users with data large enough to actually run into the issue.

In retrospect, assuming the list will always have length 1 was erroneous, but it wasn't easy to anticipate that a single RecordBatch would yield multiple RecordBatches in a zero-copy round-trip conversion to an Arrow Table and back.
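
For reference, a rough sketch of the corrected shape, using a hypothetical helper name (not the exact code in pyiceberg/io/pyarrow.py):

    import pyarrow as pa

    def _filter_batch(batch: pa.RecordBatch, pyarrow_filter) -> list[pa.RecordBatch]:
        # Hypothetical helper: filter one RecordBatch via a Table round trip and
        # return every resulting RecordBatch instead of assuming there is only one.
        filtered = pa.Table.from_batches([batch]).filter(pyarrow_filter)
        return filtered.to_batches()

Callers then iterate over the returned list rather than indexing element 0.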