apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Java Arrow RecordBatch might have logical row count which is not same as physical row count in the arrays #973

Closed viirya closed 1 month ago

viirya commented 1 month ago

Describe the bug

Integrating Comet with Iceberg internally gets the following error if there are deleted rows in the Iceberg table:

```
org.apache.comet.CometNativeException: Invalid argument error: all columns in a record batch must have the specified row count
```

This happens because Iceberg stores row mappings in its CometVector implementations and uses them to skip deleted rows while iterating over a batch. The row values are not actually removed from the arrays. The Iceberg batch reader sets the row count of a returned record batch to the "logical" number of rows remaining after deletion. Java Arrow accepts this.

However, arrow-rs performs a stricter check that every array's length matches the batch's row-count parameter. When it detects an inconsistency, an error like the one above is thrown.
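A minimal sketch of the mismatch (not arrow-rs's actual code): the function below mimics the kind of row-count validation that arrow-rs applies when constructing a RecordBatch, and shows why a "logical" row count smaller than the physical array lengths is rejected.

```rust
// Hypothetical stand-in for the check arrow-rs performs on RecordBatch
// construction: every column's length must equal the declared row count.
fn validate_row_count(column_lens: &[usize], row_count: usize) -> Result<(), String> {
    if column_lens.iter().any(|&len| len != row_count) {
        return Err(
            "Invalid argument error: all columns in a record batch \
             must have the specified row count"
                .to_string(),
        );
    }
    Ok(())
}

fn main() {
    // Physical arrays still hold 4 values each; after one deleted row,
    // Iceberg's reader reports a "logical" row count of 3.
    let physical_lens = [4, 4];
    let logical_row_count = 3;

    // Java Arrow tolerates this mismatch; the arrow-rs-style check does not.
    assert!(validate_row_count(&physical_lens, logical_row_count).is_err());

    // With the physical row count, the batch passes the check.
    assert!(validate_row_count(&physical_lens, 4).is_ok());
}
```

In other words, to interoperate with arrow-rs, batches crossing the JNI boundary would need their deleted rows materialized out of the arrays (or the row count set to the physical length), rather than relying on a logical row count plus a row mapping.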

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response