apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
826 stars 164 forks source link

`TopK` operator (i.e. `CometTakeOrderedAndProjectExec`) may return incorrect result #1030

Closed viirya closed 1 month ago

viirya commented 1 month ago

Describe the bug

We found an interesting bug recently. For some cases, Dataset.show and Dataset.collectAsList return different results. We investigated the bug and found that it is due to the implementation oftake_bytes.

In the cases, Comet reads a dictionary array of string. It unpacks dictionary array to string array. In a query where TopK operator is used, the operator will store input arrays into internal store and emit after all inputs are consumed. In Comet, the output arrays from scan reuse same buffers across batches. For operators that cache input arrays, Comet will do deep copy on these arrays.

However, when unpacking dictionary array to string array by calling take_bytes, if the indices array has no null, take_bytes kernel simply takes a full slice of the null buffer of indices (i.e., reusing it) as the null buffer of output array. So in the next batch, once the null buffer is updated (as Comet reuses underlying buffer), the stored array in TopK operator is also changed. It makes the query result indeterministic.

Consider the semantics of take kernel, its output array should not reuse input array. The current behavior looks incorrect.

We are going to fix it at the arrow-rs: https://github.com/apache/arrow-rs/issues/6617

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

viirya commented 1 month ago

The bug is at arrow-rs, but we can add a test in Comet.

tustvold commented 1 month ago

as Comet reuses underlying buffer

Mutating a buffer without exclusive ownership is UB, Comet should be using https://docs.rs/arrow-buffer/latest/arrow_buffer/buffer/struct.Buffer.html#method.into_mutable or similar if it wishes to do this safely