Closed viirya closed 1 month ago
The bug is at arrow-rs, but we can add a test in Comet.
as Comet reuses underlying buffer
Mutating a buffer without exclusive ownership is UB, Comet should be using https://docs.rs/arrow-buffer/latest/arrow_buffer/buffer/struct.Buffer.html#method.into_mutable or similar if it wishes to do this safely
Describe the bug
We found an interesting bug recently. For some cases,
Dataset.show
andDataset.collectAsList
return different results. We investigated the bug and found that it is due to the implementation oftake_bytes
.In the cases, Comet reads a dictionary array of string. It unpacks dictionary array to string array. In a query where
TopK
operator is used, the operator will store input arrays into internal store and emit after all inputs are consumed. In Comet, the output arrays from scan reuse same buffers across batches. For operators that cache input arrays, Comet will do deep copy on these arrays.However, when unpacking dictionary array to string array by calling
take_bytes
, if the indices array has no null,take_bytes
kernel simply takes a full slice of the null buffer of indices (i.e., reusing it) as the null buffer of output array. So in the next batch, once the null buffer is updated (as Comet reuses underlying buffer), the stored array inTopK
operator is also changed. It makes the query result indeterministic.Consider the semantics of
take
kernel, its output array should not reuse input array. The current behavior looks incorrect.We are going to fix it at the arrow-rs: https://github.com/apache/arrow-rs/issues/6617
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response