Open the-mousaillon opened 1 year ago
Also, I only have this issue if I use approx_percentile_cont on an INT64 column
Thank you @the-mousaillon for reporting the bug. We will debug it carefully. You are also welcome to contribute a fix to DataFusion!
I can reproduce the bug printed here.
cc @tustvold, could you please take a look?
I can add it to my list, although I'm not familiar with the code in question. Perhaps @domodwyer or @crepererum may have some ideas
Edit: does the array contain nulls?
The age column contains nulls.
Without having spent any time looking into the code yet, my hunch would be that the accumulator is handling nulls incorrectly, and the s: 72 "value" is actually a null.
I've tested again and @tustvold I guess you are right.
ApproxPercentileAccumulator::convert_to_float seems to ignore the null buffer when casting an array to Vec<f64>.
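To illustrate the suspected failure mode, here is a minimal sketch using a simplified stand-in for an Arrow array: a values buffer plus a validity mask. The `Int64Column` type and both helper functions are hypothetical and are not the arrow or DataFusion API. They only model the difference between casting every slot and skipping null slots.

```rust
/// Simplified, hypothetical model of a nullable Int64 array:
/// `values` holds one slot per row, `validity[i] == false` marks row i as null.
/// The content of a null slot is arbitrary and must not be interpreted.
struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

/// Models the suspected bug: every slot is cast to f64, so the
/// placeholder stored in a null slot leaks into the output.
fn convert_ignoring_nulls(col: &Int64Column) -> Vec<f64> {
    col.values.iter().map(|v| *v as f64).collect()
}

/// Models a possible fix: consult the validity mask and keep only
/// the slots that actually hold a value.
fn convert_skipping_nulls(col: &Int64Column) -> Vec<f64> {
    col.values
        .iter()
        .zip(col.validity.iter())
        .filter(|(_, valid)| **valid)
        .map(|(v, _)| *v as f64)
        .collect()
}
```

With values `[10, 20, 0, 30]` and the third slot marked null, the first function returns four numbers, including the meaningless `0.0`, while the second returns only the three real values. If the placeholder sits at either end of an otherwise sorted batch, it can make the batch look unsorted.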
Yeah I don't remember doing anything with NULL masks when I wrote this.
Hi @domodwyer, could you please fix this?
BTW, the estimate quantile algorithm doesn't follow the paper, any reason for this?
https://github.com/apache/arrow-datafusion/blob/df8aa7a2e2a6f54acfbfed336b84144256fb7ff8/datafusion/physical-expr/src/aggregate/tdigest.rs#L523-L524
Hi @HaoYang670,
I do not have time to fix this in the short term as I am working on a project. I can take a look in a few weeks - feel free to fix this yourself if you need it sooner :+1:
Describe the bug
approx_percentile_cont panics, complaining that the input to TDigest is not ordered:
panicked at 'unsorted input to TDigest'
I have done some digging to understand what happens, and it seems to have something to do with a corruption of indexes within the arrow Array.
I added a simple check in the update_batch function that prints the array if it is not ordered.
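A check of this kind might look like the sketch below. This is not the reporter's actual code: `is_sorted` and `report_if_unsorted` are hypothetical helper names, and the input is assumed to already be a plain slice of `f64` values.

```rust
/// Returns true when `values` is in non-decreasing order.
/// Any pair involving NaN compares as false, so NaN counts as unsorted.
fn is_sorted(values: &[f64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Hypothetical debugging helper: prints the batch to stderr when it is
/// out of order and returns whether it was sorted.
fn report_if_unsorted(values: &[f64]) -> bool {
    let sorted = is_sorted(values);
    if !sorted {
        eprintln!("unsorted batch: {:?}", values);
    }
    sorted
}
```

Calling `report_if_unsorted` on each batch just before it is handed to the digest shows the offending values without changing behaviour.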
The output is surprising: in seemingly every case, the first value of the sorted array "s" should be at the end of the array. For instance, on this array: ![image](https://user-images.githubusercontent.com/36140579/202452429-3af4bf33-88c2-4580-a50f-4ee9b26dec28.png)
The weird thing is that if I recreate the array, the sort works properly and the panic goes away.
This makes me think that there may be some kind of index corruption within the array buffer.
I hit this bug while performing an approx_percentile_cont with a GROUP BY; without the GROUP BY it works fine.
To Reproduce
Steps to reproduce:
Download the parquet file: percentile_cont_bug.zip
Execute this function
Additional context
datafusion version: 14.0.0
arrow version: 26.0
platform: WSL (ubuntu)