apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

approx_percentile_cont panics because array is not ordered #4259

Open the-mousaillon opened 1 year ago

the-mousaillon commented 1 year ago

Describe the bug approx_quantile_cont panics, complaining that the input to TDigest is not ordered: panicked at 'unsorted input to TDigest'

I have done some digging to understand what happens and it seams to have something to do with a corruption of indexes whithin the arrow Array.

I added a simple check to see if the array is ordered, and if not to print it in the update_batch function

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];
        let v = downcast_value!(values, Int64Array);
        let sorted_values = &arrow::compute::sort(values, None)?;
        let a_down = downcast_value!(sorted_values, Int64Array);

        let mut is_sorted = true;
        for i in 0..sorted_values.len() -1 {
            if a_down.value(i) > a_down.value(i+1) {
                is_sorted = false;
                break;
            }
        }
        if !is_sorted {
           for i in 0..sorted_values.len() {
            println!("s: {}, b: {}", a_down.value(i), v.value(i));
        }
        }
        let sorted_values =
            ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
        self.digest = self.digest.merge_sorted_f64(&sorted_values);
        Ok(())
    }

The output is surprising, we see that in seemingly every case, the first value of the sorted array "s" should be at the end of the array. For intance on this array : image

The weird thing is that if I recreate the array, the sort works properly and the panic goes away.

    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let values = &values[0];
        # If I just recreate the array before the sorting, the problem goes away
        let v = downcast_value!(values, Int64Array);
        let values: Int64Array = v.iter().map(|d| d.clone()).collect();
        let values = Arc::new(values) as ArrayRef;
        let values = &values;
        let sorted_values =
            ApproxPercentileAccumulator::convert_to_ordered_float(sorted_values)?;
        self.digest = self.digest.merge_sorted_f64(&sorted_values);
        Ok(())
}

This makes me think that their may be some kind of index corruption within the array buffer.

I had this bug while performing an approx_percentile_cont on a GROUP BY, whithout the GROUP BY it works fine.

To Reproduce Steps to reproduce:

  1. Download the parquet file: percentile_cont_bug.zip

  2. Execute this function

fn percentile_cont_bug() ->Result<(), ()>{
    let ctx = SessionContext::new();
    let pq_path = "...path to the provided parquet file";
    ctx.register_parquet("example", pq_path, ParquetReadOptions::default())
        .await
        .map_err(|e|())?;

    let q = format!("
        SELECT 
            country,
            approx_median(age) as median_age,
        FROM example
        group by 
            country
    ");
    let df = ctx.sql(&q).await
    .map_err(|e| println!("err: {:?}", e))?;
    // execute and print results
    df.show().await;
    Ok(())
}

Additional context datafusion version: 14.0.0 arrow version: 26.0 platform: WSL (ubuntu)

the-mousaillon commented 1 year ago

Also, I only have this issue if I use approx_percentile_cont on an INT64 column

HaoYang670 commented 1 year ago

Thank you @the-mousaillon for reporting the bug. We will debug it carefully. Also, welcome to contribute your fix to Datafusion!

HaoYang670 commented 1 year ago

I can reproduce the bug printed here.

HaoYang670 commented 1 year ago

cc @tustvold, could you please take a look?

tustvold commented 1 year ago

I can add it to my list, although I'm not familiar with the code in question. Perhaps @domodwyer or @crepererum may have some ideas

Edit: does the array contain nulls?

HaoYang670 commented 1 year ago

The age column contains nulls

tustvold commented 1 year ago

Without having spent any time looking into the code, yet, my hunch would be that the accumulator is handling nulls incorrectly, and the s: 72 "value" is actually a null.

HaoYang670 commented 1 year ago

I've tested again and @tustvold I guess you are right.

HaoYang670 commented 1 year ago

The ApproxPercentileAccumulator::convert_to_float seems to ignore the null buffer when casting an array to vec<f64>

domodwyer commented 1 year ago

Yeah I don't remember doing anything with NULL masks when I wrote this.

HaoYang670 commented 1 year ago

Hi @domodwyer, could you please to fix this?

HaoYang670 commented 1 year ago

BTW, the estimate quantile algorithm doesn't follow the paper, any reason for this? https://github.com/apache/arrow-datafusion/blob/df8aa7a2e2a6f54acfbfed336b84144256fb7ff8/datafusion/physical-expr/src/aggregate/tdigest.rs#L523-L524

image

domodwyer commented 1 year ago

Hi @HaoYang670,

I do not have time to fix this in the short term as I am working on a project. I can take a look in a few weeks - feel free to fix this yourself if you need it sooner :+1: