facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.48k stars 1.14k forks source link

Window fuzzer test is flaky in nightly runs #11213

Open kagamiori opened 2 weeks ago

kagamiori commented 2 weeks ago

Description

Example failure: https://github.com/facebookincubator/velox/actions/runs/11247502907/job/31271630633

Error Reproduction

No response

Relevant logs

No response

kevinwilfong commented 2 weeks ago

I noticed this on some of my PRs as well, I was wondering if it my be an issue exposed by https://github.com/facebookincubator/velox/pull/10006 with NaNs in the off1 column

kagamiori commented 2 weeks ago

Update: I reproduced the error with this command: velox_window_fuzzer_test --enable_window_reference_verification --presto_url=http://127.0.0.1:8080 --logtostderr=1 --minloglevel=0 --duration_sec=360 --num_batches=1 --batch_size=30 --only=stddev_samp --seed=2356821713 --duration_sec=0 --steps=1 -v=1.

This iteration runs Window[1][partition by [p0] order by [s0 DESC NULLS FIRST] w0 := stddev_samp(ROW["c0"]) RANGE between off0 FOLLOWING and off1 FOLLOWING]. There is one partition containing three input rows with the exact same values of sorting key, frame start, and frame end, i.e., they are peer rows and should all return the same window result. I found that WindowPartition::computeKRangeFrameBounds() doesn't handle NaN correctly and produces incorrect frame bounds. This further makes AggregateWindowFunction::analyzeFrameValues() to incorrectly consider the frames in this partition to be incremental and calls incrementalAggregation().

I'm continuing investigating.

Example queries to reproduce the mismatch:

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 asc
        RANGE BETWEEN 0.0 PRECEDING AND nan() PRECEDING 
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- different results

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 asc
        RANGE BETWEEN nan() PRECEDING AND 0.0 PRECEDING 
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- Presto throws while Prestissimo doesn't

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 desc
        RANGE BETWEEN 0.0 FOLLOWING  AND nan() FOLLOWING  
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- Presto throws while Prestissimo doesn't

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 desc
        RANGE BETWEEN nan() FOLLOWING  AND 0.0 FOLLOWING  
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- different results
kagamiori commented 1 week ago

Both Presto and Velox results of the queries above are incorrect. https://github.com/prestodb/presto/issues/23822

kagamiori commented 1 week ago

Hi @aditi-pandit and @pramodsatya, the recent window fuzzer enhancement of k-range frames found this bug in both Presto and Velox. We're considering making window operations return NULL when any of the frame bound is NaN (https://github.com/prestodb/presto/issues/23822#issuecomment-2411843864). Please let us know if you have any thoughts. Thanks!

pramodsatya commented 1 week ago

Hi @aditi-pandit and @pramodsatya, the recent window fuzzer enhancement of k-range frames found this bug in both Presto and Velox. We're considering making window operations return NULL when any of the frame bound is NaN (prestodb/presto#23822 (comment)). Please let us know if you have any thoughts. Thanks!

Thanks @kagamiori, returning null for Nan frame bounds would be better.