facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.51k stars 1.15k forks

Different result of count(distinct x) from Presto on NaN inputs #9159

Open kagamiori opened 7 months ago

kagamiori commented 7 months ago

Bug description

When the input of count(distinct x) contains multiple NaNs, Velox treats every NaN as a distinct value, while Presto treats all NaNs as duplicates of one another.

Velox:

TEST_F(CountAggregationTest, distinct) {
  auto nan = std::numeric_limits<double>::quiet_NaN();
  auto data = makeRowVector({
      makeFlatVector<double>({1.1, nan, nan, nan, nan, nan, nan, nan}),
  });
  createDuckDbTable({data});

  // Global aggregation.
  auto testGlobal = [&](const std::string& input) {
    auto plan =
        PlanBuilder()
            .values({data})
            .singleAggregation({}, {fmt::format("count(distinct {})", input)})
            .planNode();
    AssertQueryBuilder(plan, duckDbQueryRunner_)
        .assertResults(
            fmt::format("SELECT count(distinct {}) FROM tmp", input));
  };

  testGlobal("c0"); // Velox result is 8.
}

Presto:

SELECT
    COUNT(DISTINCT c0)
FROM (
    VALUES
        (1.1),
        (NAN()),
        (NAN())
) t(c0); -- Presto result is 2
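
For illustration, here is a minimal standalone sketch of the two semantics. It is not Velox code: `countDistinct` and its `nanEqual` flag are hypothetical names made up for this example, and the quadratic scan only keeps the comparison rule visible. It assumes the difference comes down to whether NaN compares equal to NaN. Under IEEE 754, `nan == nan` is false, so a distinct count based on plain equality treats each NaN as new.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not a Velox API.
// Counts distinct values with a simple linear scan over the values seen so far.
// nanEqual == false: plain IEEE 754 equality, so no NaN ever matches another NaN.
// nanEqual == true: all NaNs are treated as one value.
std::size_t countDistinct(const std::vector<double>& values, bool nanEqual) {
  std::vector<double> seen;
  for (double v : values) {
    bool found = false;
    for (double s : seen) {
      if (s == v || (nanEqual && std::isnan(s) && std::isnan(v))) {
        found = true;
        break;
      }
    }
    if (!found) {
      seen.push_back(v);
    }
  }
  return seen.size();
}
```

With the input from the Velox test above (1.1 followed by seven NaNs), `countDistinct(values, false)` returns 8 and `countDistinct(values, true)` returns 2, matching the Velox and Presto results reported here.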

System information

N/A

Relevant logs

No response

kagamiori commented 7 months ago

cc @mbasmanova