facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.52k stars 1.15k forks source link

Different results of approx_distinct between Velox and Presto-Java #9761

Closed kagamiori closed 4 months ago

kagamiori commented 6 months ago

Description

Velox and Presto returns different results with the following query.

Velox:

TEST_F(ApproxDistinctTest, aaa) {
  auto input = makeRowVector({
      makeFlatVector<int32_t>(
          {219716167, 422847510, 489801358, 1958888195, 606515184, 851641245}),
      makeConstant<double>(0.18802535624105074, 6),
  });
  auto expected = makeRowVector({
      makeConstant<int64_t>(5, 1),
  });

  auto plan = PlanBuilder()
                  .values({input})
                  .singleAggregation({}, {"approx_distinct(c0, c1)"})
                  .planNode();
  AssertQueryBuilder builder(plan);
  builder.assertResults(expected);  // -- Velox actual result is 3.
}

Presto:

SELECT
    APPROX_DISTINCT(c0, c1)
FROM (
    VALUES
        (219716167, 0.18802535624105074),
        (422847510, 0.18802535624105074),
        (489801358, 0.18802535624105074),
        (1958888195, 0.18802535624105074),
        (606515184, 0.18802535624105074),
        (851641245, 0.18802535624105074)
) t(c0, c1);  -- Presto returns 5.

For this query, both Velox and Presto use dense HLL, so their results should match. During investigation, I found that Velox and Presto got different hash values on the same input, which leads to differences at subsequent calculations based on the hash values.

Error Reproduction

The mismatch can be reproduced via the unit test and query given above.

Relevant logs

No response

mbasmanova commented 6 months ago

I found that Velox and Presto got different hash values on the same input

Would you share the input value and Velox and Presto hashes for that value?

kagamiori commented 6 months ago

I found that Velox and Presto got different hash values on the same input

Would you share the input value and Velox and Presto hashes for that value?

Yes, for example, for the input value 219716167, Velox gets a hash value 1954281567766450499 (uint64_t), while Presto gets a hash value -2225388838451473260 (long).

mbasmanova commented 6 months ago

for the input value 219716167, Velox gets a hash value 1954281567766450499 (uint64_t), while Presto gets a hash value -2225388838451473260 (long).

Does this discrepancy affect other functions, i.e. checksum?

kagamiori commented 6 months ago

for the input value 219716167, Velox gets a hash value 1954281567766450499 (uint64_t), while Presto gets a hash value -2225388838451473260 (long).

Does this discrepancy affect other functions, i.e. checksum?

No, checksum uses a different API from the xxHash library (i.e., XXH64_round() vs. XXH64()). XXH64_round() implementation matches AbstractLongType::hash() in Presto, so the checksum function is good.

kagamiori commented 6 months ago

Update: When I change the input type in the Velox unit test from int32_t to int64_t, the result matches with Presto. It turns out that the length of the input data type participates the hash calculation. Velox uses the actual input type length. https://github.com/facebookincubator/velox/blob/7d76b1d9ee814c8911d849697071acccf3e18b0c/velox/functions/prestosql/aggregates/ApproxDistinctAggregate.cpp#L112-L115

On the other hand, Presto always uses the length of long for all integer types, including tinyint, smallint, integer, bigint, timestamp, and date, plus the floating-point types. E.g.,

In addition, there is a special handling for timestamp_with_timezone type.

kagamiori commented 4 months ago

Discussed offline. Velox just uses a different hash function from Presto. This hash function sometimes performs better and sometimes performs worse than Presto. There is no clear evidence that one hash function is better than the other. So we won't fix Velox to align with Presto.