Closed kagamiori closed 4 months ago
I found that Velox and Presto got different hash values on the same input
Would you share the input value and Velox and Presto hashes for that value?
I found that Velox and Presto got different hash values on the same input
Would you share the input value and Velox and Presto hashes for that value?
Yes, for example, for the input value 219716167
, Velox gets a hash value 1954281567766450499
(uint64_t), while Presto gets a hash value -2225388838451473260
(long).
for the input value 219716167, Velox gets a hash value 1954281567766450499 (uint64_t), while Presto gets a hash value -2225388838451473260 (long).
Does this discrepancy affect other functions, i.e. checksum?
for the input value 219716167, Velox gets a hash value 1954281567766450499 (uint64_t), while Presto gets a hash value -2225388838451473260 (long).
Does this discrepancy affect other functions, i.e. checksum?
No, checksum uses a different API from the xxHash library (i.e., XXH64_round() vs. XXH64()). XXH64_round() implementation matches AbstractLongType::hash() in Presto, so the checksum function is good.
Update: When I change the input type in the Velox unit test from int32_t to int64_t, the result matches with Presto. It turns out that the length of the input data type participates the hash calculation. Velox uses the actual input type length. https://github.com/facebookincubator/velox/blob/7d76b1d9ee814c8911d849697071acccf3e18b0c/velox/functions/prestosql/aggregates/ApproxDistinctAggregate.cpp#L112-L115
On the other hand, Presto always uses the length of long
for all integer types, including tinyint, smallint, integer, bigint, timestamp, and date, plus the floating-point types. E.g.,
In addition, there is a special handling for timestamp_with_timezone type.
Discussed offline. Velox just uses a different hash function from Presto. This hash function sometimes performs better and sometimes performs worse than Presto. There is no clear evidence that one hash function is better than the other. So we won't fix Velox to align with Presto.
Description
Velox and Presto returns different results with the following query.
Velox:
Presto:
For this query, both Velox and Presto use dense HLL, so their results should match. During investigation, I found that Velox and Presto got different hash values on the same input, which leads to differences at subsequent calculations based on the hash values.
Error Reproduction
The mismatch can be reproduced via the unit test and query given above.
Relevant logs
No response