RedSpah / xxhash_cpp

Port of the xxhash library to C++17.
BSD 2-Clause "Simplified" License
163 stars 31 forks source link

Possible hashing inconsistency #4

Closed foolnotion closed 5 years ago

foolnotion commented 5 years ago

Hi,

Apologies in advance for the vague description. I am not really sure what's going on so I'll just do my best to describe my usage scenario:

Somewhere in my program I am building a std::vector<uint64_t>:

        std::vector<uint64_t> hashes;
        std::transform(symbols.begin() + i - size, symbols.begin() + i, std::back_inserter(hashes), [](const Symbol& sm) { return sm.CalculatedHashValue; }); 
        hashes.push_back(s.HashValue);
        auto hash = xxh::xxhash<64>(hashes);
        fmt::print("hash(");
        for (auto h : hashes) { fmt::print("{} ", h); }
        fmt::print(") = {}\n", hash);
        s.CalculatedHashValue = hash;        

This outputs:

hash(123 456 0) = 14523173615704738576

I wanted to debug this result so in main.cpp I did:

    std::array<uint64_t, 3> arr { 123, 456, 0 };
    std::cout << "array hash = " << xxh::xxhash<64>(arr, 0) << std::endl;

This outputs:

array hash = 13763445824703203362

As you can see there's an inconsistency even though the values are the same.

The next thing I added another test in my main.cpp:

    std::vector<uint64_t> vec { 123, 456, 0 };
    std::cout << "vector hash = " << xxh::xxhash<64>(vec) << std::endl;

And now the weird part. After adding the above two lines, I get:

hash(123 456 0  ) = 13763445824703203362
array hash = 13763445824703203362
vector hash = 13763445824703203362

Now the hashes are consistent as they should be. Could this be a bug?

EDIT: The problem seems to go away if I manually define the endianess before including xxhash.hpp:

#define XXH_CPU_LITTLE_ENDIAN 1 

Best, Bogdan

RedSpah commented 5 years ago

Can't reproduce.

Providing the entire relevant part of the file (as a gist for example) could help a lot in troubleshooting, as well as what compiler / platform you're using (since I'm getting different hashes than both of your results and that's enough of a "uh oh" thing in itself).

RedSpah commented 5 years ago

For now, please report what hash results do you get when defining #define XXH_FORCE_MEMORY_ACCESS 0 before including xxhash.hpp.

foolnotion commented 5 years ago

#define XXH_FORCE_MEMORY_ACCESS 0 does not change anything:

hash(123 456 0) = 14523173615704738576
array hash = 13763445824703203362

I am using arch linux in the windows WSL. Tested with gcc-8.2 and clang-7.0.0, same results.

I've made a gist with my files: https://gist.github.com/foolnotion/46b2d9d54f2939487c4805b7830f13bc#file-sentence-cpp-L114

Basically, sentence.cpp line 114 is where the hashing happens. Then in main.cpp I just test again for the same values.

Thanks.

RedSpah commented 5 years ago

Alright, the bug you encountered is now solved. Copying a "solution" from StackOverflow instead of actually investigating the rules of lifetimes of static objects in C++ was bound to bite me in the ass one day anyway.

The bigger bug, however is that your hash values and mine don't match. Which is a very bad, very big issue for a hash algorithm to have.

I will investigate into it tomorrow, set up Travis CI and see whether the discrepancy is compiler dependent or caused by something else.

I'm sorry for any difficulties this mess has caused.

foolnotion commented 5 years ago

Thanks for taking the time to investigate this. Indeed, not getting the same hash value for identical input is a bit worrisome.

I've tested on 2 machines that have identical WSL environments but different hardware.

In all cases I get the same hash value xxhash({123, 456, 0}) = 13763445824703203362. I will be happy to help you test different things.

RedSpah commented 5 years ago

Alright, I believe I found the result of the inconsistency. And, unsurprisingly, it turned out to be just a typo.

#if defined(_MSC_VER)
        static inline uint32_t rotl32(uint32_t x, int32_t r) { return _rotl(x, r); }
        static inline uint64_t rotl64(uint64_t x, int32_t r) { return _rotl64(x, r); }
#else
        static inline uint32_t rotl32(uint32_t x, int32_t r) { return ((x << r) | (x >> (32 - r))); }
        static inline uint32_t rotl64(uint64_t x, int32_t r) { return ((x << r) | (x >> (64 - r))); }
#endif

64-bit rotl on non-msvc compilers accidentally returned only a 32 bit value. After fixing this, the results now seem to be identical.

The upside is that I now have a CI set up, so it'll be easier to catch these sorts of things in the future.

foolnotion commented 5 years ago

That's great. Just for the record the new hash value is 7271121650880935067. If that's identical with what you have I think you can close this bug.

RedSpah commented 5 years ago

Bug fixed, closing.