boostorg / container_hash

Generic hash function for STL style unordered containers
https://boost.org/libs/container_hash
29 stars 39 forks source link

Regression in boost::hash with version 1.81.0 #35

Open kkumar45 opened 10 months ago

kkumar45 commented 10 months ago

We are noticing regression with boost::hash when used with std::u16string in boost 1.81.0 version and compared to 1.78.0. As per the 1.81.0 documentation, it seems there are major changes in boost::hash and the performance with string should be improved. Any thoughts on this regression?

Sample program -

#include <boost/functional/hash.hpp>
#include <iostream>
#include <chrono>

int main()
{
    // Start the timer
    auto start = std::chrono::high_resolution_clock::now();

    for(int i = 0; i < 10; i++)
    {
        boost::hash<std::u16string>()(u"hello");
    }

    // End the timer
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate the duration
    std::chrono::duration<double> duration = end - start;

    // Print the duration in seconds
    std::cout << "Execution time: " << (duration.count())/10 << " seconds." << std::endl;

    return 0;
}

Link 1.78.0 - https://godbolt.org/z/ManWPoK8j Link 1.81.0 - https://godbolt.org/z/hejbsfhc6

pdimov commented 10 months ago

String performance is indeed improved, but only for narrow strings (e.g. std::string.) For char16_t, the generic sequence hash is still used.

The inner loop in 1.71 has three multiplications, one shift and two xors

.L9:
        movzx   eax, WORD PTR [rdx]
        add     rdx, 2
        imul    rax, r15
        mov     r9, rax
        shr     r9, 47
        xor     rax, r9
        imul    rax, r15
        xor     rax, r8
        imul    rax, r15
        lea     r8, [rax+r13]
        cmp     rsi, rdx
        jne     .L9

whereas the one in 1.81 has two multiplications, three shifts and three xors

.L9:
        movzx   r8d, WORD PTR [rdx]
        add     rdx, 2
        add     r8, r13
        add     r8, rax
        mov     rax, r8
        shr     rax, 32
        xor     rax, r8
        imul    rax, r15
        mov     r8, rax
        shr     r8, 32
        xor     rax, r8
        imul    rax, r15
        mov     r8, rax
        shr     r8, 28
        xor     rax, r8
        cmp     rsi, rdx
        jne     .L9

Which one is better is highly dependent on the specific CPU architecture. It looks like the latest x86 CPUs have a very fast multiplication, so the 1.78 implementation wins; but, if it's any consolation, the newer one is more theoretically sound and has better hash quality.

kkumar45 commented 10 months ago

@pdimov Are there any plan to address it, anytime soon?