Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
8.93k stars 770 forks source link

Suggestions list for future evolutions #458

Open Cyan4973 opened 3 years ago

Cyan4973 commented 3 years ago

List updated June 30th 2023 :

Objectives for v0.8.2 - Completed

Objectives for v0.8.3

Objectives for v0.9.0

easyaspi314 commented 3 years ago

Another one I was thinking was an option to disable streaming.

The streaming API takes up a good chunk of the binary size:

Clang 10.0.1, aarch64, Termux

$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_stream.so
$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_nostream.so -DXXH_NO_STREAM
$ size -A *.so | grep -E '(:|Total)'
libxxh_nostream.so  :
Total                 7219
libxxh_stream.so  :
Total                 12848

Edit: -O3:

libxxh_nostream.so  :
Total                 21966
libxxh_stream.so  :
Total                 35067
easyaspi314 commented 3 years ago

Another cheat optimization is to use __attribute__((__pure__)) on as many xxHash functions as possible.

It says, roughly:

  1. This function has no side effects
  2. This function will return the same value given that the parameters, memory pointed to by the parameters, and global variables are not modified.
  3. This function can safely have repeated calls eliminated by the compiler as long as nothing else changes.

Basically, __pure__ is the magic behind the strlen optimization. (Excluding the second optimization where the compiler treats it as a built-in function and inlines/const props it)

const char *bad_strchr(const char *s, int c)
{
    for (size_t i = 0; i < strlen(s) /* EEK */; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Any decent compiler will change it to this:

const char *bad_strchr(const char *s, int c)
{
    const size_t len = strlen(s); // strlen is pure, we only need to call it once
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Although it is easiest to see with this code:

size_t strlenx2(const char *s)
{
     return strlen(s) + strlen(s);
}

Equivalent code with how Clang shuffles registers on x86_64:

// optimized
size_t strlenx2(const char *s)
{
    size_t len = strlen(s);
    len += len;
    return len;
}
// unoptimized
size_t strlenx2(const char *s)
{
    size_t tmp_len; // r14
    const char *tmp_s; // rbx
    tmp_s = s;
    size_t len = strlen(s);
    tmp_len = len;
    s = tmp_s;
    len = strlen(s);
    len += tmp_len;
    return len;
}

This could possibly improve performance on some hash tables depending on how they are used. Primarily thinking of code that looks like this:

    table[key].foo = "Foo";
    table[key].bar = "Bar";

Note that the compiler sometimes can figure this out on its own if xxHash is inlined, but this applies to both inline and extern functions.

I would have an option added to explicitly disable it for a fair benchmark.

Other ideas:

Cyan4973 commented 3 years ago

Yes, I like pure functions, so I'm all for it. Also note the existence of const functions in gcc, with an even stricter set of restrictions.

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers, that being said, I like them even if the only impact is to provide better code documentation. Also, in the context of library linking, this is an information that the linker can't guess from just the function signature, so it could end up being actually useful on the user side.

easyaspi314 commented 3 years ago

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers.

It appears that with XXH_INLINE_ALL, Clang and GCC can't tell that XXH3 is pure without the annotation, but it can figure out XXH32 and XXH64.

easyaspi314 commented 3 years ago

A few months ago, you mentioned how XXH3 is threadable. Obviously this would be an opt-in feature, as some programs like compilers (which I know at least Clang and GNU LD use XXH64) are designed to remain on a single thread to parallelize with make.

With some experimentation, it seems to be beneficial to spawn a second thread once you get to ~8-16MB.

On my phone (haven't tested on my PC yet because I have yet to master Windows threads), 6-8 MB seems to be the range where it is beneficial, with a max speed of 7.3 GB/s compared to 5.2 GB/s on one thread.

The implementation would be pretty simple; the most complicated thing here is dealing with the pthread struct and the accumulate loop which should probably be outlined to its own function to avoid copypasta.

I believe we can do a similar thing with CreateThread on Windows.

Note that this would technically conflict with the __attribute__((__pure__)) idea, although if we talk about the end effects, it will still be pure.

Code I was testing ```c #ifdef XXH_PTHREADS #include /* * A data block for pthreads. * XXX: Some of these fields can probably be removed. */ typedef struct { xxh_u64 acc[XXH_ACC_NB]; const xxh_u8 *input; const xxh_u8 *secret; size_t secretSize; size_t nbStripes; size_t block_len; size_t nb_blocks; XXH3_f_accumulate_512 f_acc512; XXH3_f_scrambleAcc f_scramble; } XXH3_pthreadData_t; /* * The length at which xxHash should spawn a pthread. * Spawning threads has significant overhead and is a waste * of time for short hashes. * * The most optimal value varies significantly on the platform. */ #ifndef XXH_PTHREADS_THRESHOLD # define XXH_PTHREADS_THRESHOLD (8 * 1048576U) /* 8 MiB */ #endif XXH_NO_INLINE void* XXH3_accumulate_pthread(void* d) { XXH3_pthreadData_t* data = (XXH3_pthreadData_t *)d; size_t n; /* TODO: DRY, this loop is copied multiple times. */ for (n = 0; n < data->nb_blocks; n++) { XXH3_accumulate(data->acc, data->input + n*data->block_len, data->secret, data->nbStripes, data->f_acc512); (data->f_scramble)(data->acc, data->secret + data->secretSize - XXH_STRIPE_LEN); } pthread_exit((void*)data); } #endif /* Note: move XXH_alignedMalloc/XXH_alignedFree above this or give protos */ XXH_FORCE_INLINE void XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc, const xxh_u8* XXH_RESTRICT input, size_t len, const xxh_u8* XXH_RESTRICT secret, size_t secretSize, XXH3_f_accumulate_512 f_acc512, XXH3_f_scrambleAcc f_scramble) { size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE; size_t const block_len = XXH_STRIPE_LEN * nbStripesPerBlock; size_t const nb_blocks = (len - 1) / block_len; size_t n = 0; XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN); #ifdef XXH_PTHREADS /* * XXH3 operates in blocks which are added together. * * Normally, this is constantly added to the acc array on the fly, like so; * acc = acc + sum[0->N] { accumulate(N) }; * * Due to the properties of addition, we can actually calculate blocks in * parallel if we start with a second acc starting zeroed: * acc = (acc + sum[0->N/2] { accumulate(N) }) * + ( 0 + sum[N/2->N] { accumulate(N) }) * * Spawning a single pthread to process half of the data is * beneficial after about 8 MiB (*very* platform dependent). * There is not much use in spawning any more threads; it already takes * hundreds of thousands of iterations for there to be a benefit, * and it would get very messy and add even more overhead. */ if (len >= XXH_PTHREADS_THRESHOLD) { /* * Using malloc is faster for some reason. Likely aliasing rules. */ XXH3_pthreadData_t* threadData = (XXH3_pthreadData_t*)XXH_alignedMalloc(sizeof(XXH3_pthreadData_t), 64); /* * If malloc succeeds, try to start a thread, otherwise fall back to * the single threaded loop after the #endif. */ if (threadData != NULL) { pthread_t thread; int threadLaunched; void* status = (void *)threadData; /* Fill the struct and set it to process the second half. */ memset(threadData->acc, 0, sizeof(threadData->acc)); threadData->input = input + ((nb_blocks - (nb_blocks / 2)) * block_len); threadData->secret = secret; threadData->secretSize = secretSize; threadData->nbStripes = nbStripesPerBlock; threadData->nb_blocks = nb_blocks - (nb_blocks / 2); threadData->f_acc512 = f_acc512; threadData->f_scramble = f_scramble; /* * Launch the thread on the second half of the input. * * We don't care about whether it actually started until later. */ threadLaunched = !pthread_create(&thread, NULL, &XXH3_accumulate_pthread, status); /* Process the first half on the main thread */ for (; n < nb_blocks / 2; n++) { XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512); f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN); } /* * If we have launched our thread, finish it, merge its acc with " the main thread's acc and mark the section as completed. * If it failed, we process the second half after the #endif * on the main thread. */ if (threadLaunched && !pthread_join(thread, &status)) { size_t i; /* Merge the acc fragments */ for (i = 0; i < XXH_ACC_NB; i++) { /* associative property */ acc[i] += threadData->acc[i]; } /* Mark that the thread successfully processed the second half */ n += threadData->nb_blocks; } XXH_alignedFree(threadData); } } #endif /* XXH_PTHREADS */ for (; n < nb_blocks; n++) { XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512); f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN); } /* ... */ } ```

As I mention in the comment, I don't see any reason to spawn more than one helper thread, as we waste hundreds of thousands of possible accumulate loops by setting up each pthread, meaning 4 threads would likely require a ridiculous 64-128 MB and a much more complicated error handling routine.

easyaspi314 commented 3 years ago

So I was wondering if we should start doing Doxygen? We don't necessarily have to set up a server for it.

Especially since xxhash.h is massive now, having a little Doxygen site might help and would probably be easier than writing xxh3_spec.md

It also gives us some opportunity to document the internals because we can group them.

Here are some examples:

XXH64 doxygen 2 xxh_force_memory_access Doxygen 4

Also, didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

Cyan4973 commented 3 years ago

I was wondering if we should start doing Doxygen?

Yes, that's a good idea. Moving code comments to Doxygen parsing convention can be done progressively.

didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

I don't see a benefit in such a change

easyaspi314 commented 3 years ago

I don't see a benefit in such a change

uint64_t is unsigned long on LP64, so it would be consistent.

easyaspi314 commented 3 years ago
$ cat test.c
#include <xxhash.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    printf("%#016" PRIx64 "\n", XXH64("hello", 5, 0));
    return 0;
}
$ gcc -std=gnu99 -O2 -Wall -c test.c -I xxHash
// Ok
$ g++ -std=gnu++11 -O2 -Wall -c -xc++ test.c -I xxHash
// Ok
$ gcc -std=gnu90 -O2 -Wall -c test.c -I xxHash -Wno-long-long
test.c: In function 'main':
test.c:7:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'XXH64_hash_t' {aka 'long long unsigned int'} [-Wformat=]
    7 |     printf("%#016" PRIx64 "\n", XXH64("Hello", 5, 0));
      |            ^~~~~~~              ~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 XXH64_hash_t {aka long long unsigned int}
In file included from test.c:3:
/usr/include/inttypes.h:127:34: note: format string is defined here
  127 | #define PRIx64   __PRI_64_prefix"x"  /* uint64_t */
$

In gnu90 mode, -Wformat fires off because on LP64, PRIx64 is "lx".

The reverse is true if you do "%llx", it will fire a warning on C++ and C99.

easyaspi314 commented 3 years ago

Some Doxygen documentation added in #462

Cyan4973 commented 3 years ago

Declaring relevant functions as pure and const would be a good follow up.

easyaspi314 commented 2 years ago
  1. XXH_SIZEOPT config option
    • ==0: normal
    • ==1: Disables forceinline and manual unrolling
    • ==2: Reuse streaming API for single shot, other dirty size hacks?
  2. Potential speed boost: We can hash multiple acc blocks at a time until a scramble. So something like this:
    accumulate_doubleAcc()
    {
    xxh_u64 acc2[8] = {0};
    size_t n = 0;
    // alternative: duffs device but that might harm
    // interleaving
    if (nbStripes % 2 == 1) {
        accumulate_512(acc2);
        n++;
    }
    while (n < nbStripes) {
        accumulate_512(acc);
        n++;
        accumulate_512(acc2);
        n++;
    }
    for (n = 0; n < 8; n++) {
        acc[n] += acc2[n];
    }
    }

I didn't see any benefit on NEON AArch64 no matter how many NEON lanes I chose, and ARMv7 and SSE2 don't have enough registers.

However, I think that AVX2 and AVX512 would likely benefit since their loops are tighter. I will benchmark on Ryzen when I get a chance.

Edit: Clang has no difference on Ryzen 5 3600 and GCC clearly gets confused.

Cyan4973 commented 1 year ago

Updated list of objectives for v0.8.2

Cyan4973 commented 1 year ago

I was considering (as a tentative objective) shipping DISPATCH=1 by default for the generation of xxhsum CLI, but there is still a significant amount of work to do to reach this stage safely, and I'm concerned it would delay v0.8.2 release by too long. So, instead, I've pushed this objective to an hypothetical v0.8.3 future release. (will be folded into v0.9.0 if need be).

t-mat commented 1 year ago

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

Cyan4973 commented 1 year ago

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

This seems like a good idea. If I understand properly, XXH_OLD_NAMES is disabled by default, so the warning could be triggered just on detecting if it's set.

t-mat commented 1 year ago

We can add "Milestone" via right side pane of issue view. And we also can assign/reuse it to other issues to indicate the milestone. It'll be useful at the future review of issues.
For example, we can assign v0.9.0 milestone to #483.