Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

General discussion about XXH3 #175

Closed easyaspi314 closed 3 years ago

easyaspi314 commented 5 years ago

This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs.

A full speed analysis will be published, it requires a lot more space than this comment can handle.

In general, expect XXH3 to run about ~2x faster on large inputs, and >3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable and will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant. It reduces the amount of mixing, resulting in faster speed on small inputs.
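For illustration, a minimal one-shot call of both variants might look like this (a sketch; the exact header and the 128-bit return type's field names are assumptions based on the current experimental branch and may change):

#include <stdio.h>
#include "xxhash.h"   /* xxh3 branch; the XXH3 prototypes may currently live in xxh3.h */

int main(void)
{
    const char msg[] = "hello xxh3";
    /* 64-bit variant: preferred when only 64 bits are needed (less mixing). */
    XXH64_hash_t  h64  = XXH3_64bits(msg, sizeof(msg) - 1);
    /* 128-bit variant: per the note above, its first 64-bit field matches h64. */
    XXH128_hash_t h128 = XXH3_128bits(msg, sizeof(msg) - 1);
    printf("XXH3_64bits : %016llx\n", (unsigned long long)h64);
    printf("XXH3_128bits: %016llx %016llx\n",
           (unsigned long long)h128.high64, (unsigned long long)h128.low64);
    return 0;
}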

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys and secret key generation.

There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete.

Cyan4973 commented 5 years ago

I think this is good documentation. I'm also open to using LaTeX in the comments. While reading a formula is not straightforward without a parser, it remains human-readable. I don't know of any better counter-proposal. Even markdeep defers to LaTeX for mathematical formulas.

Cyan4973 commented 5 years ago

Since I could not find any more improvements, I guess I'm done with the 64-bit variant. I'm going to merge the xxh3 branch soon, after updating the self-tests and verifying portability.

The 128-bit version still needs updating.

easyaspi314 commented 5 years ago

Yeah, I want to be sure the NEON code matches, because I am a little worried that it doesn't.

The 64-bit variant definitely looks good to me.

After that, should we work on that dispatcher? It is mostly complete, but I dunno how well it would work in the field. I presume most people who are distributing binaries would want the dispatcher.

I am also considering dispatching for ARMv7-A. There are some chips without NEON. However, detecting NEON is really difficult, mainly because even though there is an instruction to query it, you can't execute it from user mode: it requires a high privilege level. The Android NDK does have a function for that, and all iOS 5+ devices have NEON. It's low priority at the moment, though, as these devices are genuinely rare.
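(For reference, the NDK helper alluded to above is the cpufeatures module; a minimal sketch of the check, with the wrapper name being mine:)

#include <cpu-features.h>   /* Android NDK "cpufeatures" module; must be linked into the app */

/* Returns 1 if this is a 32-bit ARM device whose CPU reports NEON support. */
static int XXH_hasNEON_android(void)
{
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM
        && (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
}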

However, coming from someone who literally doesn't have a chip with AVX2, supporting both SSE2 and AVX2 in the same binary is far more important. I estimate that 80-90% of the people using x86 chips have no idea whether their chip supports AVX2 or not, so distributing separate builds would be confusing af to them.

Cyan4973 commented 5 years ago

Having a dispatcher seems like a good idea, but we'll have to be cautious on its potential runtime impact.

Say, someone just calls XXH3_64bits() on a small input, expecting low-latency performance, what does the dispatcher do in this case ? Does it add a test to decide which variant is more appropriate ?

I guess the test is probably fine if the amount of data to process is large enough. Now, how much is "large enough" ?

Quick note : in the near future, I intend to introduce a streaming variant. The streaming variant will require the creation of a state. Such a state might help to store the test result, avoiding probing the CPU every time.

easyaspi314 commented 5 years ago

I have a static variable, as well as a getter and a setter.

On GCC, the test is done before main(); on other compilers, XXH3_hashLong checks whether the test has been run and, if not, does the check there.

#ifdef XXH_MULTI_TARGET /* verified beforehand */

/* Prototypes for our code */
#ifdef __cplusplus
extern "C" {
#endif
void _XXH3_hashLong_AVX2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_SSE2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_Scalar(U64* acc, const void* data, size_t len, const U32* key);
#ifdef __cplusplus
}
#endif

/* What hashLong version we decided on. cpuid is a SLOW instruction -- calling it takes anywhere
 * from 30-40 to THOUSANDS of cycles, so we really don't want to call it more than once. */
static XXH_cpu_mode_t cpu_mode = XXH_CPU_MODE_AUTO;

/* xxh3-target.c will include this file. If we don't do this, the constructor will be called
 * multiple times. We don't want that. */
#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void
XXH3_featureTest(void)
{
    int max, data[4];
    /* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
    XXH_CPUID(data, /* eax */ 0);
    max = data[0];
    /* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
    if (max >= 7) {
        XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
        if (data[1] & (1 << 5)) {
            cpu_mode = XXH_CPU_MODE_AVX2;
            return;
        }
    }
    /* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
    if (max >= 1) {
        XXH_CPUID(data, /* eax */ 1);
        if (data[3] & (1 << 26)) {
            cpu_mode = XXH_CPU_MODE_SSE2;
            return;
        }
    }
    /* Must be scalar. */
    cpu_mode = XXH_CPU_MODE_SCALAR;
}

static void
XXH3_hashLong(U64* restrict acc, const void* restrict data, size_t len, const U32* restrict key)
{
    /* We haven't checked CPUID yet, so we check it now. On GCC, we try to get this to run
     * at program startup to hide our very dirty secret from the benchmarks. */
    if (cpu_mode == XXH_CPU_MODE_AUTO) {
        XXH3_featureTest();
    }
    switch (cpu_mode) {
    case XXH_CPU_MODE_AVX2:
        _XXH3_hashLong_AVX2(acc, data, len, key);
        return;
    case XXH_CPU_MODE_SSE2:
         _XXH3_hashLong_SSE2(acc, data, len, key);
         return;
    default:
         _XXH3_hashLong_Scalar(acc, data, len, key);
         return;
    }
}
#else /* !XXH_MULTI_TARGET */
   /* Include the C file directly and let the compiler decide which implementation to use. */
#  include "xxh3-target.c"
#endif /* XXH_MULTI_TARGET */

/* Should we keep this? */
XXH_PUBLIC_API void XXH3_forceCpuMode(XXH_cpu_mode_t mode)
{
#ifdef XXH_MULTI_TARGET
    cpu_mode = mode;
#endif
}

/* Should we keep this? */
XXH_PUBLIC_API XXH_cpu_mode_t XXH3_getCpuMode(void)
{
#ifdef XXH_MULTI_TARGET
    return cpu_mode;
#else
    return (XXH_cpu_mode_t) XXH_VECTOR;
#endif
}

Unless you manually reset it, cpuid is never called more than three times.

As for how it works, xxh3-target.c has the implementation of XXH3_hashLong, but the actual symbol name is defined by a macro, which will just be XXH3_hashLong on non-multitargeting code.

#ifdef XXH_MULTI_TARGET
/* The use of reserved identifiers is intentional; these are not to be used directly. */
#  if XXH_VECTOR == XXH_AVX2
#    define hashLong _XXH3_hashLong_AVX2
#  elif XXH_VECTOR == XXH_SSE2
#    define hashLong _XXH3_hashLong_SSE2
#  else
#    define hashLong _XXH3_hashLong_Scalar
#  endif
#else
#  define hashLong XXH3_hashLong
#endif

So yeah, I tried to make it as cheap as possible.

XXH3_hashLong dispatcher

I do want to mention that t1ha0 uses a function pointer instead of a jump table. IDK which is better.


#if T1HA_USE_INDIRECT_FUNCTIONS
/* Use IFUNC (GNU ELF indirect functions) to choice implementation at runtime.
 * For more info please see
 * https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
 * and https://sourceware.org/glibc/wiki/GNU_IFUNC */
#if __has_attribute(ifunc)
uint64_t t1ha0(const void *data, size_t len, uint64_t seed)
    __attribute__((ifunc("t1ha0_resolve")));
#else
__asm("\t.globl\tt1ha0\n\t.type\tt1ha0, "
      "%gnu_indirect_function\n\t.set\tt1ha0,t1ha0_resolve");
#endif /* __has_attribute(ifunc) */

#elif __GNUC_PREREQ(4, 0) || __has_attribute(constructor)

uint64_t (*t1ha0_funcptr)(const void *, size_t, uint64_t);

static __cold void __attribute__((constructor)) t1ha0_init(void) {
  t1ha0_funcptr = t1ha0_resolve();
}

#else /* T1HA_USE_INDIRECT_FUNCTIONS */

static __cold uint64_t t1ha0_proxy(const void *data, size_t len,
                                   uint64_t seed) {
  t1ha0_funcptr = t1ha0_resolve();
  return t1ha0_funcptr(data, len, seed);
}

uint64_t (*t1ha0_funcptr)(const void *, size_t, uint64_t) = t1ha0_proxy;

#endif /* !T1HA_USE_INDIRECT_FUNCTIONS */
#endif /* T1HA0_RUNTIME_SELECT */
easyaspi314 commented 5 years ago

For proof, I added a logger in XXH_CPUID which showed every time it was called.

CPUID called!
CPUID called!
CPUID called!
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH3 mode: SSE2
XXH32               :     102400 ->    85294 it/s ( 8329.5 MB/s)
XXH32 unaligned     :     102400 ->    49981 it/s ( 4880.9 MB/s)
XXH64               :     102400 ->    74702 it/s ( 7295.1 MB/s)
XXH64 unaligned     :     102400 ->    72518 it/s ( 7081.8 MB/s)
XXH3_64bits         :     102400 ->   173765 it/s (16969.3 MB/s)
XXH3_64b unaligned  :     102400 ->   171794 it/s (16776.8 MB/s)

also, my processor randomly went into super-saiyan mode for XXH32. Probably a misread :thinking:

easyaspi314 commented 5 years ago

wait, this is happening every time… (screenshot)

HOLD THE PHONE I did this when I was testing ARMv6t2


        if (((size_t)input & 3) == 0) {
            const U32* p32 = (const U32*) __builtin_assume_aligned(p, 4);
            do {
                v1 = XXH32_round(v1, *p32++);
                v2 = XXH32_round(v1, *p32++);
                v3 = XXH32_round(v1, *p32++);
                v4 = XXH32_round(v1, *p32++);
            } while ((const BYTE*)p32 < limit);
            p = (const BYTE*)p32;
        } else {
            do {
                v1 = XXH32_round(v1, XXH_get32bits(p)); p+=4;
                v2 = XXH32_round(v2, XXH_get32bits(p)); p+=4;
                v3 = XXH32_round(v3, XXH_get32bits(p)); p+=4;
                v4 = XXH32_round(v4, XXH_get32bits(p)); p+=4;
            } while (p < limit);
        }

wait, I stupid. v1 v1 v1 v1. noice one me.

                v1 = XXH32_round(v1, *p32++);
                v2 = XXH32_round(v1, *p32++);
                v3 = XXH32_round(v1, *p32++);
                v4 = XXH32_round(v1, *p32++);

nvm, my b.

Cyan4973 commented 5 years ago

OK, I think the multi-target looks good. Of great importance :

The second point is particularly important. You would be surprised : package managers actually dislike things which are runtime-decided. They are not end-users. Some of them might actually prefer a clear, simple binary behavior, with fewer maintenance risks, even if it means forgoing performance on some targets.

One thing which is not too clear for me is if this mechanism enforces the existence of xxh3.h, as a way to generate multiple targets. xxh3.h is a temporary file, which will disappear when the algorithm stabilizes : its code will be integrated into xxhash.c.

easyaspi314 commented 5 years ago

xxh3.h is not required; however, I did add xxh3-target.c, which is the file that is included or compiled three times.

This is the only clean way to do it. We could technically use macro hell, but that would be super ugly and confusing.

Basically, the build process looks like this when running make MULTI_TARGET=1 xxhsum:

cc -O3  -DXXH_MULTI_TARGET   -c -o xxhsum.o xxhsum.c
cc -O3  -DXXH_MULTI_TARGET   -c -o xxhash.o xxhash.c
cc -c -O3  -DXXH_MULTI_TARGET   xxh3-target.c -mavx2 -o xxh3-avx2.o
cc -c -O3  -DXXH_MULTI_TARGET   xxh3-target.c -msse2 -mno-sse3 -o xxh3-sse2.o
cc -c -O3  -DXXH_MULTI_TARGET   xxh3-target.c -mno-sse2 -o xxh3-scalar.o
cc   xxhsum.o xxhash.o xxh3-avx2.o xxh3-sse2.o xxh3-scalar.o   -o xxhsum

Without it, it is like this: (xxh3-target.c is included like a .h file)

cc -O3    -c -o xxhsum.o xxhsum.c
cc -O3    -c -o xxhash.o xxhash.c
cc   xxhsum.o xxhash.o   -o xxhsum

Also, I think I am liking the function pointer better. It skips the jump table.


#ifdef XXH_MULTI_TARGET
/* Prototypes for our code */
#ifdef __cplusplus
extern "C" {
#endif
void _XXH3_hashLong_AVX2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_SSE2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_Scalar(U64* acc, const void* data, size_t len, const U32* key);
#ifdef __cplusplus
}
#endif

/* What hashLong version we decided on. cpuid is a SLOW instruction -- calling it takes anywhere
 * from 30-40 to THOUSANDS of cycles, so we really don't want to call it more than once. */
static XXH_cpu_mode_t cpu_mode = XXH_CPU_MODE_AUTO;

/* The best XXH3 version that is supported. This is used for verification in XXH3_forceCpuMode to prevent
 * a SIGILL. It can be turned off with -DXXH_NO_VERIFY_MULTI_TARGET, in which case the selected hash will
 * be used unconditionally. */
static XXH_cpu_mode_t supported_cpu_mode = XXH_CPU_MODE_AUTO;

/* We also store this as a function pointer, so we can just jump to it at runtime.
 * This matches the technique used by t1ha.
 * XXX: Are ifuncs better for ELF? */
static void (*XXH3_hashLong)(U64* acc, const void* data, size_t len, const U32* key);

/* Tests features for x86 targets and sets the cpu_mode and the XXH3_hashLong function pointer
 * to the correct value.
 *
 * On GCC compatible compilers, this will be run at program startup.
 *
 * xxh3-target.c will include this file. If we don't do this, the constructor will be called
 * multiple times. We don't want that. */
#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void XXH3_featureTest(void)
{
    int max, data[4];
    /* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
    XXH_CPUID(data, /* eax */ 0);
    max = data[0];
    /* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
    if (max >= 7) {
        XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
        if (data[1] & (1 << 5)) {
            cpu_mode = supported_cpu_mode = XXH_CPU_MODE_AVX2;
            XXH3_hashLong = &_XXH3_hashLong_AVX2;
            return;
        }
    }
    /* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
    if (max >= 1) {
        XXH_CPUID(data, /* eax */ 1);
        if (data[3] & (1 << 26)) {
            cpu_mode = supported_cpu_mode = XXH_CPU_MODE_SSE2;
            XXH3_hashLong = &_XXH3_hashLong_SSE2;
            return;
        }
    }
    /* At this point, we fall back to scalar. */
    cpu_mode = supported_cpu_mode = XXH_CPU_MODE_SCALAR;
    XXH3_hashLong = &_XXH3_hashLong_Scalar;
}

/* Sets up the dispatcher and then calls the actual hash function. */
static void
XXH3_dispatcher(U64* restrict acc, const void* restrict data, size_t len, const U32* restrict key)
{
    /* We haven't checked CPUID yet, so we check it now. On GCC, we try to get this to run
     * at program startup to hide our very dirty secret from the benchmarks. */
    XXH3_featureTest();
    XXH3_hashLong(acc, data, len, key);
}

/* Default the function pointer to the dispatcher. */
static void (*XXH3_hashLong)(U64* acc, const void* data, size_t len, const U32* key) = &XXH3_dispatcher;

#else /* !XXH_MULTI_TARGET */
   /* Include the C file directly and let the compiler decide which implementation to use. */
#  include "xxh3-target.c"
#endif /* XXH_MULTI_TARGET */

/* Sets the XXH3_hashLong variant. When XXH_MULTI_TARGET is not defined, this
 * does nothing.
 *
 * Unless XXH_NO_VERIFY_MULTI_TARGET is defined, this will automatically fall back
 * to the next best XXH3 mode, so even if you set it to AVX2, the code will not
 * crash when run on, for example, a Core 2 Duo which doesn't support AVX2. */
XXH_PUBLIC_API void XXH3_forceCpuMode(XXH_cpu_mode_t mode)
{
#ifdef XXH_MULTI_TARGET
/* Defining XXH_NO_VERIFY_MULTI_TARGET will allow you to set the CPU mode to
 * an unsupported mode. */
#ifndef XXH_NO_VERIFY_MULTI_TARGET
#   define TRY_SET_MODE(mode, funcptr) \
        if (supported_cpu_mode >= (mode)) { \
            cpu_mode = (mode); \
            XXH3_hashLong = &(funcptr); \
            return; \
        }
    if (supported_cpu_mode == XXH_CPU_MODE_AUTO)
        XXH3_featureTest();
#else
#   define TRY_SET_MODE(mode, funcptr) \
        cpu_mode = (mode); \
        XXH3_hashLong = &(funcptr); \
        return;
#endif

    switch (mode) {
    case XXH_CPU_MODE_AVX2:
        TRY_SET_MODE(XXH_CPU_MODE_AVX2, _XXH3_hashLong_AVX2);
        /* FALLTHROUGH */
    case XXH_CPU_MODE_SSE2:
        TRY_SET_MODE(XXH_CPU_MODE_SSE2, _XXH3_hashLong_SSE2);
        /* FALLTHROUGH */
    case XXH_CPU_MODE_SCALAR:
        cpu_mode = XXH_CPU_MODE_SCALAR;
        XXH3_hashLong = &_XXH3_hashLong_Scalar;
        return;
    case XXH_CPU_MODE_NEON: /* ignored */
    case XXH_CPU_MODE_AUTO:
    default:
        cpu_mode = XXH_CPU_MODE_AUTO;
        XXH3_hashLong = &XXH3_dispatcher;
        return;
    }
#undef TRY_SET_MODE
#endif
}

/* Returns which XXH3 mode we are using. */
XXH_PUBLIC_API XXH_cpu_mode_t XXH3_getCpuMode(void)
{
#ifdef XXH_MULTI_TARGET
    return cpu_mode;
#else
    return (XXH_cpu_mode_t) XXH_VECTOR;
#endif
}
Cyan4973 commented 5 years ago

A hidden requirement on this project is to keep it a 2-file library if possible (xxhash.c and xxhash.h).

I think this is possible. Last year, I made a similar patch for zstd. The idea was :

In xxh3's case, I believe it's even simpler : since the SSE2, AVX2 and NEON implementations are already explicit, it's not even necessary to have these 2 stages. One can go straight to specialized functions. Just split accumulate512 and scramble into their respective variants, exclude the unwanted ones from compilation, then call the wanted one.
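A rough sketch of what that compile-time split could look like (function names are placeholders, the XXH_VECTOR / XXH_SSE2 / XXH_AVX2 macros follow the existing code, and the vector bodies are elided here):

/* Only the variant matching the compilation target is compiled in. */
#if (XXH_VECTOR == XXH_AVX2)
static void XXH3_accumulate_512_avx2(void* acc, const void* data, const void* key) { /* AVX2 body */ }
static void XXH3_scrambleAcc_avx2   (void* acc, const void* key)                   { /* AVX2 body */ }
#  define XXH3_accumulate_512 XXH3_accumulate_512_avx2
#  define XXH3_scrambleAcc    XXH3_scrambleAcc_avx2
#elif (XXH_VECTOR == XXH_SSE2)
static void XXH3_accumulate_512_sse2(void* acc, const void* data, const void* key) { /* SSE2 body */ }
static void XXH3_scrambleAcc_sse2   (void* acc, const void* key)                   { /* SSE2 body */ }
#  define XXH3_accumulate_512 XXH3_accumulate_512_sse2
#  define XXH3_scrambleAcc    XXH3_scrambleAcc_sse2
#else
static void XXH3_accumulate_512_scalar(void* acc, const void* data, const void* key) { /* scalar body */ }
static void XXH3_scrambleAcc_scalar   (void* acc, const void* key)                   { /* scalar body */ }
#  define XXH3_accumulate_512 XXH3_accumulate_512_scalar
#  define XXH3_scrambleAcc    XXH3_scrambleAcc_scalar
#endif
/* Callers then simply use XXH3_accumulate_512() / XXH3_scrambleAcc(). */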

I'm fine with the function pointer approach.

easyaspi314 commented 5 years ago

I did that for my XXH64 SSE2 implementation; I used __attribute__((__target__)).

However, it is not compatible with MSVC. MSVC is (unfortunately?) very popular among Windows users. The only way to do it for MSVC (unless I am mistaken) is to compile the file multiple times.

__attribute__((__target__)) also messed with conditionally enabled functions on older GCC/Clang versions, IIRC.

The choices are:

  1. Have an extra file and support dispatching on MSVC, GCC, and Clang
  2. Don't have an extra file, only support dispatching on GCC and Clang
  3. No dispatching.
  4. A really overcomplicated compilation system which does the equivalent of xxh3-target.c with a single source file.
easyaspi314 commented 5 years ago

Side note, unrelated to XXH3. I was messing with inline assembly for XXH64 and came up with this (pop it in XXH64_endian_align):

        U64 inp1, inp2, inp3, inp4;
        do {
#if defined(__GNUC__) && defined(__x86_64__)
            __asm__(
                "movq      (%[p]),      %[inp1]\n"
                "movq      8(%[p]),     %[inp2]\n"
                "movq      16(%[p]),    %[inp3]\n"
                "movq      24(%[p]),    %[inp4]\n"
                "imulq     %[prime2],   %[inp1]\n"
                "imulq     %[prime2],   %[inp2]\n"
                "imulq     %[prime2],   %[inp3]\n"
                "imulq     %[prime2],   %[inp4]\n"
                "addq      %[inp1],     %[v1]\n"
                "addq      %[inp2],     %[v2]\n"
                "addq      %[inp3],     %[v3]\n"
                "addq      %[inp4],     %[v4]\n"
#if defined(__BMI2__)
                "rorxq     $33,  %[v1], %[v1]\n"
                "rorxq     $33,  %[v2], %[v2]\n"
                "rorxq     $33,  %[v3], %[v3]\n"
                "rorxq     $33,  %[v4], %[v4]\n"
#elif defined(__AVX__)
                "shldq     $31,  %[v1], %[v1]\n"
                "shldq     $31,  %[v2], %[v2]\n"
                "shldq     $31,  %[v3], %[v3]\n"
                "shldq     $31,  %[v4], %[v4]\n"
#else
                "rolq      $31,  %[v1]\n"
                "rolq      $31,  %[v2]\n"
                "rolq      $31,  %[v3]\n"
                "rolq      $31,  %[v4]\n"
#endif
                "imulq     %[prime1],  %[v1]\n"
                "imulq     %[prime1],  %[v2]\n"
                "leaq      32(%[p]),   %[p]\n"
                "imulq     %[prime1],  %[v3]\n"
                "imulq     %[prime1],  %[v4]\n"
              : [p] "+r" (p), [inp1] "=&r" (inp1), [inp2] "=&r" (inp2), [inp3] "=&r" (inp3), [inp4] "=&r" (inp4), [v1] "+r" (v1), [v2] "+r" (v2), [v3] "+r" (v3), [v4] "+r" (v4)
              : [prime1] "r" (PRIME64_1), [prime2] "r" (PRIME64_2));
#else
            v1 = XXH64_round(v1, XXH_get64bits(p)); p+=8;
            v2 = XXH64_round(v2, XXH_get64bits(p)); p+=8;
            v3 = XXH64_round(v3, XXH_get64bits(p)); p+=8;
            v4 = XXH64_round(v4, XXH_get64bits(p)); p+=8;
#endif
        } while (p<=limit);

Just curious, what do you get with -march=native on your chip? (I'm assuming it has BMI2.)

easyaspi314 commented 5 years ago

Toying with VSX on a ppc64le IBM POWER9 9006-22P machine on the GCC farm. (Which only has vim on it, my least favorite editor :rage:)

Naive VSX implementation I adapted from the terrible documentation and HighwayHash:


#include <altivec.h>
typedef __vector unsigned long long U64x2;
typedef __vector unsigned U32x4;
XXH_FORCE_INLINE U64x2 XXH_multEven(U32x4 a, U32x4 b) {  // NOLINT
  U64x2 result;                                                  // NOLINT
#ifdef __LITTLE_ENDIAN__
  __asm__("vmulouw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#else
  __asm__("vmuleuw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#endif
  return result;
}
XXH_FORCE_INLINE U64x2 XXH_multOdd(U32x4 a, U32x4 b) {  // NOLINT
  U64x2 result;                                                  // NOLINT
#ifdef __LITTLE_ENDIAN__
  __asm__("vmuleuw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#else
  __asm__("vmuloww %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#endif
  return result;
}

XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
    U64x2 *const xacc = (U64x2*)acc;
    U64x2 const *const xdata = (U64x2 const*)data;
    U64x2 const *const xkey = (U64x2 const*)key;
    U64x2 const thirtytwo = { 32,  32 };
    size_t i;
    for (i = 0; i < STRIPE_LEN / sizeof(U64x2); i++) {
        U64x2 data_vec = vec_vsx_ld(0, xdata + i);
        U64x2 key_vec = vec_vsx_ld(0, xkey + i);
        U64x2 data_key = data_vec ^ key_vec;
        U32x4 shuffled = (U32x4)vec_rl(data_key, thirtytwo);
        U32x4 data_key32 = (U32x4)data_key;
        U64x2 product = XXH_multEven(data_key32, shuffled);
        xacc[i] += product;
        xacc[i] += data_vec;
    }
}
XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
    U64x2*const xacc = (U64x2*)acc;
    const U64x2 *const xkey = (const U64x2*)key;
    U64x2 const thirtytwo = { 32, 32 };
    U32x4 const prime1 = { PRIME32_1, PRIME32_1, PRIME32_1, PRIME32_1 };
    size_t i;
    for (i = 0; i < STRIPE_LEN / sizeof(U64x2); i++) {
        U64x2 const acc_vec = xacc[i];
        U64x2 const data_vec = acc_vec ^ (acc_vec >> 47);
        U64x2 const key_vec = vec_vsx_ld(0, xkey + i);
        U64x2 const data_key = data_vec ^ key_vec;
        U64x2 const prod_lo = XXH_multEven((U32x4)data_key, prime1);
        U64x2 const prod_hi = XXH_multOdd((U32x4)data_key, prime1);
        xacc[i] = prod_lo + (prod_hi << 32);
    }
}

VSX mode:

./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    47231 it/s ( 4612.4 MB/s)
XXH32 unaligned     :     102400 ->    51200 it/s ( 5000.0 MB/s)
XXH64               :     102400 ->   153600 it/s (15000.0 MB/s)
XXH64 unaligned     :     102400 ->   153600 it/s (15000.0 MB/s)
XXH3_64bits         :     102400 ->    61440 it/s ( 6000.0 MB/s)
XXH3_64b unaligned  :     102400 ->    51200 it/s ( 5000.0 MB/s)

Scalar code:

./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    46759 it/s ( 4566.3 MB/s)
XXH32 unaligned     :     102400 ->    51200 it/s ( 5000.0 MB/s)
XXH64               :     102400 ->   153600 it/s (15000.0 MB/s)
XXH64 unaligned     :     102400 ->   111306 it/s (10869.7 MB/s)
XXH3_64bits         :     102400 ->   102400 it/s (10000.0 MB/s)
XXH3_64b unaligned  :     102400 ->    76800 it/s ( 7500.0 MB/s)

However, since it only has GCC 4.8.5, the code output is apparently terrible. I generated some assembly with Clang and linked it in, and I got something much nicer.

./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    47211 it/s ( 4610.5 MB/s)
XXH32 unaligned     :     102400 ->    51200 it/s ( 5000.0 MB/s)
XXH64               :     102400 ->   153600 it/s (15000.0 MB/s)
XXH64 unaligned     :     102400 ->   153600 it/s (15000.0 MB/s)
XXH3_64bits         :     102400 ->   307200 it/s (30000.0 MB/s)
XXH3_64b unaligned  :     102400 ->   307200 it/s (30000.0 MB/s)

Edit: Yeah, GCC is puking assembly. It doesn't help that the GCC version is so old that it doesn't fully support the chip it was running on.

$ gcc -mcpu=power9
gcc: error: unrecognized argument in option ‘-mcpu=power9’
gcc: note: valid arguments to ‘-mcpu=’ are: 401 403 405 405fp 440 440fp 464 464fp 476 476fp 505 601 602 603 603e 604 604e 620 630 740 7400 7450 750 801 821 823 8540 8548 860 970 G3 G4 G5 a2 cell e300c2 e300c3 e500mc e500mc64 e5500 e6500 ec603e native power3 power4 power5 power5+ power6 power6x power7 power8 powerpc powerpc64 powerpc64le rs64 titan
gcc: fatal error: no input files
compilation terminated.
Cyan4973 commented 5 years ago

Indeed, this is a night and day difference.

easyaspi314 commented 5 years ago

Tried downloading a clang release tarball

./clang: /lib64/ld64.so.2: version `GLIBC_2.22' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./clang)

Damn it. Worth a shot.

easyaspi314 commented 5 years ago

Eyy, I got Clang 8 to work.

I had to first compile GCC from source to get libstdc++, then I had to compile glibc from source, which needs Python 3 and a newer make, and then I had to patch Clang to use the correct ld.so and set a wrapper.

Worth it. Time to dump all my changes because it is only going to get more complicated.

But yay, we have VSX support! The server and supercomputer users who still use POWER will be happy.

easyaspi314 commented 5 years ago

OK. Scalar, NEON, VSX, and SSE2 are all failing test 52 (when I remove the first #if 0) with the same value, so I am not too concerned about it.

./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Error: 64-bit hash test 52: Internal sanity check failed!
Got 0x082520CD9A539D2AULL, expected 0x802EB54C97564FD7ULL.
Note: If you modified the hash functions, make sure to either update the values
or temporarily comment out the tests in BMK_sanityCheck.
easyaspi314 commented 5 years ago

PR is open now.

Dispatching, better intrinsic documentation, and VSX support are added.

easyaspi314 commented 5 years ago

Oof. I managed to get GCC to emit some decent multiply code…

mult_hd_llvm:
        push    ebp
        push    edi
        xor     edi, edi
        push    esi
        push    ebx
        sub     esp, 12
        mov     ecx, dword ptr [esp + 32]
        mov     ebx, dword ptr [esp + 40]
        mov     ebp, dword ptr [esp + 36]
        mov     eax, ecx
        mul     ebx
        mov     dword ptr [esp], eax
        mov     eax, ebx
        mov     esi, edx
        mov     dword ptr [esp + 4], edx
        mul     ebp
        add     esi, eax
        mov     eax, ecx
        adc     edi, edx
        mul     dword ptr [esp + 44]
        mov     ecx, eax
        mov     ebx, edx
        xor     edx, edx
        add     ecx, esi
        mov     eax, ebp
        mov     esi, edi
        adc     ebx, edx
        xor     edi, edi
        mul     dword ptr [esp + 44]
        add     eax, esi
        adc     edx, edi
        xor     ebp, ebp
        add     eax, ebx
        mov     ebx, dword ptr [esp]
        adc     edx, ebp
        mov     edi, eax
        add     esp, 12
        xor     edx, ecx
        mov     eax, ebx
        pop     ebx
        xor     eax, edi
        pop     esi
        pop    edi
        pop     ebp
        ret

only by translating LLVM's output…

define i64 @mult_hd(i64, i64) local_unnamed_addr #0 {
  %3 = lshr i64 %0, 32
  %4 = lshr i64 %1, 32
  %5 = and i64 %0, 4294967295
  %6 = and i64 %1, 4294967295
  %7 = mul nuw i64 %6, %5
  %8 = mul nuw i64 %6, %3
  %9 = lshr i64 %7, 32
  %10 = add i64 %9, %8
  %11 = mul nuw i64 %4, %5
  %12 = and i64 %10, 4294967295
  %13 = add i64 %12, %11
  %14 = mul nuw i64 %4, %3
  %15 = lshr i64 %13, 32
  %16 = lshr i64 %10, 32
  %17 = add i64 %16, %14
  %18 = add i64 %17, %15
  %19 = and i64 %7, 4294967295
  %20 = shl i64 %13, 32
  %21 = or i64 %20, %19
  %22 = xor i64 %18, %21
  ret i64 %22
}

…directly to C.

__attribute__((__noinline__, __target__("no-sse2")))
uint64_t mult_hd_clang(uint64_t const p0, uint64_t const p1)
{
    uint64_t p3 = p0 & 0xFFFFFFFF;
    uint64_t p4 = p1 & 0xFFFFFFFF;
    uint64_t p5 = p4 * p3;
    uint64_t p6 = p5 >> 32;
    uint64_t p7 = p0 >> 32;
    uint64_t p8 = p4 * p7;
    uint64_t p9 = p6 + p8;
    uint64_t p10 = p9 >> 32;
    uint64_t p11 = p1 >> 32;
    uint64_t p12 = p11 * p3;
    uint64_t p13 = p9 & 0xFFFFFFFF;
    uint64_t p14 = p13 + p12;
    uint64_t p15 = p14 >> 32;
    uint64_t p16 = p11 * p7;
    uint64_t p17 = p10 + p16;
    uint64_t p18 = p17 + p15;
    uint64_t p19 = p14 << 32;
    uint64_t p20 = p5 & 0xFFFFFFFF;
    uint64_t p21 = p19 | p20;
    uint64_t p22 = p18 ^ p21;
    return p22;
}

One instruction per line. The ELI5 for compilers.

diff --git a/README.md b/README.md
index 96ecfec..ed4f4ec 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Code is highly portable, and hashes are identical on all platforms (little / big
 |master      | [![Build Status](https://travis-ci.org/Cyan4973/xxHash.svg?branch=master)](https://travis-ci.org/Cyan4973/xxHash?branch=master) |
 |dev         | [![Build Status](https://travis-ci.org/Cyan4973/xxHash.svg?branch=dev)](https://travis-ci.org/Cyan4973/xxHash?branch=dev) |

-
+Compile this with Clang. GCC is a highly overrated compiler that can't generate fast code unless you write it like assembly or *in* assembly.

 Benchmarks
 -------------------------

smh

easyaspi314 commented 5 years ago

@Cyan4973 Would this be a proper state struct?

typedef struct {
    U64 acc[8]; /* ACC_NB */
    BYTE input[1024]; /* block_len */
    U64 seed;
    U64 length;
    U16 input_size;
    BYTE seen_large_len; /* >= 1024 */
    BYTE seen_medium_len; /* >= 16 */
    U32 reserved; /* take advantage of padding memes */
} XXH3_state_t;

A hefty struct compared to the 48/88 byte XXH32 and XXH64 states.

Cyan4973 commented 5 years ago

The XXH3 state struct will have to be larger than XXH32's and XXH64's. But I believe it only needs to preserve 128 bytes of input. With the accumulator and other information, it will need 200+ bytes.

easyaspi314 commented 5 years ago

So you are basically saying we should have it start halfway in XXH3_accumulate?

So something like this:

typedef struct {
     U64 acc[8]; /* 0-64 */
     BYTE input[128]; /* 64-192 */
     U64 seed; /* 192-200 */
     U64 total_len; /* 200-208 */
     BYTE input_size; /* 208-209 */
     BYTE round_num; /* 209-210 which accumulate step we are on */
     BYTE seen_large_len; /* 210-211 */
     BYTE seen_medium_len; /* 211-212 */
     U32 reserved; /* 212-216 */
} XXH3_state_t;

This weighs in at a much more reasonable 216 bytes (instead of 1112 bytes). If we use larger types instead of packing it like I did, it will be a little larger (with U32 fields it would be 224, minus the reserved field, and with U64 fields 240). However, since we basically access these once per update, I don't see a reason not to pack them. What do you think?
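If we settle on a fixed size like that, a cheap compile-time guard could pin it down (purely illustrative; it uses the negative-array-size trick so it stays C89-friendly):

/* Breaks the build if XXH3_state_t ever drifts away from the agreed 216 bytes. */
typedef char XXH3_state_size_check[(sizeof(XXH3_state_t) == 216) ? 1 : -1];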

Cyan4973 commented 5 years ago

Yes, something like that. Also, it's not useful to try to save a few bytes. Actually, it's better to keep a little margin, in order to preserve the ability to introduce new features without breaking the ABI (i.e., the structure size, once published, should never change).

42Bastian commented 5 years ago

Maybe add a version number, so you can add more elements later and old software can at least detect new structures. And maybe keep it a multiple of 32 (or even 64) bytes for cache optimization.

Cyan4973 commented 5 years ago

Adding a version number introduces new initialization problems. I hope we will be able to avoid them. I have only a little experience with it, and unfortunately, so far, I have not found that method to be protective enough, while it directly impacts API design.

Keeping the size a multiple of 32/64 bytes is reasonable. We may have issues, though, with variable-size fields, such as pointers or size_t.

42Bastian commented 5 years ago

The question is why worry about it at all. The state is something local; it is not exported from one system to another, so it only has to be reasonable for the machine it is used on. Therefore, yes, no version is needed. But aligning the elements to benefit from caches might give a µs or two of performance gain here and there. I think it is already well aligned, though.

easyaspi314 commented 5 years ago

#define XXH3_ACCUMULATE_512_X86(vec_t, _mm_, _si) /* generic macro for sse2 and avx2 */     \
    do {                                                                                    \
        assert((((size_t)acc) & (sizeof(vec_t) - 1)) == 0);                                 \
        {                                                                                   \
           ALIGN(sizeof(vec_t)) vec_t* const xacc  =       (vec_t *) acc;                   \
           const                vec_t* const xdata = (const vec_t *) data;                  \
           const                vec_t* const xkey  = (const vec_t *) key;                   \
                                                                                            \
           size_t i;                                                                        \
           for (i = 0; i < STRIPE_LEN / sizeof(vec_t); i++) {                               \
               /* data_vec = xdata[i]; */                                                   \
               vec_t const data_vec = _mm_##loadu##_si      (xdata + i);                    \
               /* key_vec  = xkey[i];  */                                                   \
               vec_t const key_vec  = _mm_##loadu##_si      (xkey + i);                     \
               /* data_key = data_vec ^ key_vec; */                                         \
               vec_t const data_key = _mm_##xor##_si        (data_vec, key_vec);            \
               /* shuffled = data_key[1, undef, 3, undef]; // or data_key >> 32; */         \
               vec_t const shuffled = _mm_##shuffle_epi32   (data_key, 0x31);               \
               /* product  = (shuffled & 0xFFFFFFFF) * (data_key & 0xFFFFFFFF); */          \
               vec_t const product  = _mm_##mul_epu32       (shuffled, data_key);           \
                                                                                            \
               /* xacc[i] += data_vec; */                                                   \
               xacc[i] = _mm_##add_epi64 (xacc[i], data_vec);                               \
               /* xacc[i] += product; */                                                    \
               xacc[i] = _mm_##add_epi64 (xacc[i], product);                                \
           }                                                                                \
       }                                                                                    \
   } while (0)
#define XXH3_SCRAMBLE_ACC_X86(vec_t, _mm_, _si) /* generic macro for sse2 and avx2 */       \
    do {                                                                                    \
        assert((((size_t)acc) & (sizeof(vec_t) - 1)) == 0);                                 \
        {                                                                                   \
            ALIGN(sizeof(vec_t)) vec_t* const xacc  =       (vec_t *) acc;                  \
            const                vec_t* const xkey  = (const vec_t *) key;                  \
            const vec_t prime = _mm_##set1_epi32 ((int) PRIME32_1);                         \
            size_t i;                                                                       \
            for (i = 0; i < STRIPE_LEN / sizeof(vec_t); i++) {                              \
                /* data_vec = xacc[i] ^ (xacc[i] >> 47); */                                 \
                vec_t const acc_vec  = xacc[i];                                             \
                vec_t const shifted  = _mm_##srli_epi64    (acc_vec, 47);                   \
                vec_t const data_vec = _mm_##xor##_si      (acc_vec, shifted);              \
                                                                                            \
                /* key_vec  = xkey[i]; */                                                   \
                vec_t const key_vec  = _mm_##loadu##_si    (xkey + i);                      \
                /* data_key = data_vec ^ key_vec; */                                        \
                vec_t const data_key = _mm_##xor##_si      (data_vec, key_vec);             \
                /* shuffled = data_key[1, undef, 3, undef]; // data_key >> 32; */           \
                vec_t const shuffled = _mm_##shuffle_epi32 (data_key, 0x31);                \
                                                                                            \
                /* data_key *= PRIME32_1; // 32-bit * 64-bit */                             \
                                                                                            \
                /* prod_hi = data_key >> 32 * PRIME32_1; */                                 \
                vec_t const prod_hi = _mm_##mul_epu32      (shuffled, prime);               \
                /* prod_hi_top = prod_hi << 32; */                                          \
                vec_t const prod_hi_top = _mm_##slli_epi64 (prod_hi, 32);                   \
                /* prod_lo = (data_key & 0xFFFFFFFF) * PRIME32_1; */                        \
                vec_t const prod_lo = _mm_##mul_epu32      (data_key, prime);               \
                /* xacc[i] = prod_hi_top + prod_lo; */                                      \
                xacc[i] = _mm_##add_epi64 (prod_hi_top, prod_lo);                           \
            }                                                                               \
        }                                                                                   \
    } while (0)

@Cyan4973 like this? If we are gonna do the GCC dispatch, this is better for it.

__attribute__((__target__("avx2")))
void XXH3_accumulate_512_AVX2(void *restrict acc, const void *restrict key, const void *restrict data)
{
    XXH3_ACCUMULATE_512_X86(__m256i, _mm256_, _si256);
}
__attribute__((__target__("avx2")))
void XXH3_scrambleAcc_AVX2(void *restrict acc, const void *restrict key)
{
    XXH3_SCRAMBLE_ACC_X86(__m256i, _mm256_, _si256);
}
__attribute__((__target__("sse2")))
void XXH3_accumulate_512_SSE2(void *restrict acc, const void *restrict key, const void *restrict data)
{
    XXH3_ACCUMULATE_512_X86(__m128i, _mm_, _si128);
}
__attribute__((__target__("sse2")))
void XXH3_scrambleAcc_SSE2(void *restrict acc, const void *restrict key)
{
    XXH3_SCRAMBLE_ACC_X86(__m128i, _mm_, _si128);
}
Cyan4973 commented 5 years ago

It's true that the AVX2 and SSE2 code paths are very similar, featuring only some predictable name changes. And I also know your suggested construction works.

But I'm kind of cautious about it. Defining large functions through macros can make debugging more challenging, and even impair readability.

I suspect the main expected benefit is that any change is automatically propagated to both variants. However :

All in all, I believe that readability is slightly negatively impacted, while writability benefits slightly.

Over the long term, it's generally better for a source code to be "reader oriented". So I think I prefer for both variants to be written clearly.

easyaspi314 commented 5 years ago

Ok. Gotcha.

Also, do you want the PPC64 code next?

Cyan4973 commented 5 years ago

Oh yes !

Cyan4973 commented 5 years ago

Opened discussion :

In branch xxh128, I'm testing a new formula for the 128-bit variant of XXH3. The core concept is that each input influences 2 accumulators, hence 128 bits of state. There are still 8 lanes, so the total internal state is widened to 1024 bits. This takes more memory, and makes initialization and termination more complex.

The design change was prompted by the claim that in the initial XXH3 proposal, since each lane is 64-bit, the combination of 8x 64-bit lanes is still a 64-bit hash.

I'm currently wondering if this statement is correct.

If it were about a cryptographic hash, it would be correct, indeed. Knowing the internal design of the hash algorithm, an attacker could concentrate their modifications into a single lane, and therefore only have to search through 64 bits, instead of 128 bits, in order to produce a collision.

But for a non-cryptographic hash, this is not supposed to be a scenario. As a checksum, xxHash's mission is merely to provide a 128-bit random-like value, and to guarantee that any accidental change to the source has only a 1 in 2^128 chance of generating the same hash. Almost as good as "none".

So one problem here is to define "accidental". It obviously includes storage and transmission errors, such as truncation, block nullifying, noisy segment, etc. But it can also be something as light as a trivial field update in a larger file, such as a date in a movie's metadata.

The main claim is that a 128-bit hash using XXH3's initial design effectively degenerates into a 64-bit one if all the changes happen in the same lane. A lane is a series of 8-byte fields, one every 64 bytes. As soon as the change impacts 2 lanes, it impacts 128 bits of internal state, and the claim no longer holds.
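To make the striping concrete, here is a tiny illustrative helper (not part of the code base) mapping a byte offset of the input to the lane it feeds:

#include <stddef.h>

/* Within each 64-byte stripe, bytes [0..7] feed lane 0, [8..15] feed lane 1,
 * ..., [56..63] feed lane 7; the pattern then repeats on the next stripe. */
static size_t XXH3_laneOfByte(size_t byteOffset)
{
    return (byteOffset % 64) / 8;
}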

Therefore, how probable is it that an accidental change only impacts a single lane ?

Well, that's a hard one, especially as changes and errors are rarely "randomly scattered". In general, multiple consecutive bytes are impacted. And as soon as this range of modified bytes is larger than 8 bytes, it impacts 2+ lanes. And if it's shorter than 8 bytes, then it's not enough to generate a collision.

Bottom line : I'm okay with the fact that a bunch of 64-bit checksums combined do not make a 128-bit cryptographic checksum. But I'm unsure if the statement can be repeated unchanged for non-cryptographic checksums. This would require that a change only impacts one of those smaller checksums. The fact that the lanes are interleaved makes it more difficult for an accidental change to not impact 2+ lanes. Such an event must have some probability of occurring. It's clearly not 100%, but how much is it ? Should it be < 1 in 2^64, I'm wondering if that would qualify the 8x64-bit design as "good enough" for a 128-bit checksum.

Cyan4973 commented 5 years ago

In branch wsecret, I'm slightly modifying the API to make it "secret first".

It always was the plan to make the algorithm able to consume any external source of secret. Since the secret is involved at every step of the calculation, it makes it more difficult to generate a collision without knowledge of the secret.

With this design in place, the seed is now just a convenient short handle, able to generate a custom secret based on the combination of the default secret and some transform dependent on the seed. It's less powerful, but it still does a decent job of hardening the hash generator.
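Purely as an illustration of that idea (this is not necessarily the actual transform in the wsecret branch): the seed can be folded into every word of the default secret, so that every step of the hash depends on it:

#include <stddef.h>
#include <string.h>

/* Illustrative only: build a custom secret by mixing the seed into each
 * 64-bit word of the default secret. */
static void XXH3_genSecretFromSeed_sketch(void* dst, const void* defaultSecret,
                                          size_t secretSize, unsigned long long seed)
{
    size_t i;
    memcpy(dst, defaultSecret, secretSize);
    for (i = 0; i + 8 <= secretSize; i += 8) {
        unsigned long long w;
        memcpy(&w, (const char*)dst + i, 8);
        w += seed;   /* any seed-dependent transform works for this purpose */
        memcpy((char*)dst + i, &w, 8);
    }
}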

As a consequence, the result of the hash for len==0 must be updated, as it used to be the seed value, but:

1) The seed is no longer central. The expectation is that users interested in security will rather provide their own secret, which is much richer than 8 bytes.
2) If one makes the hash of the empty string a value depending on the secret, it cannot be the seed, since the simulated "seed" for custom secrets is 0; different secrets would therefore all return the same value 0 anyway.
3) Making the seed the return value when len==0 makes it easy for any external observer to recover the seed, thus simplifying the next step (generating collisions).

As a consequence, I've simplified this part : the hash of the empty string is now always 0, whatever the seed or the custom secret.

I suspect it's a very low priority topic, but if someone believes that the hash of the empty string should change when the secret changes, due to some scenario depending on it, this is still a design decision that can be reviewed during this design phase.

Cyan4973 commented 5 years ago

In branch wsecret, the "secret" is now defined as "any blob of bytes" respecting a minimum size.

This is a small but important change compared to its previous definition as a table of 32-bit values. It is designed to remove any compatibility complexity with regard to the platform's endianness. It's also supposed to remove any alignment restriction, so the "custom secret" can really be anything.

As a consequence, the "secret" must be consumed in a way which is endian independent. This is achieved by XXH3_readKey64(), which effectively delegates to XXH_readLE64() now, which is both endian and alignment independent.
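For context, an endian-independent read is just byte-by-byte assembly in little-endian order; a sketch of the technique (not necessarily the exact xxHash implementation):

/* Reads 8 bytes as a little-endian 64-bit value; no alignment or native
 * endianness requirement on the source pointer. */
static unsigned long long readLE64_sketch(const void* ptr)
{
    const unsigned char* p = (const unsigned char*)ptr;
    return  (unsigned long long)p[0]
         | ((unsigned long long)p[1] << 8)
         | ((unsigned long long)p[2] << 16)
         | ((unsigned long long)p[3] << 24)
         | ((unsigned long long)p[4] << 32)
         | ((unsigned long long)p[5] << 40)
         | ((unsigned long long)p[6] << 48)
         | ((unsigned long long)p[7] << 56);
}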

But this read function is bypassed when using a vector instruction set. For SSE2 and AVX2, no problem : the code uses _mm_loadu_si128() or _mm256_loadu_si256(), which have no alignment restriction, so it's fine. For NEON and VSX, that's less clear to me, because I'm not fluent in these instruction sets and have difficulty finding useful information on the Internet (a Google search brings up a ton of barely relevant pages). @easyaspi314 , I'll need your help on these. I suspect it's fine, as vld1q_u32() (NEON) and vec_vsx_ld() (VSX) are used for both key and data, and there already was no guarantee of alignment on data, but I wouldn't mind a confirmation.

If this could be confirmed, I could change the code comment in xxhash.h, which currently requires a specific alignment, just out of an abundance of caution. But if it's not necessary, it's better to remove it, as it gives great flexibility for custom secrets.

A later more complex / convoluted question is if it would be beneficial to use the fact that the default secret kKey can be made aligned in order to improve performance when using it, or when the custom secret can be detected aligned. On x64 with SSE2 or AVX2, the answer is clearly "no", so there's no need to add complexity. I would expect a similar answer for arm64 or ppc64, but I'm less sure about arm32 for example, or mips32. No need to look into this now, this is just a speed refinement, for a later stage.

easyaspi314 commented 5 years ago

Correct, vld1q and vec_vsx_ld are the things you need.

Sorry about the inactivity, I am really busy. Once I get this Calculus done I can write it for you if you need, but it shouldn't be too complicated. Just note that

vec_rl(vec, v32);

Is

vec = (vec << 32) | (vec >> 32);

Which I used because the keys would be in reverse order on big endian.

And

vec_revb

Is your average byte swap.

As for x86, there is no harm in the secret being aligned, as Penryn and earlier strongly prefer movdqa over movdqu. However, the check is only beneficial on those chips; on newer chips, unaligned loads on aligned data carry no penalty, so the overhead of the branch makes the check not worth it.

ARM32 NEON only has a single cycle penalty; it isn't worth it. However, it will alignment fault if you use vld1q_u64 on GCC (it compiles to vld1.64 q0, [r0:128] which requires r0 to be 16-byte aligned).
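(For what it's worth, one way around that fault is to load as bytes and reinterpret; a sketch:)

#include <arm_neon.h>

/* vld1q_u8 carries no alignment hint, unlike vld1q_u64 which GCC may emit as
 * "vld1.64 {...}, [rN:128]" and which then faults on unaligned pointers. */
static uint64x2_t loadU64x2_unaligned(const void* ptr)
{
    return vreinterpretq_u64_u8(vld1q_u8((const uint8_t*)ptr));
}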

VSX has no alignment penalty AFAIK.

Cyan4973 commented 5 years ago

Excellent, thanks for the detailed answer @easyaspi314 ! And I wish you the best for your end-of-year evaluations!

Cyan4973 commented 5 years ago

Btw @easyaspi314 ,

I added VSX code path verification on Travis CI, which was previously testing only the "scalar" code path on PowerPC, but unfortunately, compilation fails :

In file included from xxhash.c:1021:0:
xxh3.h: In function ‘XXH3_accumulate_512’:
xxh3.h:500:9: warning: implicit declaration of function ‘vec_revb’ [-Wimplicit-function-declaration]
         U64x2 const data_vec = vec_revb(vec_vsx_ld(0, xdata + i));
         ^
xxh3.h:500:9: error: AltiVec argument passed to unprototyped function

It seems it doesn't know vec_revb() ?

Cyan4973 commented 5 years ago

This documentation seems to imply that vec_revb() availability is tied to compilation flag -mcpu=power9.

I tried it on Ubuntu Trusty, which is the environment used for PowerPC tests on TravisCI, but it doesn't work : this compiler is limited to power8 at best.

This patch seems to imply that vec_revb() was later back-ported to power8, so it could work on this cpu too. Unfortunately, it is probably too late for the compiler version in Ubuntu Trusty : I tried it on Travis CI, and it still fails.

This makes me think that the automated detection of VSX has quite a number of conditions to check to make sure compilation will be successful.

edit : and indeed, if I don't force the VSX mode and let the detection macro do the work, it happily triggers the VSX code path with -m64 -maltivec -mvsx, and then fails due to absence of vec_revb().

edit 2 : I will disable VSX tests on Travis CI for the time being. We still need PowerPC tests for big-endian compatibility.

easyaspi314 commented 5 years ago

Indeed. This was intended for POWER9; I don't know if it would work on POWER8.

I found that compiler support was rather lackluster, especially on the GCC side. Only GCC 7 or so supports it, and there don't seem to be a lot of macros to discern the versions. I will have to dig some more.

I know that Google used some inline assembly to work around the terrible compiler support and the inconsistent multiply intrinsics.

Granted, anyone who is using PowerPC64 nowadays is not likely to be the average user who doesn't know the difference between 32-bit and 64-bit or doesn't know what AVX2 is. But I still want to get things working properly.

Cyan4973 commented 5 years ago

With the release of streaming API for XXH3_64bits(), there's only a very limited number of actions remaining before moving on to XXH128().

1) Following @aras-p's investigation, it seems preferable to extend the "short mode" to longer sizes, in order to reduce the performance cliff related to "long mode" initialization. This has direct consequences on the streaming state, which is why it needed to be done in this order. Also, preserving hash quality is non-trivial, but I believe I've got a solution too, with the earlier move to "secret first" helping this objective.

2) Wasm performance is still an issue, and it seems that "unrolling" is actually bad for this target. But in order to fix this, I must be able to observe it first. I made a little progress this morning, as emscripten compilation works again on my Mac. But compilation is just a first step : I still need to find a way to actually run a wasm test program and make measurements.

3) Lastly, I'm currently wondering about the introduction of a new property : it seems it could be possible to make XXH3 essentially bijective on len==8, and injective for any len < 8. It basically means that, for 2 inputs of same len <= 8, there would be a guarantee of no collision if both inputs are different (instead of the usual 1 / 2^64 probability). I'm wondering if this would be useful. The drawback is that, in order to keep 1st-class avalanche effect on any subrange of bits of the output, I need to add one multiplication, making the hash a little bit slower (for len <= 8). Neither the property nor the impact seem large, so it's more a matter of preference : which one seems to matter more ?

Zhentar commented 5 years ago

I'll have to think it through more, but my main use case for xxhash is as a hash code for hash table lookups (where the big advantage of a high quality 64bit+ hash code is that you can get away with skipping key equality comparisons in far more scenarios than you could with something like an FNV variant), and I'm pretty sure in at least some use cases I could leverage an explicit guarantee of no collisions for keys <= 8 bytes for far more savings than the cost of a single multiply. So currently my vote would be to make it bijective at the cost of a multiply.
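A sketch of the kind of lookup shortcut being described (hypothetical table layout; it relies on the proposed guarantee of no collisions between same-length keys <= 8 bytes):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t hash;      /* full 64-bit XXH3 of the key */
    uint8_t  key_len;   /* keys up to 32 bytes in this toy layout */
    char     key[32];
    int      value;
} Entry;

/* With the bijectivity guarantee, equal hashes of equal-length keys <= 8 bytes
 * imply equal keys, so the memcmp can be skipped for them entirely. */
static int entry_matches(const Entry* e, uint64_t h, const char* key, size_t len)
{
    if (e->hash != h || e->key_len != len) return 0;
    if (len <= 8) return 1;
    return memcmp(e->key, key, len) == 0;
}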

I am curious how this would work with a seed-independent zero hash for zero-length input, though (which I also think is something I could leverage for an optimization on lookups); can you simultaneously guarantee both no collisions len <= 8 and seed independent zero hash for zero length?

aras-p commented 5 years ago

I still need to find a way to actually run a wasm test program and make measurements

Hmm for me running stuff on Wasm was super simple, basically just:

(screenshot)

My "compile wasm" build script is https://github.com/aras-p/HashFunctionsTest/blob/master/compile_emscripten.sh if you need an actual full example

Cyan4973 commented 5 years ago

can you simultaneously guarantee both no collisions len <= 8 and seed independent zero hash for zero length?

It depends on what you mean by "no collisions len <= 8".

Note that above-mentioned guarantee of no collision would be _for 2 inputs of same len <= 8_ . If that's what you meant, then yes, it's possible.

It is impossible to guarantee no collision for any 2 inputs of len <= 8, just due to the pigeonhole principle. Or in more mathematical terms, there cannot be a bijection between 2 sets of different sizes.

A possible property would be to guarantee no collision for any 2 inputs of len < 8. In that case, the starting set is smaller than the destination one, so it's possible. Unfortunately, I have so far ruled out this property because it would be too easy to generate an intentional secret-independent collision between an input of len < 8 and one of len == 8. However, if that's a desirable property, I could look again to find a better solution.
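In numbers, counting byte strings (a quick check of both statements, in LaTeX since we were open to it above):

\[
\sum_{l=0}^{7} 256^{l} = \frac{256^{8}-1}{255} \approx 7.2\times10^{16} < 2^{64},
\qquad
256^{8} = 2^{64},
\qquad
\sum_{l=0}^{8} 256^{l} > 2^{64}.
\]

So an injection is possible for len < 8, a bijection is possible exactly at len == 8, and by the pigeonhole principle no collision-free mapping exists over all inputs of len <= 8.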

Cyan4973 commented 5 years ago

Thanks for guidance @aras-p , I'll sure try this setup as soon as possible.

Zhentar commented 5 years ago

Ah, yeah, of course! That makes sense. For my purposes, no same-length collisions for len <= 8 is definitely better than no collisions for len < 8.

Cyan4973 commented 5 years ago

Thanks to @aras-p 's hints, I started benchmarks in WebAssembly mode this morning. I've only used one platform, a recent Mac laptop, featuring a recent Intel Core, and Chrome 74.

Using the "wide range" test, it mostly confirms what we already knew : XXH3 on wasm is a bit faster than XXH32, but slower than XXH64. This is consistent with a non-vectorized scalar code path run on 64-bit cpu.

I'm afraid I don't see a simple and immediate solution. The longer-term hope is that vectorial instructions will start working in a future emcc version, leading to auto-vectorization of the scalar code path, which currently works very well with clang. If that's too far away, or too unsure, another route could be to generate a new code path, using gcc/clang vectorial intrinsics, which are apparently supported by emcc.

The wide range test doesn't allow me to observe the cliff at 128 bytes and 1 KB, because the distance between 2 consecutive measurements is too large (it's a power of 2). I will use the more accurate length-byte test to specifically target these areas, and make sure the next changes improve wasm performance.

Cyan4973 commented 5 years ago

With the last changes to the dev branch, I believe XXH3_64bits is now "feature complete". It incorporates all points mentioned above. I don't plan any more modifications, but obviously, the design is still open to comments and improvements.

Now, all that remains to be done is ... to apply the same mechanisms to the 128-bit variant...

easyaspi314 commented 4 years ago

Yawn… I'm back.

School is finally over, and I have a lot of free time for the next few days.

I have to replace my G3 very soon because the screen is dying. It is still functional, but I don't know for how long. God, I suck at decisions. I also plan to replace the Dell tower later this summer.

@Cyan4973 So, what did I miss?

I was looking at the recent changes, and it looks like the ARM and VSX paths need to be updated.

Could you please give me a quick recap about the recent changes to the algorithm? That would make things a little easier.

easyaspi314 commented 4 years ago

By the way, speaking of decisions, have you decided on how/if we are going to do the dispatcher?

Cyan4973 commented 4 years ago

XXH3_64bits() is stabilized, including its streaming implementation.

XXH3_128bits() is not, and that's my next priority. The current proposal in XXH128() does meet a few requirements, but constraints on performance seem too high, so I'll likely have to update it.

Hence, speaking of updates, prefer concentrating on XXH3_64bits(). I don't remember if NEON and VSX paths are correctly tested / validated.

The dispatcher has a place. I see it more as part of xxhsum than the library itself, but I can be mistaken in describing this separation.

You also proposed an interesting patch reducing instruction size, which looks good. It's been preserved in a feature branch, reroll. But the changes within dev are so large that it's more complex to merge now.

easyaspi314 commented 4 years ago

The dispatcher has a place. I see it more as part of xxhsum than the library itself

Well, as I mentioned before, Google Play Store guidelines require a feature test for ARMv7 NEON support. As for the AVX2/SSE2 paths, if anyone wants to use the AVX2 path, distributing binaries without a dispatcher is impractical. People are idiots and wouldn't know which binary to use. 🤷‍♂

Considering how key reading has changed, it seems that PPC big endian needs an update. I'll do some testing either tomorrow or Friday.

As for the reroll branch, I suggest merging the XXH32/XXH64 changes for now, and I'll try to work on XXH3 when I get a chance. Besides, it had a branching issue IIRC.