easyaspi314 closed this 3 years ago
I think this is good documentation. I'm also open to using LaTeX in the comments. While reading the formulas is not straightforward without a parser, they remain human-readable. I don't know of any better counter-proposal. Even markdeep defers to LaTeX for mathematical formulas.
Since I could not find any further improvements, I guess I'm done with the 64-bit variant.
I'm going to merge the xxh3 branch soon, after updating the self-tests and verifying portability.
The 128-bit version still needs updating.
Yeah, I want to be sure the NEON code matches because I am a little worried that it isn't.
The 64-bit variant definitely looks good to me.
After that, should we work on the dispatcher? It is mostly complete, but I dunno how well it would work in the field. I presume most people who distribute binaries would want the dispatcher.
I am also considering dispatching on ARMv7-A. There are some chips without NEON. However, detecting it is really difficult: even though there is an instruction for it, you can't call it from user code because it requires a high privilege level. The Android NDK does have a function for that, and all iOS 5+ devices have NEON. It's low priority at the moment, though, as these devices are genuinely rare.
However, coming from someone who literally doesn't have a chip with AVX2, supporting both SSE2 and AVX2 in the same binary is far more important. I estimate that 80-90% of people using x86 chips have no idea whether their chip supports AVX2, so distributing separate builds would be confusing af to them.
Having a dispatcher seems like a good idea, but we'll have to be cautious about its potential runtime impact.
Say someone just calls XXH3_64bits() on a small input, expecting low-latency performance: what does the dispatcher do in this case? Does it add a test to decide which variant is more appropriate?
I guess the test is probably fine if the amount of data to process is large enough. Now, how much is "large enough"?
Quick note: in the near future, I intend to introduce a streaming variant. The streaming variant will require the creation of a state. Such a state could store the test result, avoiding probing the CPU every time.
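A minimal sketch of that idea (all names here are hypothetical, not the actual xxHash API): the state caches the probe result, so the CPU is tested at most once per state rather than on every call:

```c
#include <assert.h>

typedef enum { MODE_UNKNOWN = 0, MODE_SCALAR, MODE_SSE2, MODE_AVX2 } cpu_mode_t;

static int probe_count = 0;           /* instrumentation for this sketch */

static cpu_mode_t probe_cpu(void)
{
    probe_count++;                    /* stand-in for a real cpuid query */
    return MODE_SCALAR;
}

typedef struct {
    cpu_mode_t mode;                  /* cached probe result */
    /* ...accumulators, buffered input, etc... */
} stream_state_t;

void stream_init(stream_state_t *s) { s->mode = MODE_UNKNOWN; }

void stream_update(stream_state_t *s)
{
    if (s->mode == MODE_UNKNOWN)      /* probe lazily, at most once per state */
        s->mode = probe_cpu();
    /* ...dispatch the long-input loop on s->mode... */
}

/* Counts how many times the probe runs across n updates of one state. */
int probe_calls_after(int n_updates)
{
    stream_state_t s;
    int i;
    probe_count = 0;
    stream_init(&s);
    for (i = 0; i < n_updates; i++)
        stream_update(&s);
    return probe_count;
}
```

A static variable gives the same amortization process-wide instead of per-state.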
I have a static variable, as well as a getter and a setter. On GCC, the test is done before main(); other compilers check in XXH3_hashLong whether the test has been run, and do the check there.
#ifdef XXH_MULTI_TARGET /* verified beforehand */

/* Prototypes for our code */
#ifdef __cplusplus
extern "C" {
#endif
void _XXH3_hashLong_AVX2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_SSE2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_Scalar(U64* acc, const void* data, size_t len, const U32* key);
#ifdef __cplusplus
}
#endif

/* What hashLong version we decided on. cpuid is a SLOW instruction -- calling it takes anywhere
 * from 30-40 to THOUSANDS of cycles, so we really don't want to call it more than once. */
static XXH_cpu_mode_t cpu_mode = XXH_CPU_MODE_AUTO;

/* xxh3-target.c will include this file. If we don't do this, the constructor will be called
 * multiple times. We don't want that. */
#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void
XXH3_featureTest(void)
{
    int max, data[4];
    /* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
    XXH_CPUID(data, /* eax */ 0);
    max = data[0];
    /* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
    if (max >= 7) {
        XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
        if (data[1] & (1 << 5)) {
            cpu_mode = XXH_CPU_MODE_AVX2;
            return;
        }
    }
    /* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
    if (max >= 1) {
        XXH_CPUID(data, /* eax */ 1);
        if (data[3] & (1 << 26)) {
            cpu_mode = XXH_CPU_MODE_SSE2;
            return;
        }
    }
    /* Must be scalar. */
    cpu_mode = XXH_CPU_MODE_SCALAR;
}

static void
XXH3_hashLong(U64* restrict acc, const void* restrict data, size_t len, const U32* restrict key)
{
    /* We haven't checked CPUID yet, so we check it now. On GCC, we try to get this to run
     * at program startup to hide our very dirty secret from the benchmarks. */
    if (cpu_mode == XXH_CPU_MODE_AUTO) {
        XXH3_featureTest();
    }
    switch (cpu_mode) {
    case XXH_CPU_MODE_AVX2:
        _XXH3_hashLong_AVX2(acc, data, len, key);
        return;
    case XXH_CPU_MODE_SSE2:
        _XXH3_hashLong_SSE2(acc, data, len, key);
        return;
    default:
        _XXH3_hashLong_Scalar(acc, data, len, key);
        return;
    }
}

#else /* !XXH_MULTI_TARGET */
/* Include the C file directly and let the compiler decide which implementation to use. */
#  include "xxh3-target.c"
#endif /* XXH_MULTI_TARGET */

/* Should we keep this? */
XXH_PUBLIC_API void XXH3_forceCpuMode(XXH_cpu_mode_t mode)
{
#ifdef XXH_MULTI_TARGET
    cpu_mode = mode;
#endif
}

/* Should we keep this? */
XXH_PUBLIC_API XXH_cpu_mode_t XXH3_getCpuMode(void)
{
#ifdef XXH_MULTI_TARGET
    return cpu_mode;
#else
    return (XXH_cpu_mode_t) XXH_VECTOR;
#endif
}
Unless you manually reset it, cpuid is never called more than three times.
As for how it works, xxh3-target.c has the implementation of XXH3_hashLong, but the actual symbol name is defined by a macro, which will just be XXH3_hashLong on non-multitargeting code.
#ifdef XXH_MULTI_TARGET
/* The use of reserved identifiers is intentional; these are not to be used directly. */
# if XXH_VECTOR == XXH_AVX2
# define hashLong _XXH3_hashLong_AVX2
# elif XXH_VECTOR == XXH_SSE2
# define hashLong _XXH3_hashLong_SSE2
# else
# define hashLong _XXH3_hashLong_Scalar
# endif
#else
# define hashLong XXH3_hashLong
#endif
So yeah, I tried to make it as cheap as possible.
I do want to mention that t1ha0 uses a function pointer instead of a jump table. IDK which is better.
#if T1HA_USE_INDIRECT_FUNCTIONS

/* Use IFUNC (GNU ELF indirect functions) to choice implementation at runtime.
 * For more info please see
 * https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
 * and https://sourceware.org/glibc/wiki/GNU_IFUNC */
#if __has_attribute(ifunc)
uint64_t t1ha0(const void *data, size_t len, uint64_t seed)
    __attribute__((ifunc("t1ha0_resolve")));
#else
__asm("\t.globl\tt1ha0\n\t.type\tt1ha0, "
      "%gnu_indirect_function\n\t.set\tt1ha0,t1ha0_resolve");
#endif /* __has_attribute(ifunc) */

#elif __GNUC_PREREQ(4, 0) || __has_attribute(constructor)
uint64_t (*t1ha0_funcptr)(const void *, size_t, uint64_t);
static __cold void __attribute__((constructor)) t1ha0_init(void) {
  t1ha0_funcptr = t1ha0_resolve();
}

#else /* T1HA_USE_INDIRECT_FUNCTIONS */
static __cold uint64_t t1ha0_proxy(const void *data, size_t len,
                                   uint64_t seed) {
  t1ha0_funcptr = t1ha0_resolve();
  return t1ha0_funcptr(data, len, seed);
}
uint64_t (*t1ha0_funcptr)(const void *, size_t, uint64_t) = t1ha0_proxy;
#endif /* !T1HA_USE_INDIRECT_FUNCTIONS */
#endif /* T1HA0_RUNTIME_SELECT */
For proof, I added a logger in XXH_CPUID which showed every time it was called.
CPUID called!
CPUID called!
CPUID called!
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH3 mode: SSE2
XXH32 : 102400 -> 85294 it/s ( 8329.5 MB/s)
XXH32 unaligned : 102400 -> 49981 it/s ( 4880.9 MB/s)
XXH64 : 102400 -> 74702 it/s ( 7295.1 MB/s)
XXH64 unaligned : 102400 -> 72518 it/s ( 7081.8 MB/s)
XXH3_64bits : 102400 -> 173765 it/s (16969.3 MB/s)
XXH3_64b unaligned : 102400 -> 171794 it/s (16776.8 MB/s)
also, my processor randomly went into super-saiyan mode for XXH32. Probably a misread :thinking:
wait, this is happening every time…
HOLD THE PHONE. I did this when I was testing ARMv6t2:
if (((size_t)input & 3) == 0) {
    const U32* p32 = (const U32*) __builtin_assume_aligned(p, 4);
    do {
        v1 = XXH32_round(v1, *p32++);
        v2 = XXH32_round(v1, *p32++);
        v3 = XXH32_round(v1, *p32++);
        v4 = XXH32_round(v1, *p32++);
    } while ((const BYTE*)p32 < limit);
    p = (const BYTE*)p32;
} else {
    do {
        v1 = XXH32_round(v1, XXH_get32bits(p)); p+=4;
        v2 = XXH32_round(v2, XXH_get32bits(p)); p+=4;
        v3 = XXH32_round(v3, XXH_get32bits(p)); p+=4;
        v4 = XXH32_round(v4, XXH_get32bits(p)); p+=4;
    } while (p < limit);
}
wait, I stupid. v1 v1 v1 v1. noice one me.
v1 = XXH32_round(v1, *p32++);
v2 = XXH32_round(v1, *p32++);
v3 = XXH32_round(v1, *p32++);
v4 = XXH32_round(v1, *p32++);
nvm, my b.
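For the record, here is the difference that typo makes, as a self-contained sketch (XXH32_round and the primes are inlined here for illustration): lane 1 evolves identically either way, but lanes 2-4 diverge once each lane feeds its own accumulator.

```c
#include <assert.h>
#include <stdint.h>

#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U

static uint32_t XXH32_rotl(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

static uint32_t XXH32_round(uint32_t acc, uint32_t input)
{
    acc += input * PRIME32_2;
    acc  = XXH32_rotl(acc, 13);
    acc *= PRIME32_1;
    return acc;
}

/* Runs the aligned loop both ways over 8 words. Returns 1 when the bug is
 * observable: v1 matches either way, but at least one of lanes 2-4 differs. */
int xxh32_lane_bug_observable(void)
{
    static const uint32_t data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint32_t v1 = 1, v2 = 2, v3 = 3, v4 = 4;     /* corrected version */
    uint32_t b1 = 1, b2 = 2, b3 = 3, b4 = 4;     /* as posted above */
    const uint32_t *p = data, *limit = data + 8;
    do {
        v1 = XXH32_round(v1, *p++);   /* each lane feeds its own accumulator */
        v2 = XXH32_round(v2, *p++);
        v3 = XXH32_round(v3, *p++);
        v4 = XXH32_round(v4, *p++);
    } while (p < limit);
    p = data;
    do {
        b1 = XXH32_round(b1, *p++);   /* the typo: v1 fed into every round */
        b2 = XXH32_round(b1, *p++);
        b3 = XXH32_round(b1, *p++);
        b4 = XXH32_round(b1, *p++);
    } while (p < limit);
    return (v1 == b1) && !(v2 == b2 && v3 == b3 && v4 == b4);
}
```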
OK, I think the multi-target looks good. Of great importance: keeping it opt-in, through the XXH_MULTI_TARGET build macro. This point is particularly important. You would be surprised: package managers actually dislike things which are runtime-decided. They are not end-users. Some of them might actually prefer a clear, simple binary behavior, with fewer maintenance risks, even if it means forgoing performance on some targets.
One thing which is not too clear to me is whether this mechanism requires the existence of xxh3.h as a way to generate multiple targets. xxh3.h is a temporary file, which will disappear when the algorithm stabilizes: its code will be integrated into xxhash.c.
xxh3.h is not required; however, I did add xxh3-target.c, which is the file that is included or compiled three times.
This is the only clean way to do it. We could technically use macro hell, but that would be super ugly and confusing.
Basically, the build process looks like this when running make MULTI_TARGET=1 xxhsum:
cc -O3 -DXXH_MULTI_TARGET -c -o xxhsum.o xxhsum.c
cc -O3 -DXXH_MULTI_TARGET -c -o xxhash.o xxhash.c
cc -c -O3 -DXXH_MULTI_TARGET xxh3-target.c -mavx2 -o xxh3-avx2.o
cc -c -O3 -DXXH_MULTI_TARGET xxh3-target.c -msse2 -mno-sse3 -o xxh3-sse2.o
cc -c -O3 -DXXH_MULTI_TARGET xxh3-target.c -mno-sse2 -o xxh3-scalar.o
cc xxhsum.o xxhash.o xxh3-avx2.o xxh3-sse2.o xxh3-scalar.o -o xxhsum
Without it, it looks like this (xxh3-target.c is included like a .h file):
cc -O3 -c -o xxhsum.o xxhsum.c
cc -O3 -c -o xxhash.o xxhash.c
cc xxhsum.o xxhash.o -o xxhsum
Also, I think I like the function pointer better. It skips the jump table.
#ifdef XXH_MULTI_TARGET

/* Prototypes for our code */
#ifdef __cplusplus
extern "C" {
#endif
void _XXH3_hashLong_AVX2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_SSE2(U64* acc, const void* data, size_t len, const U32* key);
void _XXH3_hashLong_Scalar(U64* acc, const void* data, size_t len, const U32* key);
#ifdef __cplusplus
}
#endif

/* What hashLong version we decided on. cpuid is a SLOW instruction -- calling it takes anywhere
 * from 30-40 to THOUSANDS of cycles, so we really don't want to call it more than once. */
static XXH_cpu_mode_t cpu_mode = XXH_CPU_MODE_AUTO;

/* The best XXH3 version that is supported. This is used for verification in XXH3_forceCpuMode
 * to prevent a SIGILL. It can be turned off with -DXXH_NO_VERIFY_MULTI_TARGET, in which case the
 * selected hash will be used unconditionally. */
static XXH_cpu_mode_t supported_cpu_mode = XXH_CPU_MODE_AUTO;

/* We also store this as a function pointer, so we can just jump to it at runtime.
 * This matches the technique used by t1ha.
 * XXX: Are ifuncs better for ELF? */
static void (*XXH3_hashLong)(U64* acc, const void* data, size_t len, const U32* key);

/* Tests features for x86 targets and sets cpu_mode and the XXH3_hashLong function pointer
 * to the correct value.
 *
 * On GCC-compatible compilers, this will be run at program startup.
 *
 * xxh3-target.c will include this file. If we don't do this, the constructor will be called
 * multiple times. We don't want that. */
#if !defined(XXH3_TARGET_C) && defined(__GNUC__)
__attribute__((__constructor__))
#endif
static void XXH3_featureTest(void)
{
    int max, data[4];
    /* First, get how many CPUID function parameters there are by calling CPUID with eax = 0. */
    XXH_CPUID(data, /* eax */ 0);
    max = data[0];
    /* AVX2 is on the Extended Features page (eax = 7, ecx = 0), on bit 5 of ebx. */
    if (max >= 7) {
        XXH_CPUIDEX(data, /* eax */ 7, /* ecx */ 0);
        if (data[1] & (1 << 5)) {
            cpu_mode = supported_cpu_mode = XXH_CPU_MODE_AVX2;
            XXH3_hashLong = &_XXH3_hashLong_AVX2;
            return;
        }
    }
    /* SSE2 is on the Processor Info and Feature Bits page (eax = 1), on bit 26 of edx. */
    if (max >= 1) {
        XXH_CPUID(data, /* eax */ 1);
        if (data[3] & (1 << 26)) {
            cpu_mode = supported_cpu_mode = XXH_CPU_MODE_SSE2;
            XXH3_hashLong = &_XXH3_hashLong_SSE2;
            return;
        }
    }
    /* At this point, we fall back to scalar. */
    cpu_mode = supported_cpu_mode = XXH_CPU_MODE_SCALAR;
    XXH3_hashLong = &_XXH3_hashLong_Scalar;
}

/* Sets up the dispatcher and then calls the actual hash function. */
static void
XXH3_dispatcher(U64* restrict acc, const void* restrict data, size_t len, const U32* restrict key)
{
    /* We haven't checked CPUID yet, so we check it now. On GCC, we try to get this to run
     * at program startup to hide our very dirty secret from the benchmarks. */
    XXH3_featureTest();
    XXH3_hashLong(acc, data, len, key);
}

/* Default the function pointer to the dispatcher. */
static void (*XXH3_hashLong)(U64* acc, const void* data, size_t len, const U32* key) = &XXH3_dispatcher;
#else /* !XXH_MULTI_TARGET */
/* Include the C file directly and let the compiler decide which implementation to use. */
# include "xxh3-target.c"
#endif /* XXH_MULTI_TARGET */
/* Sets the XXH3_hashLong variant. When XXH_MULTI_TARGET is not defined, this
 * does nothing.
 *
 * Unless XXH_NO_VERIFY_MULTI_TARGET is defined, this will automatically fall back
 * to the next best XXH3 mode, so even if you set it to AVX2, the code will not
 * crash when run on, for example, a Core 2 Duo, which doesn't support AVX2. */
XXH_PUBLIC_API void XXH3_forceCpuMode(XXH_cpu_mode_t mode)
{
#ifdef XXH_MULTI_TARGET
/* Defining XXH_NO_VERIFY_MULTI_TARGET will allow you to set the CPU mode to
 * an unsupported mode. */
#ifndef XXH_NO_VERIFY_MULTI_TARGET
#  define TRY_SET_MODE(mode, funcptr) \
    if (supported_cpu_mode >= (mode)) { \
        cpu_mode = (mode); \
        XXH3_hashLong = &(funcptr); \
        return; \
    }
    if (supported_cpu_mode == XXH_CPU_MODE_AUTO)
        XXH3_featureTest();
#else
#  define TRY_SET_MODE(mode, funcptr) \
    cpu_mode = (mode); \
    XXH3_hashLong = &(funcptr); \
    return;
#endif
    switch (mode) {
    case XXH_CPU_MODE_AVX2:
        TRY_SET_MODE(XXH_CPU_MODE_AVX2, _XXH3_hashLong_AVX2);
        /* FALLTHROUGH */
    case XXH_CPU_MODE_SSE2:
        TRY_SET_MODE(XXH_CPU_MODE_SSE2, _XXH3_hashLong_SSE2);
        /* FALLTHROUGH */
    case XXH_CPU_MODE_SCALAR:
        cpu_mode = XXH_CPU_MODE_SCALAR;
        XXH3_hashLong = &_XXH3_hashLong_Scalar;
        return;
    case XXH_CPU_MODE_NEON: /* ignored */
    case XXH_CPU_MODE_AUTO:
    default:
        cpu_mode = XXH_CPU_MODE_AUTO;
        XXH3_hashLong = &XXH3_dispatcher;
        return;
    }
#undef TRY_SET_MODE
#endif
}

/* Returns which XXH3 mode we are using. */
XXH_PUBLIC_API XXH_cpu_mode_t XXH3_getCpuMode(void)
{
#ifdef XXH_MULTI_TARGET
    return cpu_mode;
#else
    return (XXH_cpu_mode_t) XXH_VECTOR;
#endif
}
/* Returns which XXH3 mode we are using. */
XXH_PUBLIC_API XXH_cpu_mode_t XXH3_getCpuMode(void)
{
#ifdef XXH_MULTI_TARGET
return cpu_mode;
#else
return (XXH_cpu_mode_t) XXH_VECTOR;
#endif
}
A hidden requirement on this project is to keep it a 2-file library if possible (xxhash.c and xxhash.h).
I think this is possible. Last year, I made a similar patch for zstd.
The idea was:
- an _internal function, which carries the code to be specialized. It's never called directly, and must be inlined.
- using __attribute__ to specialize for a target, and calling the _internal function, which is now inlined, hence implemented, several times.
In xxh3's case, I believe it's even simpler: since the SSE2, AVX2 and NEON implementations are already explicit, this 2-stage approach isn't even necessary. One can go straight to specialized functions. Just split accumulate512 and scramble into their respective variants. Exclude the unwanted ones from compilation. Then call the wanted one.
I'm fine with the function pointer approach.
I did that for my XXH64 SSE2 implementation; I used __attribute__((__target__)).
However, it is not compatible with MSVC. MSVC is (unfortunately?) very popular among Windows users. The only way to do it for MSVC (unless I am mistaken) is to compile the file multiple times.
__attribute__((__target__)) also messed with conditionally enabled functions on older GCC/Clang versions, IIRC.
The choices are:
Side note, unrelated to XXH3: I was messing with inline assembly for XXH64 and came up with this (pop it into XXH64_endian_align):
U64 inp1, inp2, inp3, inp4;
do {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(
        "movq (%[p]), %[inp1]\n"
        "movq 8(%[p]), %[inp2]\n"
        "movq 16(%[p]), %[inp3]\n"
        "movq 24(%[p]), %[inp4]\n"
        "imulq %[prime2], %[inp1]\n"
        "imulq %[prime2], %[inp2]\n"
        "imulq %[prime2], %[inp3]\n"
        "imulq %[prime2], %[inp4]\n"
        "addq %[inp1], %[v1]\n"
        "addq %[inp2], %[v2]\n"
        "addq %[inp3], %[v3]\n"
        "addq %[inp4], %[v4]\n"
#if defined(__BMI2__)
        "rorxq $33, %[v1], %[v1]\n"
        "rorxq $33, %[v2], %[v2]\n"
        "rorxq $33, %[v3], %[v3]\n"
        "rorxq $33, %[v4], %[v4]\n"
#elif defined(__AVX__)
        "shldq $31, %[v1], %[v1]\n"
        "shldq $31, %[v2], %[v2]\n"
        "shldq $31, %[v3], %[v3]\n"
        "shldq $31, %[v4], %[v4]\n"
#else
        "rolq $31, %[v1]\n"
        "rolq $31, %[v2]\n"
        "rolq $31, %[v3]\n"
        "rolq $31, %[v4]\n"
#endif
        "imulq %[prime1], %[v1]\n"
        "imulq %[prime1], %[v2]\n"
        "leaq 32(%[p]), %[p]\n"
        "imulq %[prime1], %[v3]\n"
        "imulq %[prime1], %[v4]\n"
        : [p] "+r" (p),
          [inp1] "=&r" (inp1), [inp2] "=&r" (inp2), [inp3] "=&r" (inp3), [inp4] "=&r" (inp4),
          [v1] "+r" (v1), [v2] "+r" (v2), [v3] "+r" (v3), [v4] "+r" (v4)
        : [prime1] "r" (PRIME64_1), [prime2] "r" (PRIME64_2));
#else
    v1 = XXH64_round(v1, XXH_get64bits(p)); p+=8;
    v2 = XXH64_round(v2, XXH_get64bits(p)); p+=8;
    v3 = XXH64_round(v3, XXH_get64bits(p)); p+=8;
    v4 = XXH64_round(v4, XXH_get64bits(p)); p+=8;
#endif
} while (p<=limit);
Just curious, what do you get with -march=native on your chip (I'm assuming it has BMI2).
Toying with VSX on a ppc64le IBM POWER9 9006-22P machine on the GCC farm. (Which only has vim on it, my least favorite editor :rage:)
Naive VSX implementation I adapted from the terrible documentation and HighwayHash:
#include <altivec.h>

typedef __vector unsigned long long U64x2;
typedef __vector unsigned U32x4;

XXH_FORCE_INLINE U64x2 XXH_multEven(U32x4 a, U32x4 b) { // NOLINT
    U64x2 result; // NOLINT
#ifdef __LITTLE_ENDIAN__
    __asm__("vmulouw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#else
    __asm__("vmuleuw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#endif
    return result;
}

XXH_FORCE_INLINE U64x2 XXH_multOdd(U32x4 a, U32x4 b) { // NOLINT
    U64x2 result; // NOLINT
#ifdef __LITTLE_ENDIAN__
    __asm__("vmuleuw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#else
    __asm__("vmulouw %0, %1, %2" : "=v"(result) : "v"(a), "v"(b));
#endif
    return result;
}

XXH_FORCE_INLINE void
XXH3_accumulate_512(void *restrict acc, const void *restrict data, const void *restrict key)
{
    U64x2 *const xacc = (U64x2*)acc;
    U64x2 const *const xdata = (U64x2 const*)data;
    U64x2 const *const xkey = (U64x2 const*)key;
    U64x2 const thirtytwo = { 32, 32 };
    size_t i;
    for (i = 0; i < STRIPE_LEN / sizeof(U64x2); i++) {
        U64x2 data_vec = vec_vsx_ld(0, xdata + i);
        U64x2 key_vec = vec_vsx_ld(0, xkey + i);
        U64x2 data_key = data_vec ^ key_vec;
        U32x4 shuffled = (U32x4)vec_rl(data_key, thirtytwo);
        U32x4 data_key32 = (U32x4)data_key;
        U64x2 product = XXH_multEven(data_key32, shuffled);
        xacc[i] += product;
        xacc[i] += data_vec;
    }
}

XXH_FORCE_INLINE void
XXH3_scrambleAcc(void* restrict acc, const void* restrict key)
{
    U64x2 *const xacc = (U64x2*)acc;
    const U64x2 *const xkey = (const U64x2*)key;
    U32x4 const prime1 = { PRIME32_1, PRIME32_1, PRIME32_1, PRIME32_1 };
    size_t i;
    for (i = 0; i < STRIPE_LEN / sizeof(U64x2); i++) {
        U64x2 const acc_vec = xacc[i];
        U64x2 const data_vec = acc_vec ^ (acc_vec >> 47);
        U64x2 const key_vec = vec_vsx_ld(0, xkey + i);
        U64x2 const data_key = data_vec ^ key_vec;
        U64x2 const prod_lo = XXH_multEven((U32x4)data_key, prime1);
        U64x2 const prod_hi = XXH_multOdd((U32x4)data_key, prime1);
        xacc[i] = prod_lo + (prod_hi << 32);
    }
}
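As a scalar sanity model of the multiply trick in the accumulate step (my own naming, my reading of the vector code): per 64-bit lane, the rotate-by-32 followed by the even/odd widening multiply amounts to multiplying the low and high 32-bit halves of data_key. The two formulations below provably agree.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit lane of the accumulate step, via the rotate trick. */
uint64_t accum_lane_via_rotate(uint64_t data, uint64_t key, uint64_t acc)
{
    uint64_t data_key = data ^ key;
    /* vec_rl(data_key, 32): swap the 32-bit halves of the lane */
    uint64_t shuffled = (data_key << 32) | (data_key >> 32);
    /* vmulouw-style widening multiply of the low 32-bit elements */
    uint64_t product  = (uint64_t)(uint32_t)data_key * (uint32_t)shuffled;
    return acc + product + data;
}

/* The direct scalar formulation: low half times high half. */
uint64_t accum_lane_direct(uint64_t data, uint64_t key, uint64_t acc)
{
    uint64_t data_key = data ^ key;
    return acc + (data_key & 0xFFFFFFFFULL) * (data_key >> 32) + data;
}
```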
VSX mode:
./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 47231 it/s ( 4612.4 MB/s)
XXH32 unaligned : 102400 -> 51200 it/s ( 5000.0 MB/s)
XXH64 : 102400 -> 153600 it/s (15000.0 MB/s)
XXH64 unaligned : 102400 -> 153600 it/s (15000.0 MB/s)
XXH3_64bits : 102400 -> 61440 it/s ( 6000.0 MB/s)
XXH3_64b unaligned : 102400 -> 51200 it/s ( 5000.0 MB/s)
Scalar code:
./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 46759 it/s ( 4566.3 MB/s)
XXH32 unaligned : 102400 -> 51200 it/s ( 5000.0 MB/s)
XXH64 : 102400 -> 153600 it/s (15000.0 MB/s)
XXH64 unaligned : 102400 -> 111306 it/s (10869.7 MB/s)
XXH3_64bits : 102400 -> 102400 it/s (10000.0 MB/s)
XXH3_64b unaligned : 102400 -> 76800 it/s ( 7500.0 MB/s)
However, since it only has GCC 4.8.5, the code output is apparently terrible. I generated some assembly with Clang and linked it in, and I got something much nicer.
./xxhsum 0.7.0 (64-bits ppc64 little endian), GCC 4.8.5 20150623 (Red Hat 4.8.5-36), by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 47211 it/s ( 4610.5 MB/s)
XXH32 unaligned : 102400 -> 51200 it/s ( 5000.0 MB/s)
XXH64 : 102400 -> 153600 it/s (15000.0 MB/s)
XXH64 unaligned : 102400 -> 153600 it/s (15000.0 MB/s)
XXH3_64bits : 102400 -> 307200 it/s (30000.0 MB/s)
XXH3_64b unaligned : 102400 -> 307200 it/s (30000.0 MB/s)
Edit: Yeah, GCC is puking assembly. It doesn't help that the GCC version is so old that it doesn't fully support the chip it is running on.
$ gcc -mcpu=power9
gcc: error: unrecognized argument in option ‘-mcpu=power9’
gcc: note: valid arguments to ‘-mcpu=’ are: 401 403 405 405fp 440 440fp 464 464fp 476 476fp 505 601 602 603 603e 604 604e 620 630 740 7400 7450 750 801 821 823 8540 8548 860 970 G3 G4 G5 a2 cell e300c2 e300c3 e500mc e500mc64 e5500 e6500 ec603e native power3 power4 power5 power5+ power6 power6x power7 power8 powerpc powerpc64 powerpc64le rs64 titan
gcc: fatal error: no input files
compilation terminated.
Indeed, this is a night-and-day difference.
Tried downloading a Clang release tarball:
./clang: /lib64/ld64.so.2: version `GLIBC_2.22' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./clang)
./clang: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./clang)
Damn it. Worth a shot.
Eyy, I got Clang 8 to work.
I had to first compile GCC from source to get libstdc++, then I had to compile glibc from source, which needs Python 3 and a newer make, and then I had to patch Clang to use the correct ld.so and set a wrapper.
Worth it. Time to dump all my changes because it is only going to get more complicated.
But yay, we have VSX support! The server and supercomputer users who still use POWER will be happy.
OK. Scalar, NEON, VSX, and SSE2 all fail test 52 (when I remove the first #if 0) with the same value, so I am not too concerned about it.
./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Error: 64-bit hash test 52: Internal sanity check failed!
Got 0x082520CD9A539D2AULL, expected 0x802EB54C97564FD7ULL.
Note: If you modified the hash functions, make sure to either update the values
or temporarily comment out the tests in BMK_sanityCheck.
The PR is open now.
Dispatching, better intrinsic documentation, and VSX support are added.
Oof. I managed to get GCC to emit some decent multiply code…
mult_hd_llvm:
    push ebp
    push edi
    xor edi, edi
    push esi
    push ebx
    sub esp, 12
    mov ecx, dword ptr [esp + 32]
    mov ebx, dword ptr [esp + 40]
    mov ebp, dword ptr [esp + 36]
    mov eax, ecx
    mul ebx
    mov dword ptr [esp], eax
    mov eax, ebx
    mov esi, edx
    mov dword ptr [esp + 4], edx
    mul ebp
    add esi, eax
    mov eax, ecx
    adc edi, edx
    mul dword ptr [esp + 44]
    mov ecx, eax
    mov ebx, edx
    xor edx, edx
    add ecx, esi
    mov eax, ebp
    mov esi, edi
    adc ebx, edx
    xor edi, edi
    mul dword ptr [esp + 44]
    add eax, esi
    adc edx, edi
    xor ebp, ebp
    add eax, ebx
    mov ebx, dword ptr [esp]
    adc edx, ebp
    mov edi, eax
    add esp, 12
    xor edx, ecx
    mov eax, ebx
    pop ebx
    xor eax, edi
    pop esi
    pop edi
    pop ebp
    ret
only by translating LLVM's output…
define i64 @mult_hd(i64, i64) local_unnamed_addr #0 {
  %3 = lshr i64 %0, 32
  %4 = lshr i64 %1, 32
  %5 = and i64 %0, 4294967295
  %6 = and i64 %1, 4294967295
  %7 = mul nuw i64 %6, %5
  %8 = mul nuw i64 %6, %3
  %9 = lshr i64 %7, 32
  %10 = add i64 %9, %8
  %11 = mul nuw i64 %4, %5
  %12 = and i64 %10, 4294967295
  %13 = add i64 %12, %11
  %14 = mul nuw i64 %4, %3
  %15 = lshr i64 %13, 32
  %16 = lshr i64 %10, 32
  %17 = add i64 %16, %14
  %18 = add i64 %17, %15
  %19 = and i64 %7, 4294967295
  %20 = shl i64 %13, 32
  %21 = or i64 %20, %19
  %22 = xor i64 %18, %21
  ret i64 %22
}
…directly to C.
__attribute__((__noinline__, __target__("no-sse2")))
uint64_t mult_hd_clang(uint64_t const p0, uint64_t const p1)
{
    uint64_t p3 = p0 & 0xFFFFFFFF;
    uint64_t p4 = p1 & 0xFFFFFFFF;
    uint64_t p5 = p4 * p3;
    uint64_t p6 = p5 >> 32;
    uint64_t p7 = p0 >> 32;
    uint64_t p8 = p4 * p7;
    uint64_t p9 = p6 + p8;
    uint64_t p10 = p9 >> 32;
    uint64_t p11 = p1 >> 32;
    uint64_t p12 = p11 * p3;
    uint64_t p13 = p9 & 0xFFFFFFFF;
    uint64_t p14 = p13 + p12;
    uint64_t p15 = p14 >> 32;
    uint64_t p16 = p11 * p7;
    uint64_t p17 = p10 + p16;
    uint64_t p18 = p17 + p15;
    uint64_t p19 = p14 << 32;
    uint64_t p20 = p5 & 0xFFFFFFFF;
    uint64_t p21 = p19 | p20;
    uint64_t p22 = p18 ^ p21;
    return p22;
}
One instruction per line. The ELI5 for compilers.
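Condensed back into portable C (my own condensation, keeping the same partial-product structure), the routine is an XOR-fold of the full 128-bit product, and it can be cross-checked against a compiler-provided 128-bit multiply, assuming unsigned __int128 is available (64-bit GCC/Clang):

```c
#include <assert.h>
#include <stdint.h>

/* Same computation as mult_hd_clang above: XOR-fold of the full 128-bit
 * product of a and b, built from four 32-bit partial products. */
uint64_t mult_fold_portable(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFF, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFF, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo;
    uint64_t hl = a_hi * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hh = a_hi * b_hi;
    uint64_t mid  = (ll >> 32) + hl;            /* carry into the middle */
    uint64_t mid2 = (mid & 0xFFFFFFFF) + lh;
    uint64_t hi = hh + (mid >> 32) + (mid2 >> 32);
    uint64_t lo = (mid2 << 32) | (ll & 0xFFFFFFFF);
    return hi ^ lo;
}

/* Reference: same fold via a native 128-bit multiply. */
uint64_t mult_fold_ref(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)p ^ (uint64_t)(p >> 64);
}
```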
diff --git a/README.md b/README.md
index 96ecfec..ed4f4ec 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Code is highly portable, and hashes are identical on all platforms (little / big
|master | [![Build Status](https://travis-ci.org/Cyan4973/xxHash.svg?branch=master)](https://travis-ci.org/Cyan4973/xxHash?branch=master) |
|dev | [![Build Status](https://travis-ci.org/Cyan4973/xxHash.svg?branch=dev)](https://travis-ci.org/Cyan4973/xxHash?branch=dev) |
-
+Compile this with Clang. GCC is a highly overrated compiler that can't generate fast code unless you write it like assembly or *in* assembly.
Benchmarks
-------------------------
smh
@Cyan4973 Would this be a proper state struct?
typedef struct {
    U64 acc[8];           /* ACC_NB */
    BYTE input[1024];     /* block_len */
    U64 seed;
    U64 length;
    U16 input_size;
    BYTE seen_large_len;  /* >= 1024 */
    BYTE seen_medium_len; /* >= 16 */
    U32 reserved;         /* take advantage of padding memes */
} XXH3_state_t;
A hefty struct compared to the 48/88 byte XXH32 and XXH64 states.
The XXH3 state struct will have to be larger than XXH32's and XXH64's.
But I believe it will only need to preserve 128 bytes of input.
With the accumulator and other information, it will need 200+ bytes.
So you are basically saying we should have it start halfway in XXH3_accumulate?
So something like this:
typedef struct {
    U64 acc[8];           /* 0-64 */
    BYTE input[128];      /* 64-192 */
    U64 seed;             /* 192-200 */
    U64 total_len;        /* 200-208 */
    BYTE input_size;      /* 208-209 */
    BYTE round_num;       /* 209-210 which accumulate step we are on */
    BYTE seen_large_len;  /* 210-211 */
    BYTE seen_medium_len; /* 211-212 */
    U32 reserved;         /* 212-216 */
} XXH3_state_t;
This weighs in at a much more reasonable 216 bytes (instead of 1112), although if we use larger types instead of packing it like I did, it will be a little larger: with U32 fields it would be 224 (minus the reserved field), and with U64 fields, 240. However, since we basically access these once per update, I don't see a reason not to pack them. What do you think?
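The offset comments can be verified mechanically; a quick sketch (assuming a typical 64-bit ABI with natural alignment, where the struct comes out to exactly 216 bytes):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t U64;
typedef uint32_t U32;
typedef uint8_t  BYTE;

typedef struct {
    U64  acc[8];          /* 0-64 */
    BYTE input[128];      /* 64-192 */
    U64  seed;            /* 192-200 */
    U64  total_len;       /* 200-208 */
    BYTE input_size;      /* 208-209 */
    BYTE round_num;       /* 209-210 */
    BYTE seen_large_len;  /* 210-211 */
    BYTE seen_medium_len; /* 211-212 */
    U32  reserved;        /* 212-216 */
} XXH3_state_t;

/* Accessors so the layout claims are checkable at runtime. */
size_t xxh3_state_size(void)      { return sizeof(XXH3_state_t); }
size_t xxh3_seed_offset(void)     { return offsetof(XXH3_state_t, seed); }
size_t xxh3_reserved_offset(void) { return offsetof(XXH3_state_t, reserved); }
```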
Yes, something like that. Also, it's not useful to try to save a few bytes. Actually, it's better to keep a little margin, in order to preserve the ability to introduce new features without breaking ABI (aka, the structure size, once published, should never change).
Maybe add a version number, so you can add more elements later and old software can at least detect new structures. And maybe keep it a multiple of 32 (or even 64) for cache optimization.
Adding a version number introduces new initialization problems. I hope we will be able to avoid them. I have only a little experience with it, and unfortunately, so far, I have not found that method to be protective enough, while it directly impacts API design.
Keeping the size a multiple of 32/64 is reasonable.
We may have issues, though, with variable-size fields, such as pointers or size_t.
The question is why worry about it at all. The state is something local; it is not exported from one system to another, so it only needs to be reasonable for the machine it is used on. Therefore, yes, no version is needed. But aligning the elements to benefit from caches might give the occasional µs of performance gain. I think it is already well aligned, though.
#define XXH3_ACCUMULATE_512_X86(vec_t, _mm_, _si) /* generic macro for sse2 and avx2 */ \
do { \
assert((((size_t)acc) & (sizeof(vec_t) - 1)) == 0); \
{ \
ALIGN(sizeof(vec_t)) vec_t* const xacc = (vec_t *) acc; \
const vec_t* const xdata = (const vec_t *) data; \
const vec_t* const xkey = (const vec_t *) key; \
\
size_t i; \
for (i = 0; i < STRIPE_LEN / sizeof(vec_t); i++) { \
/* data_vec = xdata[i]; */ \
vec_t const data_vec = _mm_##loadu##_si (xdata + i); \
/* key_vec = xkey[i]; */ \
vec_t const key_vec = _mm_##loadu##_si (xkey + i); \
/* data_key = data_vec ^ key_vec; */ \
vec_t const data_key = _mm_##xor##_si (data_vec, key_vec); \
/* shuffled = data_key[1, undef, 3, undef]; // or data_key >> 32; */ \
vec_t const shuffled = _mm_##shuffle_epi32 (data_key, 0x31); \
/* product = (shuffled & 0xFFFFFFFF) * (data_key & 0xFFFFFFFF); */ \
vec_t const product = _mm_##mul_epu32 (shuffled, data_key); \
\
/* xacc[i] += data_vec; */ \
xacc[i] = _mm_##add_epi64 (xacc[i], data_vec); \
/* xacc[i] += product; */ \
xacc[i] = _mm_##add_epi64 (xacc[i], product); \
} \
} \
} while (0)
#define XXH3_SCRAMBLE_ACC_X86(vec_t, _mm_, _si) /* generic macro for sse2 and avx2 */ \
do { \
assert((((size_t)acc) & (sizeof(vec_t) - 1)) == 0); \
{ \
ALIGN(sizeof(vec_t)) vec_t* const xacc = (vec_t *) acc; \
const vec_t* const xkey = (const vec_t *) key; \
const vec_t prime = _mm_##set1_epi32 ((int) PRIME32_1); \
size_t i; \
for (i = 0; i < STRIPE_LEN / sizeof(vec_t); i++) { \
/* data_vec = xacc[i] ^ (xacc[i] >> 47); */ \
vec_t const acc_vec = xacc[i]; \
vec_t const shifted = _mm_##srli_epi64 (acc_vec, 47); \
vec_t const data_vec = _mm_##xor##_si (acc_vec, shifted); \
\
/* key_vec = xkey[i]; */ \
vec_t const key_vec = _mm_##loadu##_si (xkey + i); \
/* data_key = data_vec ^ key_vec; */ \
vec_t const data_key = _mm_##xor##_si (data_vec, key_vec); \
/* shuffled = data_key[1, undef, 3, undef]; // data_key >> 32; */ \
vec_t const shuffled = _mm_##shuffle_epi32 (data_key, 0x31); \
\
/* data_key *= PRIME32_1; // 32-bit * 64-bit */ \
\
/* prod_hi = data_key >> 32 * PRIME32_1; */ \
vec_t const prod_hi = _mm_##mul_epu32 (shuffled, prime); \
/* prod_hi_top = prod_hi << 32; */ \
vec_t const prod_hi_top = _mm_##slli_epi64 (prod_hi, 32); \
/* prod_lo = (data_key & 0xFFFFFFFF) * PRIME32_1; */ \
vec_t const prod_lo = _mm_##mul_epu32 (data_key, prime); \
/* xacc[i] = prod_hi_top + prod_lo; */ \
xacc[i] = _mm_##add_epi64 (prod_hi_top, prod_lo); \
} \
} \
} while (0)
@Cyan4973 like this? If we are going to do the GCC dispatch, this is better for it.
__attribute__((__target__("avx2")))
void XXH3_accumulate_512_AVX2(void *restrict acc, const void *restrict key, const void *restrict data)
{
    XXH3_ACCUMULATE_512_X86(__m256i, _mm256_, _si256);
}

__attribute__((__target__("avx2")))
void XXH3_scrambleAcc_AVX2(void *restrict acc, const void *restrict key)
{
    XXH3_SCRAMBLE_ACC_X86(__m256i, _mm256_, _si256);
}

__attribute__((__target__("sse2")))
void XXH3_accumulate_512_SSE2(void *restrict acc, const void *restrict key, const void *restrict data)
{
    XXH3_ACCUMULATE_512_X86(__m128i, _mm_, _si128);
}

__attribute__((__target__("sse2")))
void XXH3_scrambleAcc_SSE2(void *restrict acc, const void *restrict key)
{
    XXH3_SCRAMBLE_ACC_X86(__m128i, _mm_, _si128);
}
It's true that the AVX2 and SSE2 code paths are very similar, featuring only some predictable name changes. And I also know your suggested construction works.
But I'm kind of cautious about it. Defining large functions through macros can make debugging more challenging, and even impair readability.
I suspect the main expected benefit is that any change is automatically reflected in both variants. However:
All in all, I believe that readability is slightly negatively impacted, while writability benefits slightly.
Over the long term, it's generally better for source code to be "reader oriented". So I think I prefer for both variants to be written clearly.
Ok. Gotcha.
Also, do you want the PPC64 code next?
Oh yes !
Open discussion :
In branch xxh128, I'm testing a new formula for the 128-bit variant of XXH3.
The core concept is that each input influences 2 accumulators, hence 128 bits of state.
There are still 8 lanes, so the total internal state is widened to 1024 bits.
This takes more memory, and makes initialization and termination more complex.
The design change was prompted by the claim that, in the initial XXH3 proposal, since each lane is 64-bit, the combination of 8x 64-bit lanes is still a 64-bit hash.
I'm currently wondering if this statement is correct.
If it was about a cryptographic hash, it would be correct, indeed. Knowing the internal design of the hash algorithm, an attacker could concentrate their modifications into a single lane, and therefore only have to search through 64 bits, instead of 128, in order to produce a collision.
But for a non-cryptographic hash, this is not a scenario it is supposed to defend against. As a checksum, xxHash's mission is merely to provide a 128-bit random-like value, and to guarantee that any accidental change to the source has only a 1 in 2^128 chance of generating the same hash. Almost as good as "none".
So one problem here is to define "accidental". It obviously includes storage and transmission errors, such as truncation, block nullifying, noisy segment, etc. But it can also be something as light as a trivial field update in a larger file, such as a date in a movie's metadata.
The main claim is that a 128-bit hash using XXH3's initial design effectively degenerates into a 64-bit one if all the changes happen in the same lane. A lane is a series of 8-byte fields, one every 64 bytes. As soon as a change impacts 2 lanes, it impacts 128 bits of internal state, and the claim no longer holds.
Therefore, how probable is it that an accidental change only impacts a single lane ?
Well, that's a hard one, especially as changes and errors are rarely "randomly scattered". In general, multiple consecutive bytes are impacted. And as soon as this range of modified bytes is larger than 8 bytes, it impacts 2+ lanes. And if it's shorter than 8 bytes, then it's not enough to generate a collision.
Bottom line : I'm okay with the fact that a bunch of 64-bit checksums combined do not make a 128-bit cryptographic checksum. But I'm unsure the statement carries over unchanged to non-cryptographic checksums. It would require that a change impacts only one of those smaller checksums. The fact that the lanes are interleaved makes it more difficult for an accidental change not to impact 2+ lanes. Such an event must have some probability of appearance. It's clearly not 100%, but how much is it ? Should it be < 1 in 2^64, I wonder whether that would qualify the 8x64-bit design as "good enough" for a 128-bit checksum.
In branch wsecret, I'm slightly modifying the API to make it "secret first".
It was always the plan to make the algorithm able to consume any external source of secret. Since the secret is involved at every step of the calculation, it makes it more difficult to generate a collision without knowledge of the secret.
With this design in place, the seed is now just a convenient short handle, able to generate a custom secret based on the combination of the default secret and some transform dependent on the seed. It's less powerful, but still does a decent job of hardening the hash generator.
As a consequence, the result of the hash for len==0 must be updated, as it used to be the seed value, but :
1) The seed is no longer central. The expectation is that users interested in security will rather provide their own secret, which is much richer than 8 bytes.
2) If one makes the hash of the empty string a value depending on the secret, it cannot be the seed, since the simulated "seed" for custom secrets is 0. Different secrets would therefore all return the same value 0 anyway.
3) Making the seed the return value when len==0 makes it easy for any external observer to recover the seed, thus simplifying the next step (generating collisions).
As a consequence, I've simplified this part : the hash of the empty string is now always 0, whatever the seed or the custom secret.
I suspect it's a very low priority topic, but if someone believes that the hash of the empty string should change when the secret changes, due to some scenario depending on it, this is still a design decision that can be reviewed during this design phase.
In branch wsecret, the "secret" is now defined as "any blob of bytes" respecting a minimum size.
This is a small but important change compared to its previous definition as a table of 32-bit values. It removes any compatibility complexity with regard to platform endianness. It's also supposed to remove any alignment restriction, so the "custom secret" can really be anything.
As a consequence, the secret must be consumed in a way which is endian-independent.
This is achieved by XXH3_readKey64(), which now effectively delegates to XXH_readLE64(), which is both endian- and alignment-independent.
But this read function is bypassed when using the vector instruction set.
For SSE2 and AVX2, no problem : the code uses _mm_loadu_si128() or _mm256_loadu_si256(), which have no alignment restriction, so it's fine.
For NEON and VSX, that's less clear to me, because I'm not fluent in these instruction sets and have difficulty finding useful information on the Internet (Google search brings up a ton of barely relevant pages). @easyaspi314 , I'll need your help on these. I suspect it's fine, as vld1q_u32() (NEON) and vec_vsx_ld() (VSX) are used for both key and data, and there was already no guarantee of alignment on data, but I wouldn't mind a confirmation.
If this can be confirmed, I could change the code comment in xxhash.h, which currently requires a specific alignment, just out of an abundance of caution. But if it's not necessary, it's better to remove it, as unrestricted alignment offers great flexibility for custom secrets.
A later, more convoluted question is whether it would be beneficial to exploit the fact that the default secret kKey can be made aligned, in order to improve performance when using it, or when the custom secret can be detected as aligned. On x64 with SSE2 or AVX2, the answer is clearly "no", so there's no need to add complexity. I would expect a similar answer for arm64 or ppc64, but I'm less sure about arm32 or mips32 for example.
No need to look into this now ; this is just a speed refinement, for a later stage.
Correct, vld1q and vec_vsx_ld are the things you need.
Sorry about the inactivity, I am really busy. Once I get this Calculus done I can write it for you if you need, but it shouldn't be too complicated. Just note that
vec_rl(vec, v32);
is
vec = (vec << 32) | (vec >> 32);
which I used because the keys would be in reverse order on big endian. And vec_revb is your average byte swap.
As for x86, there is no harm in it being aligned: Penryn and earlier strongly prefer movdqa over movdqu. But the check is only beneficial on those chips; on newer ones, the overhead of the branch outweighs the removal of the alignment penalty on aligned reads, so it's not worth it.
ARM32 NEON only has a single-cycle penalty; it isn't worth it either. However, it will alignment-fault if you use vld1q_u64 on GCC (it compiles to vld1.64 q0, [r0:128], which requires r0 to be 16-byte aligned).
VSX has no alignment penalty AFAIK.
Excellent, thanks for the detailed answer @easyaspi314 ! And wish you the best for your end-of-year evaluations !
Btw @easyaspi314 , I added VSX code path verification on Travis CI, which was previously testing only the "scalar" code path on PowerPC, but unfortunately, compilation fails :
In file included from xxhash.c:1021:0:
xxh3.h: In function ‘XXH3_accumulate_512’:
xxh3.h:500:9: warning: implicit declaration of function ‘vec_revb’ [-Wimplicit-function-declaration]
U64x2 const data_vec = vec_revb(vec_vsx_ld(0, xdata + i));
^
xxh3.h:500:9: error: AltiVec argument passed to unprototyped function
It seems it doesn't know vec_revb() ?
This documentation seems to imply that vec_revb() availability is tied to the compilation flag -mcpu=power9. I tried it on Ubuntu Trusty, which is the environment used for PowerPC tests on Travis CI, but it doesn't work : this compiler is limited to power8 at best.
This patch seems to imply that vec_revb() was later back-ported to power8, so it could work on that cpu too. Unfortunately, it's probably too late for the compiler version in Ubuntu Trusty : I tried it on Travis CI, and it still fails.
This makes me think that automated detection of VSX has quite a number of conditions to check to make sure compilation will be successful.
edit : and indeed, if I don't force the VSX mode and let the detection macro do the work, it happily triggers the VSX code path with -m64 -maltivec -mvsx, and then fails due to the absence of vec_revb().
edit 2 : I will disable VSX tests on Travis CI for the time being. We still need PowerPC tests for big-endian compatibility.
Indeed. This was intended for POWER9, I don't know if it would work on POWER8.
I found that compiler support was rather lackluster, especially on the GCC side. Only GCC 7 or so supports it, and there don't seem to be a lot of macros to discern the versions. I will have to dig some more.
I know that Google used some inline assembly to work around the terrible compiler support and the inconsistent multiply intrinsics.
Granted, anyone who is using PowerPC64 nowadays is not likely to be the average user who doesn't know the difference between 32-bit and 64-bit or doesn't know what AVX2 is. But I still want to get things working properly.
With the release of the streaming API for XXH3_64bits(), there's only a very limited number of actions remaining before moving on to XXH128().
1) Following @aras-p's investigation, it seems preferable to extend the "short mode" to longer sizes, in order to reduce the performance cliff related to "long mode" initialization. This has direct consequences on the streaming state, which is why it needed to be done in this order. Also, preserving hash quality is non-trivial, but I believe I've got a solution for that too, with the earlier move to "secret first" helping this objective.
2) Wasm performance is still an issue, and it seems that "unrolling" is actually bad for this target. But in order to fix this, I must be able to observe it first. I made a little progress this morning, as emscripten compilation works again on my Mac. But compilation is just a first step : I still need to find a way to actually run a wasm test program and make measurements.
3) Lastly, I'm currently wondering about the introduction of a new property : it seems possible to make XXH3 essentially bijective for len==8, and injective for any len < 8. It basically means that, for 2 inputs of the same len <= 8, there would be a guarantee of no collision if the inputs are different (instead of the usual 1 / 2^64 probability). I'm wondering if this would be useful. The drawback is that, in order to keep first-class avalanche effect on any subrange of bits of the output, I need to add one multiplication, making the hash a little bit slower (for len <= 8). Neither the property nor the impact seems large, so it's more a matter of preference : which one matters more ?
I'll have to think it through more, but my main use case for xxhash is as a hash code for hash table lookups (where the big advantage of a high quality 64bit+ hash code is that you can get away with skipping key equality comparisons in far more scenarios than you could with something like an FNV variant), and I'm pretty sure in at least some use cases I could leverage an explicit guarantee of no collisions for keys <= 8 bytes for far more savings than the cost of a single multiply. So currently my vote would be to make it bijective at the cost of a multiply.
I am curious how this would work with a seed-independent zero hash for zero-length input, though (which I also think is something I could leverage for an optimization on lookups); can you simultaneously guarantee both no collisions len <= 8 and seed independent zero hash for zero length?
I still need to find a way to actually run a wasm test program and make measurements
Hmm, for me running stuff on Wasm was super simple, basically just:
- emcc just like gcc or clang, and it will produce a "command line" executable
- --preload-file path/to/file to be able to do fopen/fread of test data files if you need them
- run a local web server (python -m SimpleHTTPServer 0.0.0.0:8000) and open your resulting html file

My "compile wasm" build script is https://github.com/aras-p/HashFunctionsTest/blob/master/compile_emscripten.sh if you need an actual full example
can you simultaneously guarantee both no collisions len <= 8 and seed independent zero hash for zero length?
It depends on what you mean by "no collisions len <= 8".
Note that above-mentioned guarantee of no collision would be _for 2 inputs of same len <= 8_ . If that's what you meant, then yes, it's possible.
It is impossible to guarantee no collision for any 2 inputs of len <= 8, simply due to the pigeonhole principle. Or in more mathematical terms, there cannot be a bijection between 2 sets of different sizes.
A possible property would be to guarantee no collision for any 2 inputs of len < 8. In that case, the starting set is smaller than the destination one, so it's possible. Unfortunately, I have so far ruled out this property because it would make it too easy to generate an intentional secret-independent collision between an input of len < 8 and one of len == 8. However, if that's a desirable property, I could look again to find a better solution.
Thanks for guidance @aras-p , I'll sure try this setup as soon as possible.
Ah, yeah, of course! That makes sense. For my purposes, no same len collisions <=8 is definitely better than no collisions len <8.
Thanks to @aras-p 's hints, I started benchmarks in WebAssembly mode this morning. I've only used one platform, a recent Mac laptop, featuring a recent Intel Core, and Chrome 74.
Using the "wide range" test, it mostly confirms what we already knew : XXH3 on wasm is a bit faster than XXH32, but slower than XXH64. This is consistent with a non-vectorized scalar code path running on a 64-bit cpu.
I'm afraid I don't see a simple and immediate solution. The longer-term hope is that vector instructions will start working in a future emcc version, leading to auto-vectorization of the scalar code path, which currently works very well with clang.
If that's too far away, or too unsure, another route could be to generate a new code path using gcc/clang vector intrinsics, which are apparently supported by emcc.
The wide range test doesn't allow me to observe the cliffs at 128 bytes and 1 KB, because the distance between 2 consecutive measurements is too large (it's a power of 2). I will use the more accurate length-byte test to specifically target these areas, and make sure the next changes improve wasm performance.
With the last changes to the dev branch, I believe XXH3_64bits is now "feature complete".
It incorporates all points mentioned above.
I don't plan any more modifications, but obviously, the design is still open to comments and improvements.
Now, all that remains to be done is ... to apply the same mechanisms to the 128-bit variant...
Yawn… I'm back.
School is finally over, and I have a lot of free time for the next few days.
I have to replace my G3 very soon because the screen is dying. It is still functional, but I don't know for how long. God, I suck at decisions. I also plan to replace the Dell tower later this summer.
@Cyan4973 So, what did I miss?
I was looking at the recent changes, and it looks like the ARM and VSX paths need to be updated.
Could you please give me a quick recap about the recent changes to the algorithm? That would make things a little easier.
By the way, speaking of decisions, have you decided on how/if we are going to do the dispatcher?
XXH3_64bits() is stabilized, including its streaming implementation.
XXH3_128bits() is not, and that's my next priority. The current proposal in XXH128() does meet a few requirements, but its performance constraints seem too high, so I'll likely have to update it.
Hence, regarding updates, prefer concentrating on XXH3_64bits(). I don't remember if the NEON and VSX paths are correctly tested / validated.
The dispatcher has a place. I see it more as part of xxhsum than the library itself, but I may be mistaken in drawing this separation.
You also proposed an interesting patch reducing instruction size, which looks good. It's been preserved in a feature branch, reroll. But the changes within dev are so large that it's more complex to merge now.
The dispatcher has a place. I see it more as part of xxhsum than the library itself
Well, as I mentioned before, Google Play Store guidelines require a feature test for ARMv7 NEON support. And as for the AVX2/SSE2 paths: if anyone wants to use the AVX2 path, distributing binaries without a dispatcher is impractical. People are idiots and wouldn't know which binary to use. 🤷♂
Considering how key reading has changed, it seems that PPC big endian needs an update. I'll do some testing either tomorrow or Friday.
As for the reroll branch, I suggest merging the XXH32/XXH64 changes for now, and I'll try to work on XXH3 when I get a chance. Besides, it had a branching issue IIRC.
This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h) :

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs. A full speed analysis will be published ; it requires a lot more space than this comment can handle. In general, expect XXH3 to run about ~2x faster on large inputs, and >3x faster on small ones, though exact differences depend on the platform.

The algorithm is portable : it will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant : it reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys, and secret key generation.
There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete.
- … XXH64() (aka big-endian). … XXH32/XXH64, but may be more natural for little-endian platforms.
- … XXH128_hash_t ? … XXH128_hash_t which would be desirable ?
- … _128bits variant is the same as the result of _64bits. … XXH128_hash_t, in ways which may block other possibilities.
- … (doubleSeed).
- … XXH128 ?
- len==0 : Currently, the result of hashing a zero-length input is the seed. … XXH32/XXH64).