[Closed] easyaspi314 closed this issue 3 years ago
Canonical representation : for the 64-bit variant, canonical representation is the same as XXH64()
(aka big-endian).
Nothing is wrong with big endian output. I suggest for XXH3_64b, it will look like this (big endian):
0123456789abcdef-3 file.c
and XXH3_128b, this:
00112233445566778899aabbccddeeff-3 file.c
I say 2x big endian. That would be easiest to parse (in the case of being in a text file),
fscanf(hashfile, "%016llx%016llx-3", &hashLo, &hashHi);
and the easiest to output:
printf("%016llx%016llx-3\n", hashLo, hashHi);
Similar to how my XXH32a's canonical representation ended in -a, I think it makes sense for XXH3 to end in -3.
/* pointer or value? also, should we perhaps give a saturated subtraction for sorting? */
/* xxhash.h */
static inline bool operator==(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) == 0; }
static inline bool operator!=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) != 0; }
static inline bool operator> (const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) >  0; }
static inline bool operator< (const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) <  0; }
static inline bool operator<=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) <= 0; }
static inline bool operator>=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) >= 0; }
should we perhaps give a saturated subtraction for sorting
That's a great idea!
pointer or value?
I prefer passing by value when the structure is of limited size. That's the case here: XXH128_hash_t is just 16 bytes long.
Regarding the length == 0 return: I'd go for returning the seed (even if it is 0) instead of always returning 0. If one wants to differentiate the hashes of the same input data by providing a seed, you still get different hashes even if length == 0.
Pointer or value: I'd vote for passing the XXH128_hash_t structure by pointer. Passing by value will imply copying on 32-bit systems. Even on 64-bit systems, passing 2 structs might not be done in registers.
BTW: why "ll1" and "ll2"? IMHO they should be suffixed with "h" and "l".
regarding static U64 XXH3_mul128(U64 ll1, U64 ll2)
The Aarch32 code seems to be over-complicated for returning only the lower 64 bits.
GCC produces elegant and, as it looks, correct code:
mul r3, r0, r3
mla r3, r2, r1, r3
umull r0, r1, r0, r2
add r1, r3, r1
It doesn't return the lower 64 bits, it adds the lower bits to the higher bits. This is pretty much the best way of doing it.
You can see it in the GCC extension code:
__uint128_t lll = (__uint128_t)ll1 * ll2;
return (U64)lll + (U64)(lll >> 64);
or if we break it up:
__uint128_t ll1_u128 = (__uint128_t) ll1; // cast to __uint128_t
__uint128_t ll2_u128 = (__uint128_t) ll2; // promoted
__uint128_t lll = ll1_u128 * ll2_u128; // long multiply
U64 lll_hi = (U64) (lll >> 64); // high bits
U64 lll_lo = (U64) (lll & 0xFFFFFFFFFFFFFFFFULL); // low bits
return lll_hi + lll_lo; // 64-bit add together
In x86_64 assembly, we get this:
_mul128:
mov rax, rsi
mul rdi # note: rax stores high bits now
add rax, rdx
ret
The reason I use inline assembly is because neither GCC nor Clang will emit the right code and prefer to spew out nonsense, and the __uint128_t type is not available on 32-bit.
The code is quite efficient:
umull r12, lr, r0, r2
mov r5, #0
mov r4, #0
umaal r5, lr, r1, r2
umaal r5, r4, r0, r3
umaal lr, r4, r1, r3
adds r0, lr, r12
adc r1, r4, r5
TBH, XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128 to match XXH_mult32to64, to make it clear that this is a "folding" multiply and not just a superoptimized 64->128-bit multiply.
I also kinda agree about the pointer thing. On 32-bit, those two structs will take up 8 registers. ARM, for example, can only pass up to 4 arguments in registers before it has to push to the stack. And 32-bit x86 only has 7 usable registers.
Actually, I benchmarked it via qsort, and it seems like passing by value is in fact better on both 32-bit and 64-bit (for x86; haven't tested ARM yet):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
typedef struct {
uint64_t ll1;
uint64_t ll2;
} XXH128_hash_t;
__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
if (a->ll2 == b->ll2) {
if (a->ll1 == b->ll1) {
return 0;
} else if (a->ll1 < b->ll1) {
return -1;
} else {
return 1;
}
}
if (a->ll2 < b->ll2) {
return -1;
} else {
return 1;
}
}
__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
if (a.ll2 == b.ll2) {
if (a.ll1 == b.ll1) {
return 0;
} else if (a.ll1 < b.ll1) {
return -1;
} else {
return 1;
}
}
if (a.ll2 < b.ll2) {
return -1;
} else {
return 1;
}
}
int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
/* BUG: arr_p is an array of pointers, so qsort hands this function
 * XXH128_hash_t ** elements; casting them straight to XXH128_hash_t *
 * compares the wrong memory. The second version below fixes this with
 * an extra dereference. */
return XXH128_hash_cmp_ptr((const XXH128_hash_t *)start, (const XXH128_hash_t *)end);
}
int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}
#define ROUNDS 1000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];
int main()
{
srand(time(NULL));
printf("arch: %zu-bit\n", sizeof(void *) * 8);
for (int i = 0; i < ROUNDS; i++) {
arr[i].ll1 = rand() | (uint64_t)rand() << 32;
arr[i].ll2 = rand() | (uint64_t)rand() << 32;
arr_p[i] = malloc(sizeof(XXH128_hash_t));
memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
}
double start, end;
start = (double)clock();
qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
end = (double)clock();
printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);
start = (double)clock();
qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
end = (double)clock();
printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);
}
arch: 32-bit
pass by value: 0.352646
pass by pointer: 0.526951
arch: 64-bit
pass by value: 0.198776
pass by pointer: 0.449999
(note: compiled with clang 7.0.1 on a 2.0 GHz 2nd Gen Core i7, -O3
)
GCC 8.3:
arch: 32-bit
pass by value: 0.442356
pass by pointer: 0.547503
arch: 64-bit
pass by value: 0.175436
pass by pointer: 0.360158
On ARM64 there are enough registers in the ABI to pass it by value (r0..r7), but for 32-bit ARM the pointer version is better as there are only 4 argument registers. So it highly depends on the CPU and the ABI. Edit: I checked, the x86-64 GCC ABI uses four registers, just enough to pass the two structs by value.
Nevermind that, this is the way to do it:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
typedef struct {
uint64_t ll1;
uint64_t ll2;
} XXH128_hash_t;
__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
if (a->ll2 == b->ll2) {
if (a->ll1 == b->ll1) {
return 0;
} else if (a->ll1 < b->ll1) {
return -1;
} else {
return 1;
}
}
if (a->ll2 < b->ll2) {
return -1;
} else {
return 1;
}
}
__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
if (a.ll2 == b.ll2) {
if (a.ll1 == b.ll1) {
return 0;
} else if (a.ll1 < b.ll1) {
return -1;
} else {
return 1;
}
}
if (a.ll2 < b.ll2) {
return -1;
} else {
return 1;
}
}
int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
return XXH128_hash_cmp_ptr(*(const XXH128_hash_t **)start, *(const XXH128_hash_t **)end);
}
int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}
#define ROUNDS 10000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];
static XXH128_hash_t arr_p_alt[ROUNDS];
int main()
{
srand(time(NULL));
printf("arch: %zu-bit\n", sizeof(void *) * 8);
for (int i = 0; i < ROUNDS; i++) {
arr[i].ll1 = rand() | (uint64_t)rand() << 32;
arr[i].ll2 = rand() | (uint64_t)rand() << 32;
arr_p[i] = malloc(sizeof(XXH128_hash_t));
memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
memcpy(&arr_p_alt[i], &arr[i], sizeof(XXH128_hash_t));
}
double start, end;
start = (double)clock();
qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
end = (double)clock();
printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);
start = (double)clock();
qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
end = (double)clock();
printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);
start = (double)clock();
qsort(arr_p_alt, ROUNDS, sizeof(XXH128_hash_t), (int (*)(const void *, const void *)) XXH128_hash_cmp_ptr);
end = (double)clock();
printf("direct pass to qsort: %lf\n", (end - start) / CLOCKS_PER_SEC);
}
(I increased the count to make it more noticeable.)
arch: 32-bit
pass by value: 4.095442
pass by pointer: 5.924909
direct pass to qsort: 3.096856
arch: 64-bit
pass by value: 2.489569
pass by pointer: 4.165021
direct pass to qsort: 2.178582
Use the pointer, and to use it for qsort, say:
/* To use it as a comparator for qsort, bsearch, or the like, the recommended way is to
 * cast to int (*)(const void *, const void *). This has the best performance.
 * qsort(array, size, sizeof(XXH128_hash_t), (int (*)(const void *, const void *)) XXH128_hash_cmp)
 * This assumes an array of XXH128_hash_t values. */
Because a compare function for pointers is going to be the most important usage of this, as I presume it will be used in stuff like qsort and bsearch.
As for the C++ version, we are going to want to inline the comparisons, as it makes std::sort blazing fast:
static inline bool operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline bool operator <(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
if (a.ll2 == b.ll2) {
return a.ll1 < b.ll1;
} else {
return a.ll2 < b.ll2;
}
}
static inline bool operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
return !(a == b);
}
static inline bool operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return b < a;
}
static inline bool operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return !(a > b);
}
static inline bool operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return !(a < b);
}
And by passing by &, we are also passing by pointer. It is just syntax sugar.
arch: 32-bit
pass by value: 4.056218
pass by pointer: 5.928958
direct pass to qsort: 3.117807
std::sort: 2.620486
std::sort inlined: 2.486056
arch: 64-bit
pass by value: 2.375233
pass by pointer: 4.089312
direct pass to qsort: 2.440806
std::sort: 1.826079
std::sort inlined: 1.583563
Edit: the first std::sort uses this:
static inline bool operator<(const XXH128_hash_t &a, const XXH128_hash_t &b) {
return XXH128_hash_cmp_ptr(&a, &b) < 0;
}
Even better: this version is all C++11-compatible constexpr, which, while not that important, seems to help optimization a bit.
int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
if (a->ll2 == b->ll2) {
return (a->ll1 < b->ll1) ? -1 : (a->ll1 > b->ll1);
}
if (a->ll2 < b->ll2) {
return -1;
} else {
return 1;
}
}
static inline constexpr bool
operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline constexpr bool
operator<(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
#if (defined(_WIN32) || defined(__LITTLE_ENDIAN__)) && (defined(__SIZEOF_INT128__) || (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128))
/* convert to uint128_t and compare */
return (a.ll1 | (static_cast<__uint128_t>(a.ll2) << 64))
< (b.ll1 | (static_cast<__uint128_t>(b.ll2) << 64));
#else
return (a.ll2 == b.ll2) ? (a.ll1 < b.ll1) : (a.ll2 < b.ll2);
#endif
}
static inline constexpr bool
operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
return !(a == b);
}
static inline constexpr bool
operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return b < a;
}
static inline constexpr bool
operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return !(a > b);
}
static inline constexpr bool
operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
return !(a < b);
}
It is incredibly fast. Unfortunately, the uint128_t compare for the C-style sorting isn't as good as the other method.
arch: 32-bit
pass by value: 4.012220
direct pass to qsort: 3.027623
std::sort: 2.359985
std::sort by value: 2.335475
std::sort new: 2.096199
arch: 64-bit
pass by value: 2.185738
pass to qsort, u128_t compare: 2.127638
direct pass to qsort: 2.044731
std::sort: 1.629353
std::sort by value: 1.629171
std::sort new: 1.201221
On x86_64 with clang, operator < compiles to this:
__ZltRK13XXH128_hash_tS1_: ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
mov rax, qword ptr [rdi]
mov rcx, qword ptr [rdi + 8]
cmp rax, qword ptr [rsi]
sbb rcx, qword ptr [rsi + 8]
setb al
ret
and on i386 it compiles to this:
__ZltRK13XXH128_hash_tS1_: ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
push ebp
push ebx
push edi
push esi
mov esi, dword ptr [esp + 24]
mov ecx, dword ptr [esp + 20]
mov eax, dword ptr [ecx + 8]
mov edx, dword ptr [ecx + 12]
mov ebx, dword ptr [esi + 8]
mov edi, dword ptr [esi + 12]
mov ebp, edx
xor ebp, edi
mov esi, eax
xor esi, ebx
or esi, ebp
jne LBB5_2
mov eax, dword ptr [ecx]
mov ecx, dword ptr [ecx + 4]
mov edx, dword ptr [esp + 24]
cmp eax, dword ptr [edx]
sbb ecx, dword ptr [edx + 4]
jmp LBB5_3
LBB5_2:
cmp eax, ebx
sbb edx, edi
LBB5_3:
setb al
pop esi
pop edi
pop ebx
pop ebp
ret
Edit: GCC 8.3
arch: 32-bit
pass by value: 4.906234
direct pass to qsort: 3.260699
std::sort: 2.868732
std::sort by value: 2.862184
pass by pointer: 1.626228
arch: 64-bit
pass by value: 2.009833
pass to qsort, u128_t compare: 2.098546
direct pass to qsort: 1.861579
std::sort: 1.432066
std::sort by value: 1.489022
pass by pointer: 1.113134
Regarding static U64 XXH3_mul128(U64 ll1, U64 ll2) The Aarch32 code seems to be over-complicated for returning only the lower 64 bits.
As mentioned by @easyaspi314 , it's a bit more than just returning the lower bits. As the high bits get folded into the low bits, it might be possible to write this operation more compactly, and not need a "real" 128-bit multiplication. However, I have not figured out the math so far.
I'm fine with a rewrite of the function if it can lead to better performance. If some gcc assembly can help reach this purpose, it's all fine.
XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128
Good point
Consequences of implementing a comparator:
In saying that one hash value is "bigger" than the other one, it makes XXH128_hash_t more than a bit field. It is now necessary to distinguish its components, as they are no longer equal: one must be "high", and the other "low". And now the order matters. Which one is high, which one is low?
Possible answer:
In keeping the convention that internal components of XXH128_hash_t are XXH64_hash_t, for which the convention is big-endian, it feels consistent to continue using the same convention, making the first component "high", and the second "low".
Thoughts?
Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".
Yes, I agree. If we are implementing a comparator function, components must be distinguished as "high" or "low".
In keeping the convention that internal components of XXH128_hash_t are XXH64_hash_t, for which the convention is big-endian, it feels consistent to continue using the same convention, making the first component "high", and the second "low". Thoughts?
I'd go for this. This would be compatible with a natural uint128_t if it exists.
Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".
Yes, I agree. If we are implementing a comparator function, components must be distinguished as "high" or "low".
Also if you consider the struct as a replacement for uint128_t:
a natural uint128_t if it exists
// 0x00112233445566778899AABBCCDDEEFF
const __uint128_t x = ((__uint128_t)0x0011223344556677ULL<< 64) | 0x8899AABBCCDDEEFFULL;
$ clang -O3 -c test.c
$ hexdump -C test.o
...
00000180 ff ee dd cc bb aa 99 88 77 66 55 44 33 22 11 00 |........wfUD3"..|
...
For uint128_t 0x00112233445566778899AABBCCDDEEFF, the "natural" layout for most machines would be 0xFFEEDDCCBBAA99887766554433221100.
Doing it this way would allow someone to theoretically do this on a 64-bit little endian machine:
XXH128_hash_t hash = XXH3_128b(...);
__uint128_t val;
memcpy(&val, &hash, sizeof(__uint128_t));
Also if you consider the struct as replacement for uint128_t
It is supposed to be like one, and if the C standard had 128-bit integers, we would definitely use them. But it doesn't, and the only platforms that have it are 64-bit.
Yes, that's indeed the problem. At a binary level:
{ hi, lo } would have the same byte order as __uint128_t on a big endian machine.
{ lo, hi } would have the same byte order as __uint128_t on a little endian machine.
So it's not possible to please both.
At a canonical level, I believe it makes sense to preserve the "natural" byte order, as if a __uint128_t could be displayed directly on screen. It makes "big endian" a natural choice. It follows the convention already established by XXH64_hash_t.
But binary and canonical levels do not have to match. Later on, a pair of conversion functions, such as XXH128_canonicalFromHash(), will ensure that the representation is always correct, whatever the platform.
So let's concentrate on the binary level now. It would seem to be a matter of selecting which platform is more important, little or big endian. And these days, little endian is pretty much the winner.
That being said, even that does not matter much: __uint128_t is not an official type, and it's doubtful it will ever be in the foreseeable future. So binary compatibility (which would be limited to little endian platforms) seems of doubtful usefulness.
Consistency is also a topic. { hi, lo } feels to match the "natural" order expected by humans. That's how we write numbers!
I'm not opposed to { lo, hi }, I just suspect a few people will find it strange to read low before high.
But at the end of the day, both solutions work. As long as low and high are clearly expressed, it should be usable. It's mostly a matter of picking one and sticking to it.
I'm fine with both. Just trying to find a reason to "prefer" one over the other.
I went ahead and proceeded with renaming the XXH128_hash_t members into low64 and high64. This will hopefully make it clearer for comparison purposes.
I used the little-endian binary convention, hence low first. I still don't have a strong reason to prefer one over the other, so I went with the easiest one.
I also believe it's not opposed to using big-endian for the canonical representation.
"by value" vs "by reference":
I would really not feel too concerned by a memcpy(,,16) operation. From a performance perspective, this should be unnoticeable. I have actually read reports that passing by value is better from a cache locality perspective, as pointers into different memory areas keep more cache lines active, thus degrading system performance.
Moreover, whenever the comparator function can be inlined, it all becomes moot. One can expect the comparison to be performed directly on the source data. Btw, maybe it's important to offer an inlinable comparator.
In contrast, I would feel more concerned by branches, which can be costly when they mispredict. Hopefully, in this case, since 64-bit fields are compared, the likelihood of finding hashes with equal high64 fields becomes very low, making the branch extremely predictable.
If we do not constrain the comparator to return { -1, 0, 1 }, but more generally { <0, ==0, >0 }, we may be able to reduce the number of branches in the comparator.
In this godbolt example, there is a comparison between a classical "branchy" comparator and a branchless one.
On 64-bit, the branchless variant looks good. It uses fewer instructions than the "branchy" one and avoids a cmov. I would expect it to run faster.
There is also a fully branchless comparator proposed as compare128_1(). It uses the same idea.
However, as mentioned before, it's unlikely that the high 64 bits of two XXH128_hash_t are identical. If that happens, it probably means that the hashed objects were effectively identical. As predictable branches are extremely cheap, adding one to reduce the amount of work is probably a good idea. compare128_2() uses that fact to introduce a single branch, between the high and low comparisons.
Now it's a matter of comparing these implementations...
I've used @easyaspi314 's qsort benchmark script, and adapted it to measure differences between comparators. Here is the modified source code: https://gist.github.com/Cyan4973/e1baa1a9a32b690508df76be64b34082
Some conclusions: I couldn't find any significant difference between the comparators. They all end up "within noise margin". I guess the comparator is only a small part of the sorting operation itself, so it doesn't matter much.
One indirection vs two indirections makes a difference, though a moderate one (~10%). Indeed, invoking the byPtr comparator directly is faster than invoking a wrapper using ptr arguments which itself calls the byValue comparator while ensuring this one is not inlined.
If it sounds like a win for byPtr, it's only because the test case mandates the use of byPtr. And when the byValue function can be inlined into the wrapper, it erases all differences.
So, the byPtr variant is a better fit for qsort, but is that an isolated example?
I still believe that passing arguments by value feels more natural, and that inlining is actually the most important key to performance.
There's an argument, though, for providing a comparator function "ready to use" with qsort. Even if isolated, it's still an important use case by itself.
typedef union {
struct {
uint64_t ll1;
uint64_t ll2;
};
/* don't use it directly! */
size_t __val[(sizeof(uint64_t)*2)/sizeof(size_t)];
} XXH128_hash_t;
int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
#if defined(__LITTLE_ENDIAN__) || defined(_WIN32)
/* Compare word-sized chunks from the most significant end.
 * This only works on little endian. */
size_t i = (sizeof(a->__val)/sizeof(a->__val[0]));
while (i-- > 0) {
int cmp = (a->__val[i] > b->__val[i]) - (a->__val[i] < b->__val[i]);
if (cmp) {
return cmp;
}
}
return 0;
#else
/* normal 64-bit comparison on other platforms */
if (a->ll2 != b->ll2) {
return (a->ll2 > b->ll2) - (a->ll2 < b->ll2);
}
return (a->ll1 > b->ll1) - (a->ll1 < b->ll1);
#endif
}
That has the best performance on 32-bit.
XXH128_hash_cmp_ptr_alt:
mov rax, qword ptr [rsi+0x8]
cmp qword ptr [rdi+0x8], rax
je L7
sbb eax, eax
or eax, 0x1
L1:
ret
L7:
mov rdx, qword ptr [rsi]
mov eax, 0x1
cmp qword ptr [rdi], rdx
ja L1
sbb eax, eax
ret
That clever bit of assembly is the best I've seen on 64-bit, and it is what GCC emits for the naive implementation:
__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
if (a->high64 == b->high64) {
if (a->low64 > b->low64) {
return 1;
} else if (a->low64 < b->low64) {
return -1;
} else {
return 0;
}
}
if (a->high64 < b->high64) {
return -1;
} else {
return 1;
}
}
As for qsort: I guess the most impact comes from swapping the 128-bit values if sorting does not use an array of pointers. So for machines which have enough memory to hold thousands of hashes, the advantage of one or the other way to call the compare function disappears in the noise.
Will there be createState/update/freeState APIs for XXH3?
Will there be createState/update/freeState APIs for XXH3?
Yes, that's basically next step.
Have you compared it with SeaHash?
I wasn't aware of this hash, thanks for the link !
From its description, it says that it's a bit faster than XXH64, and that it's aimed at checksumming. Hence it's probably not optimized for small inputs. Also, 10% faster than XXH64 seems not enough to compete on speed with XXH3 on large inputs.
https://github.com/Cyan4973/xxHash/issues/180
There is currently a huge bug in the implementation, so some parts need to be redesigned.
I have to disagree with the mention of "huge bug" here. I think we do not share the same view of the scope of a non-cryptographic hash function.
I maintain that a non-cryptographic hash should not even try to "resist" an attack, by adding layers of obfuscations that will just be defeated later on. In the meantime, there will always be a temptation to think "I don't understand what's going on in this algorithm; sure, it's labelled non-cryptographic, but (wink) it seems it's almost as good ", then it gets used in places where it should not, and a bit later, we've got an effective attack ready to roll on all these brittle implementations.
A non-cryptographic algorithm should not hide the fact that it's non-cryptographic, not even in its implementations. In the case of XXH3, it's trivial to figure out how to intentionally generate collisions, on condition of knowing the secret key and the seed. So sure, if one targets the default key and the default seed, it's just plain simple. For other algorithms, it may take a study to get there, but once the method is known, it's only a few clicks away, and infinite collision generation is a matter of a simple script.
The objective of a non-cryptographic hash algorithm is just to quickly generate a bunch of bits, to be used in a hash table, bloom filter, etc. The changes I'm implementing are aimed at the space reduction problem, which is far from severe. They will, as a side effect, make it a bit more difficult to generate intentional collisions, but it certainly does not make the algorithm any more secure, and is not the goal. From a user perspective, the impact of this space reduction improvement is going to be minimal, if not completely unnoticeable. The way I see it, we are tuning (and hopefully improving) the quality / performance trade-off. Which is precisely the point of this test period.
I commented on the announcement blog post, but this seems a more appropriate place to put this... I'm attempting a C# port here: https://github.com/Zhentar/xxHash3.NET/
I'm unfortunately finding it unexpectedly difficult just to match the performance of xxHash64 for large keys - even the SSE2 version is only on par with xxHash32
Unfortunately, C# is not my strong point. So I can't provide straightforward hints.
I wouldn't be surprised if there is a need to use some kind of "unsafe" interface, in order to remove a bunch of automated silent checks, which end up being quite significant in a hot loop. How to do it is outside of my sphere of competence. Maybe looking at the source code of a few other xxHash projects in C# might help.
Figured out how to get MSVC to spit out an optimized build of xxHash... maybe it's my hardware (i7-6600U), not my implementation:
XXH32 : 1000000 -> 5928 it/s ( 5653.5 MB/s)
XXH32 unaligned : 1000000 -> 6058 it/s ( 5777.1 MB/s)
XXH64 : 1000000 -> 11924 it/s (11372.1 MB/s)
XXH64 unaligned : 1000000 -> 11917 it/s (11365.2 MB/s)
XXH3_64bits : 1000000 -> 4778 it/s ( 4556.4 MB/s)
XXH3_64b unaligned : 1000000 -> 4082 it/s ( 3892.5 MB/s)
Just my 2 cents: I come from safety, not security, so I am looking for non-cryptographic hashes which at best outperform CRC32 in both speed and collision-freeness. It does not matter if one can generate collisions by knowing the seed or the key. But I want to be sure that a bunch of changed bits will not lead to a good hash.
Replacing CRC32(C) which is currently used to verify files integrity is my goal too.
@Zhentar : In my experience, clang is way better at optimizing vectorized code, gcc a distant second, generally not bad at SSE2 but definitely worse for AVX2 and NEON, then msvc, which is lagging by a lot.
I don't have an i7-6600u platform to test on these days. If there is a possibility of remote access, I'll be glad to investigate. In the meantime, I would recommend you to test clang. On Windows, the way I would proceed is to install the msys2 + mingw64 subsystem, then use the integrated package manager to install clang.
I will try to post a Windows binary of xxhsum in the release tab. With regards to XXH3, the only thing this Windows binary will offer is the benchmark module. Hopefully, it might be enough to get a sense of the target performance.
A known good reference build would be very helpful (even if it's a Linux build, though I would prefer Windows).
Unfortunately, the other C# xxHash ports did not have much to offer me in regards to optimization tips...
Method | ByteLength | Mean | Throughput |
---|---|---|---|
Zhent_xxHash32 | 15 | 10.61 ns | 1,348.5 MB/s |
xxHashSharp_xxHash32 | 15 | 10.96 ns | 1,305.1 MB/s |
Zhent_xxHash64 | 15 | 13.55 ns | 1,056.1 MB/s |
NeoSmart_xxHash64 | 15 | 114.19 ns | 125.3 MB/s |
yyproj_xxHash64 | 15 | 199.12 ns | 71.8 MB/s |
HashFunctions_xxHash64 | 15 | 317.77 ns | 45.0 MB/s |
core20_xxHash64 | 15 | 531.48 ns | 26.9 MB/s |
Zhent_xxHash64 | 1000000 | 82,298.71 ns | 11,588.0 MB/s |
Zhent_xxHash32 | 1000000 | 144,697.68 ns | 6,590.8 MB/s |
yyproj_xxHash64 | 1000000 | 587,337.64 ns | 1,623.7 MB/s |
HashFunctions_xxHash64 | 1000000 | 1,093,433.89 ns | 872.2 MB/s |
xxHashSharp_xxHash32 | 1000000 | 1,118,007.87 ns | 853.0 MB/s |
NeoSmart_xxHash64 | 1000000 | 2,135,509.68 ns | 446.6 MB/s |
core20_xxHash64 | 1000000 | 5,708,470.65 ns | 167.1 MB/s |
(note that this is a different machine than my previous post, I don't think my xxHash32 is actually meaningfully faster than yours)
I added 2 Windows binaries to the release tab, one gcc + SSE2, and one clang + AVX2. XXH3 is accessible through the benchmark module -b; it will only provide speed numbers, not the hash value itself.
Thanks! That definitely helps. My initial impression is that clang's speed (at least with AVX2) comes from completely unrolling & inlining the block stripe accumulation loops, so that all of the key indexing can use immediate offsets.
I wonder, how does reducing the keyset size to 32 influence things for you? 8 stripes per block instead of 16 seems like it could be more friendly to less aggressive compilers.
I tried this suggestion of reducing the key size to reduce the loop size. In my experiment, it made gcc-8 slightly better when using -march=native (but not -mavx2), but clang became worse. As a side effect, both versions had similar performance, though clang was still leading by a small margin.
So it's not entirely win-win nor lose-lose, but since it reduces top performance by ~7%, I guess the current key size and associated loop size still seem preferable.
clang's speed (at least with AVX2) comes from completely unrolling & inlining the block stripe accumulation loops, so that all of the key indexing can use immediate offsets.
Did you check that in the assembly?
Yep. It's pretty impressive - all of XXH3_hashLong has just two dozen scalar instructions, with only three in the block loop.
But I managed to get my loop pretty close, with just six!
Method | ByteLength | Mean | Throughput |
---|---|---|---|
XxHash3AVX2 | 524288 | 16.25 us | 30,772.7 MB/s |
(clang's: XXH3_64b unaligned : 102400 -> 385931 it/s (37688.5 MB/s))
I think I know how to shrink that gap a touch, but it's going to make some already hideous code look even worse 😩
Build FAILED.
"/Users/user/xxHash3.NET/xxHash3.NET.sln" (default target) (1) ->
"/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj" (default target) (2) ->
"/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj" (Build target) (2:2) ->
(CoreCompile target) ->
xxHash3_SSE2.cs(3,22): error CS0234: The type or namespace name 'Intrinsics' does not exist in the namespace 'System.Runtime' (are you missing an assembly reference?) [/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj]
xxHash3_SSE2.cs(4,22): error CS0234: The type or namespace name 'Intrinsics' does not exist in the namespace 'System.Runtime' (are you missing an assembly reference?) [/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj]
Well, TIL, Mono doesn't have System.Runtime.Intrinsics.
😡
Hah, yeah, .NET Core 3.0-preview only at present. Did I screw up my #if
define checks?
I had to fiddle with the config files so idk
either way, doesn't look like I'm trying it on my Mac today…
.NET Core is cross platform, macOS included 😃 You can grab the 3.0 preview here: https://dotnet.microsoft.com/download/dotnet-core/3.0 and then it should just be `dotnet build -c Release` for a CLI build.
~/xxHash3.NET/xxHash3.Benchmarks/bin/Release/netcoreapp3.0 $ sudo ./xxHash3.Benchmarks
Password:
// Validating benchmarks:
// ***** BenchmarkRunner: Start *****
// ***** Found 1 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start *****
// ***** Done, took 00:00:00 (0.05 sec) *****
// Found 1 benchmarks:
// LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288]
// **************************
// Benchmark: LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288]
// *** Execute ***
// Launch: 1 / 1
// Benchmark Process Environment Information:
// Runtime=.NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
// GC=Concurrent Workstation
// Job: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain)
OverheadJitting 1: 1 op, 415857.00 ns, 415.8570 us/op
Unhandled Exception: System.InvalidOperationException: Benchmark LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288] takes to long to run. Prefer to use out-of-process toolchains for long-running benchmarks.
at BenchmarkDotNet.Toolchains.InProcess.InProcessExecutor.Execute(ExecuteParameters executeParameters)
at BenchmarkDotNet.Running.BenchmarkRunnerClean.RunExecute(ILogger logger, BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, IToolchain toolchain, BuildResult buildResult, IResolver resolver, IDiagnoser diagnoser, Boolean& success)
at BenchmarkDotNet.Running.BenchmarkRunnerClean.Execute(ILogger logger, BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, IToolchain toolchain, BuildResult buildResult, IResolver resolver)
at BenchmarkDotNet.Running.BenchmarkRunnerClean.RunCore(BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, ILogger logger, IResolver resolver, BuildResult buildResult)
at BenchmarkDotNet.Running.BenchmarkRunnerClean.Run(BenchmarkRunInfo benchmarkRunInfo, Dictionary`2 buildResults, IResolver resolver, ILogger logger, List`1 artifactsToCleanup, String rootArtifactsFolderPath, StartedClock& runChronometer)
at BenchmarkDotNet.Running.BenchmarkRunnerClean.Run(BenchmarkRunInfo[] benchmarkRunInfos)
at BenchmarkDotNet.Running.BenchmarkRunner.RunWithDirtyAssemblyResolveHelper(Type type, IConfig config)
at BenchmarkDotNet.Running.BenchmarkRunner.Run[T](IConfig config)
at xxHash3.Program.Main() in /Users/user/xxHash3.NET/xxHash3.Benchmarks/Program.cs:line 31
Abort trap: 6
🤔
I don't have AVX2 (I have Sandy Bridge), but I changed it to `UseSse2`…
It's not like it's doing anything; it's only using about 3% CPU.
Managed to close the gap quite a bit! Manually mimicked some of the optimizations clang made, along with realizing that using a 512kb test buffer on a processor with a 512kb L2 cache was maybe not without downsides.
Method | ByteLength | Mean | Error | StdDev | Throughput |
---|---|---|---|---|---|
XxHash3AVX2 | 102400 | 2.686 us | 0.0337 us | 0.0281 us | 36,354.0 MB/s |
XxHash3AVX2 | 524288 | 15.269 us | 0.2233 us | 0.2089 us | 32,745.5 MB/s |
It turns out Clang did a bit more than I realized: it also completely de-shingled the keys array to get all the key loads aligned, and even pre-computed the key shuffle in the accumulator scramble.
@easyaspi314
Hmmm. The next line you're supposed to see is this:
OverheadJitting 1: 1 op, 314000.00 ns, 314.0000 us/op
WorkloadJitting 1: 1 op, 14012400.00 ns, 14.0124 ms/op
Which means it's freezing while trying to JIT compile the code 😬 I haven't transferred any of my AVX2 learnings to the SSE2 code yet though, so that part isn't so fast yet anyway.
Clang is pretty clever. Usually. GCC too (although it has its weaknesses).
MSVC stands true to its name: MSVC Sucks at Vectorizing Code
A lot of the time, when I'm optimizing SIMD code, I take the compiler output from GCC and Clang targeting different arches and apply it back and forth. So, I get it.
314000.00 ns, 314.0000 us/op
I see what you did there 😆
I'm no C# developer, so I don't know how much help I can be.
Any news?
I'm completing a 64-bit collision analyzer. It generates and detects collisions over enough hashes to provide meaningful results, and can therefore judge the quality of any 64-bit hash algorithm. The default test generates 24 billion hashes, and needs at least 64 GB of RAM to run properly.
The tool will make it possible to check that `XXH3_64bits()` has indeed the collision resistance of an "ideal" 64-bit hash algorithm.
It will also be used later in the refactoring of `XXH3_128bits()`, to prove that it has better collision resistance than the 64-bit variant.
Unfortunately, it will not be able to directly measure the collision rate for 128-bit hashes, as the amount of resources needed is, well, unreasonable.
This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the `xxh3` branch.

@Cyan4973's comments (from `xxhash.h`):

`XXH3` is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs. A full speed analysis will be published; it requires a lot more space than this comment can handle. In general, expect `XXH3` to run about ~2x faster on large inputs, and >3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable: it will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

`XXH3` offers 2 variants, `_64bits` and `_128bits`. The first 64-bit field of the `_128bits` variant is the same as the `_64bits` result. However, if only 64 bits are needed, prefer calling the `_64bits` variant: it reduces the amount of mixing, resulting in faster speed on small inputs.

The `XXH3` algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys and secret key generation.
There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete:

- Canonical representation: for the 64-bit variant, the canonical representation is the same as `XXH64()` (aka big-endian). A little-endian representation would differ from `XXH32`/`XXH64`, but may be more natural for little-endian platforms.
- Which operations should be defined on `XXH128_hash_t`? Are there other operations on `XXH128_hash_t` which would be desirable?
- The first 64-bit field of the `_128bits` variant is the same as the result of `_64bits`. Keeping this property constrains `XXH128_hash_t`, in ways which may block other possibilities.
- Should the 128-bit variant accept a 128-bit seed (e.g. a `doubleSeed`)?
- Should there be a shorthand `XXH128`?
- `len==0`: Currently, the result of hashing a zero-length input is the seed (unlike `XXH32`/`XXH64`).