Open Cyan4973 opened 4 years ago
Another one I was thinking was an option to disable streaming.
The streaming API takes up a good chunk of the binary size:
Clang 10.0.1, aarch64, Termux
$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_stream.so
$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_nostream.so -DXXH_NO_STREAM
$ size -A *.so | grep -E '(:|Total)'
libxxh_nostream.so :
Total 7219
libxxh_stream.so :
Total 12848
Edit: -O3:
libxxh_nostream.so :
Total 21966
libxxh_stream.so :
Total 35067
Another cheat optimization is to use __attribute__((__pure__)) on as many xxHash functions as possible.
It says, roughly:
Basically, __pure__ is the magic behind the strlen optimization. (Excluding the second optimization where the compiler treats it as a built-in function and inlines/const props it)
const char *bad_strchr(const char *s, int c)
{
    for (size_t i = 0; i < strlen(s) /* EEK */; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}
Any decent compiler will change it to this:
const char *bad_strchr(const char *s, int c)
{
    const size_t len = strlen(s); // strlen is pure, we only need to call it once
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}
Although it is easiest to see with this code:
size_t strlenx2(const char *s)
{
    return strlen(s) + strlen(s);
}
Equivalent code with how Clang shuffles registers on x86_64:
// optimized
size_t strlenx2(const char *s)
{
    size_t len = strlen(s);
    len += len;
    return len;
}
// unoptimized
size_t strlenx2(const char *s)
{
    size_t tmp_len;    // r14
    const char *tmp_s; // rbx
    tmp_s = s;
    size_t len = strlen(s);
    tmp_len = len;
    s = tmp_s;
    len = strlen(s);
    len += tmp_len;
    return len;
}
This could possibly improve performance on some hash tables depending on how they are used. Primarily thinking of code that looks like this:
table[key].foo = "Foo";
table[key].bar = "Bar";
Note that the compiler sometimes can figure this out on its own if xxHash is inlined, but this applies to both inline and extern functions.
I would have an option added to explicitly disable it for a fair benchmark.
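To make the hash-table case concrete, here is a minimal sketch, assuming a GCC/Clang-style compiler. DEMO_PURE, DEMO_NO_PURE, demo_hash and the toy table are hypothetical names for illustration only, not part of xxHash. demo_hash is a plain FNV-1a stand-in, and DEMO_NO_PURE models the kind of opt-out switch suggested above for fair benchmarking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opt-out: build with -DDEMO_NO_PURE to benchmark without the hint. */
#if defined(__GNUC__) && !defined(DEMO_NO_PURE)
#  define DEMO_PURE __attribute__((__pure__))
#else
#  define DEMO_PURE
#endif

/* Stand-in hash (64-bit FNV-1a), NOT an xxHash function.
 * It only reads its input and has no side effects, so "pure" is accurate. */
DEMO_PURE uint64_t demo_hash(const void *data, size_t len);

uint64_t demo_hash(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

#define DEMO_SLOTS 64

struct demo_entry { const char *foo; const char *bar; };
static struct demo_entry demo_table[DEMO_SLOTS];

/* Same pattern as table[key].foo / table[key].bar: the key is hashed twice.
 * With the pure hint, the compiler MAY merge the two calls, provided it can
 * show that the store in between does not modify the key bytes. */
void demo_set_both(const char *key)
{
    demo_table[demo_hash(key, strlen(key)) % DEMO_SLOTS].foo = "Foo";
    demo_table[demo_hash(key, strlen(key)) % DEMO_SLOTS].bar = "Bar";
}
```

Whether the second call actually disappears depends on the compiler's alias analysis, so the generated assembly is worth checking before relying on it.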
Other ideas:
- XXH3_64bits to drop in as a replacement for common std::hash uses in C++
- std::is_trivial, std::has_unique_object_representations in C++17, check len % sizeof(T))
- noexcept/__attribute__((__nothrow__)) specifier since the single shot functions will never throw an exception, allowing the compiler to leave out unwind tables and shrink code size a bit.
Yes, I like pure functions, so I'm all for it.
Also note the existence of const functions in gcc, with an even stricter set of restrictions.
In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers. That being said, I like them even if the only impact is to provide better code documentation.
Also, in the context of library linking, this is information that the linker can't guess from just the function signature, so it could end up being actually useful on the user side.
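As a rough illustration of the difference between the two GCC attributes, here is a small sketch. DEMO_PURE, DEMO_CONST and both functions are hypothetical names, not xxHash API. The const function looks only at its argument values, while the pure one is also allowed to read memory through a pointer.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#  define DEMO_PURE  __attribute__((__pure__))
#  define DEMO_CONST __attribute__((__const__))
#else
#  define DEMO_PURE
#  define DEMO_CONST
#endif

/* const: the result depends only on the argument values; no memory is read. */
DEMO_CONST uint64_t demo_rotl64(uint64_t x, unsigned r);

/* pure: may read memory (here, through p), but has no side effects. */
DEMO_PURE uint64_t demo_sum_bytes(const unsigned char *p, size_t len);

uint64_t demo_rotl64(uint64_t x, unsigned r)
{
    r &= 63; /* keep the shift count in range */
    return r ? (x << r) | (x >> (64 - r)) : x;
}

uint64_t demo_sum_bytes(const unsigned char *p, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += p[i];
    }
    return sum;
}
```

A function taking a pointer to the input buffer, as a hash function does, can be pure at most; const would be wrong for it because the result depends on the pointed-to bytes, not just on the pointer value.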
> In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers.
It appears that with XXH_INLINE_ALL, Clang and GCC can't tell that XXH3 is pure without the annotation, but they can figure out XXH32 and XXH64.
A few months ago, you mentioned how XXH3 is threadable. Obviously this would be an opt-in feature, as some programs like compilers (which I know at least Clang and GNU LD use XXH64) are designed to remain on a single thread to parallelize with make.
With some experimentation, it seems to be beneficial to spawn a second thread once you get to ~8-16MB.
On my phone (haven't tested on my PC yet because I have yet to master Windows threads), 6-8 MB seems to be the range where it is beneficial, with a max speed of 7.3 GB/s compared to 5.2 GB/s on one thread.
The implementation would be pretty simple; the most complicated thing here is dealing with the pthread struct and the accumulate loop, which should probably be outlined to its own function to avoid copypasta.
I believe we can do a similar thing with CreateThread on Windows.
Note that this would technically conflict with the __attribute__((__pure__)) idea, although if we talk about the end effects, it will still be pure.
As I mention in the comment, I don't see any reason to spawn more than one helper thread, as we waste hundreds of thousands of possible accumulate loops by setting up each pthread, meaning 4 threads would likely require a ridiculous 64-128 MB and a much more complicated error handling routine.
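The single-helper-thread idea can be sketched as follows. This is not the XXH3 implementation: demo_accumulate is a hypothetical, purely additive stand-in for the accumulate loop, chosen so that the two halves can be merged with one addition, and the threshold is a caller-supplied parameter (the measurements above suggest something in the range of several MB). The sketch assumes POSIX threads and falls back to a single thread if the helper cannot be created.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Work item handed to the helper thread. */
typedef struct {
    const unsigned char *data;
    size_t len;
    uint64_t acc; /* written by the helper, read after pthread_join */
} demo_job;

/* Hypothetical stand-in for the accumulate loop, outlined into its own
 * function so both threads share it. It is additive, so partial results
 * over disjoint ranges can simply be summed. */
static uint64_t demo_accumulate(const unsigned char *p, size_t len)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        acc += (uint64_t)p[i] * 0x9E3779B185EBCA87ULL;
    }
    return acc;
}

static void *demo_worker(void *arg)
{
    demo_job *job = (demo_job *)arg;
    job->acc = demo_accumulate(job->data, job->len);
    return NULL;
}

/* Uses one helper thread for inputs of at least `threshold` bytes. */
uint64_t demo_accumulate_mt(const unsigned char *data, size_t len, size_t threshold)
{
    if (len < threshold) {
        return demo_accumulate(data, len);
    }
    size_t half = len / 2;
    demo_job job = { data + half, len - half, 0 };
    pthread_t helper;
    if (pthread_create(&helper, NULL, demo_worker, &job) != 0) {
        /* Could not spawn the helper: do everything on this thread. */
        return demo_accumulate(data, len);
    }
    uint64_t acc = demo_accumulate(data, half); /* first half, calling thread */
    pthread_join(helper, NULL);
    return acc + job.acc;                       /* merge the two halves */
}
```

Keeping the error handling to a single fallback path is one reason a single helper thread stays simple; with more threads, each failed pthread_create would need its own recovery.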
So I was wondering if we should start doing Doxygen? We don't necessarily have to set up a server for it.
Especially since xxhash.h is massive now, having a little Doxygen site might help and would probably be easier than writing xxh3_spec.md.
It also gives us some opportunity to document the internals because we can group them.
Here are some examples:
Also, didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?
> I was wondering if we should start doing Doxygen?
Yes, that's a good idea. Moving code comments to Doxygen parsing convention can be done progressively.
> didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?
I don't see a benefit in such a change
> I don't see a benefit in such a change
uint64_t is unsigned long on LP64, so it would be consistent.
$ cat test.c
#include <xxhash.h>
#include <stdio.h>
#include <inttypes.h>
int main(void)
{
    printf("%#016" PRIx64 "\n", XXH64("hello", 5, 0));
    return 0;
}
$ gcc -std=gnu99 -O2 -Wall -c test.c -I xxHash
// Ok
$ g++ -std=gnu++11 -O2 -Wall -c -xc++ test.c -I xxHash
// Ok
$ gcc -std=gnu90 -O2 -Wall -c test.c -I xxHash -Wno-long-long
test.c: In function 'main':
test.c:7:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'XXH64_hash_t' {aka 'long long unsigned int'} [-Wformat=]
7 | printf("%#016" PRIx64 "\n", XXH64("Hello", 5, 0));
| ^~~~~~~ ~~~~~~~~~~~~~~~~~~~~
| |
| XXH64_hash_t {aka long long unsigned int}
In file included from test.c:3:
/usr/include/inttypes.h:127:34: note: format string is defined here
127 | #define PRIx64 __PRI_64_prefix"x" /* uint64_t */
$
In gnu90 mode, -Wformat fires off because on LP64, PRIx64 is "lx".
The reverse is true if you do "%llx": it will fire a warning on C++ and C99.
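On the caller side, one way to sidestep the mismatch, whichever base type the hash typedef ends up using, is to cast explicitly to unsigned long long and print with "%llx". The helper below is a hypothetical sketch, not part of xxHash.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes 16 hex digits plus a terminating NUL into out.
 * The cast makes the argument type match %llx on every data model,
 * so -Wformat stays quiet regardless of how the 64-bit typedef is defined. */
void demo_format_hash64(char out[17], uint64_t h)
{
    snprintf(out, 17, "%016llx", (unsigned long long)h);
}
```

This only helps code that controls its own format strings; callers that pass the hash straight to PRIx64 still depend on the typedef matching uint64_t.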
Some Doxygen documentation added in #462
Declaring relevant functions as pure and const would be a good follow up.
XXH_SIZEOPT config option
- ==0: normal
- ==1: Disables forceinline and manual unrolling
- ==2: Reuse streaming API for single shot, other dirty size hacks?
accumulate_doubleAcc()
{
    xxh_u64 acc2[8] = {0};
    size_t n = 0;
    // alternative: duffs device but that might harm
    // interleaving
    if (nbStripes % 2 == 1) {
        accumulate_512(acc2);
        n++;
    }
    while (n < nbStripes) {
        accumulate_512(acc);
        n++;
        accumulate_512(acc2);
        n++;
    }
    for (n = 0; n < 8; n++) {
        acc[n] += acc2[n];
    }
}
I didn't see any benefit on NEON AArch64 no matter how many NEON lanes I chose, and ARMv7 and SSE2 don't have enough registers.
However, I think that AVX2 and AVX512 would likely benefit since their loops are tighter. I will benchmark on Ryzen when I get a chance.
Edit: Clang shows no difference on Ryzen 5 3600, and GCC clearly gets confused.
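For anyone who wants to repeat the experiment, here is a compilable version of the double-accumulator loop. demo_accumulate_stripe is a hypothetical stand-in for accumulate_512, and the layout (8 lanes of 64 bits per stripe) is an assumption made for the sketch. The stand-in is additive, which is what makes the final merge of the two accumulators valid.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LANES 8

/* Hypothetical stand-in for accumulate_512: folds one 8-lane stripe into acc. */
static void demo_accumulate_stripe(uint64_t acc[DEMO_LANES], const uint64_t *stripe)
{
    for (size_t i = 0; i < DEMO_LANES; i++) {
        acc[i] += stripe[i] * 0x9E3779B185EBCA87ULL;
    }
}

/* Alternates stripes between two independent accumulators, then merges them. */
void demo_accumulate_double(uint64_t acc[DEMO_LANES],
                            const uint64_t *stripes, size_t nbStripes)
{
    uint64_t acc2[DEMO_LANES] = {0};
    size_t n = 0;

    /* Odd stripe count: peel one stripe so the main loop handles pairs. */
    if (nbStripes % 2 == 1) {
        demo_accumulate_stripe(acc2, stripes + n * DEMO_LANES);
        n++;
    }
    while (n < nbStripes) {
        demo_accumulate_stripe(acc, stripes + n * DEMO_LANES);
        n++;
        demo_accumulate_stripe(acc2, stripes + n * DEMO_LANES);
        n++;
    }
    /* Merge the second accumulator back into the first. */
    for (n = 0; n < DEMO_LANES; n++) {
        acc[n] += acc2[n];
    }
}
```

The point of the second accumulator is to break the dependency chain between consecutive stripes; whether that pays off depends on register pressure, as the NEON, ARMv7 and SSE2 observations above indicate.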
Updated list of objectives for v0.8.2
I was considering (as a tentative objective) shipping DISPATCH=1 by default for the generation of xxhsum CLI, but there is still a significant amount of work to do to reach this stage safely, and I'm concerned it would delay v0.8.2 release by too long.
So, instead, I've pushed this objective to a hypothetical v0.8.3 future release (will be folded into v0.9.0 if need be).
As for XXH_OLD_NAMES, should we add a deprecation message for it in v0.8.2?
> As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?
This seems like a good idea.
If I understand properly, XXH_OLD_NAMES is disabled by default, so the warning could be triggered just on detecting if it's set.
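A minimal sketch of such a check, assuming a compiler that understands #pragma message (GCC, Clang and MSVC do). The message wording, DEMO_OLD_NAMES_ENABLED and demo_old_names_enabled are hypothetical and only there to make the fragment self-contained.

```c
/* Sketch only: emit a compile-time note when the opt-in macro is set. */
#ifdef XXH_OLD_NAMES
#  if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#    pragma message("XXH_OLD_NAMES is deprecated and may be removed in a future release")
#  endif
#  define DEMO_OLD_NAMES_ENABLED 1
#else
#  define DEMO_OLD_NAMES_ENABLED 0
#endif

/* Hypothetical helper so the state can be inspected at run time. */
int demo_old_names_enabled(void)
{
    return DEMO_OLD_NAMES_ENABLED;
}
```

Because the macro is off by default, builds that never define it see no message at all.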
We can add a "Milestone" via the right side pane of the issue view, and we can also assign or reuse it on other issues to indicate their milestone. It'll be useful for future reviews of issues.
For example, we can assign the v0.9.0 milestone to #483.
List updated June 30th 2023:
Objectives for v0.8.2 - Completed
- SVE detection message at CLI prompt.
- XXH64 and maybe XXH32 ? (tentative) (may not benefit) -> benefit is too small, abandoned.
Objectives for v0.8.3
- cmake minimum version to v3.5 (#859) - completed
- xxhsum with DISPATCH enabled by default ? (tentative)
- AVX2 is enabled at compilation time (#839)
- arm64dispatch variant, for SVE/NEON dispatch (tentative, based on #762)
Objectives for v0.9.0
- XXH_generateSecret()
- XXH_OLDNAME (or maybe v1.0 ?)
- xxhash.h into multiple smaller files, and ship an ability to create a single amalgamated file from them ? (tentative)
- XXH_ENABLE_XXH32, XXH_ENABLE_XXH64 and XXH_ENABLE_XXH3
  - XXH32 (with XXH_NO_LONG_LONG), or XXH32 + XXH64 (with XXH_NO_XXH3) or everything. It's not possible to compile selectively only XXH64 for example.
- libxxhash dynamic library with runtime vector extension detection enabled by default (xxh_x64dispatch) ? (tentative)
  - DISPATCH=1 makefile parameter. In which case, it unconditionally adds xxh_x86dispatch.c to the unit list, which crashes on non-x86 targets. Making this option default requires a safe capability to detect x86 targets at build or compilation time.
  - DISPATCH must therefore substitute its symbols to original ones.
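For the compilation-time half of the x86 detection problem, the usual approach is to test the compiler's predefined target macros. The sketch below uses the GCC/Clang and MSVC spellings; DEMO_TARGET_X86 and demo_can_use_x86_dispatch are hypothetical names, and nothing here addresses detection on the build-system (makefile) side.

```c
/* Predefined by GCC/Clang (__x86_64__, __i386__) and MSVC (_M_X64, _M_IX86). */
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#  define DEMO_TARGET_X86 1
#else
#  define DEMO_TARGET_X86 0
#endif

/* Returns 1 when an x86-only dispatch unit could be compiled in, 0 otherwise. */
int demo_can_use_x86_dispatch(void)
{
    return DEMO_TARGET_X86;
}
```

Guarding the body of an x86-only translation unit with a check like this would let it compile to an empty unit elsewhere, though the build would still need a fallback path for the symbols it expects.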