Thanks for the notification @anholt.
I was looking at the commit message, but was unable to determine with certainty whether the problem comes from aliasing or from unaligned reads. The commit message talks about aliasing, but the attached fix effectively impacts unaligned reads and nothing else.
edit: OK, my bad, I misread; the problem was not about "aliasing" in general, but about "strict aliasing", which is different.
I needed both of the defines to avoid u32 access without getting deeper into xxh changes:
My actual failing data is 32-bit aligned in its pointer address.
(As a side note, the affected platform was arm64.)
The unaligned memory access path is supposed to use memcpy() by default, so I'm surprised you needed to set XXH_FORCE_MEMORY_ACCESS == 0; I would expect this value to already be 0.
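For reference, that path reads each word with memcpy(), roughly like this (a sketch, not the exact xxhash.h code):

#include <stdint.h>
#include <string.h>   /* memcpy */

/* XXH_FORCE_MEMORY_ACCESS==0: portable unaligned read via memcpy().
 * Compilers typically lower this to a single load instruction. */
static uint32_t read32_memcpy(const void* memPtr)
{
    uint32_t val;
    memcpy(&val, memPtr, sizeof(val));
    return val;
}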
As for the aligned data shortcut, well that's an interesting (and unexpected) development.
The aligned data shortcut will read data directly as xxh_u32*, on the grounds that the pointer address is 32-bit aligned, which has been previously verified.
However, it's true that, if the buffer was previously filled using another pointer type, the compiler is free to consider that both actions are unrelated, and even order them differently.
The thing that was supposed to protect such a scenario from happening is that there are several layers of invocations using const void* pointers before reaching the function that actually reads the data, XXH_readLE32_align(), which uses xxh_u32*.
And if I remember correctly, void* pointers and char* pointers are supposed to be immune to the strict aliasing rule, meaning that the compiler shall never blindly consider that such a pointer doesn't alias anything.
Now, if the compiler just rips through the layers of indirection due to inlining, and goes directly to the 32-bit read instruction, then it could forget the void* intermediate reference, notice the different pointer types between production and ingestion, and then possibly re-order them.
It's a pity, as I was expecting the const void* intermediate indirection to protect against such a scenario. Apparently that's not the case, so that will require more careful reading of the strict aliasing rule and a potential work-around (beside removing the aligned data shortcut).
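As a minimal sketch of the suspected scenario (hypothetical struct and helper, not the Mesa code): the buffer is filled through one pointer type, the aligned shortcut then reads it back through uint32_t*, and strict aliasing allows the compiler to treat the two as unrelated:

#include <stdint.h>
#include <string.h>

struct key { uint16_t a; uint16_t b; uint32_t c; };

static uint32_t sum_words(const struct key *k)
{
    /* The pointer is 32-bit aligned, so the "aligned" shortcut reads it
     * directly as uint32_t*. Under strict aliasing, these loads may be
     * treated as independent from stores made through struct key members,
     * so the compiler can reorder or hoist them. */
    const uint32_t *p = (const uint32_t *)k;
    return p[0] + p[1];
}

uint32_t demo(void)
{
    struct key k;
    memset(&k, 0, sizeof(k));
    k.a = 1; k.b = 2; k.c = 3;    /* stores through struct-member lvalues */
    return sum_words(&k);         /* reads through uint32_t lvalues: UB */
}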
While waiting for a solution, a quick workaround could be to disable the aligned data shortcut by default.
I'll go double check whether I needed to set both, maybe it was just that I added the alignment force afterwards.
In my reading trying to sort this out, I didn't find anything saying that an intermediate cast to void or char helped avoid an aliasing violation, only actual access through a char * lvalue. This makes sense -- if the compiler could prove that a usage in one place is just a set of casts away from a usage in another place, then it could just do the right thing and you wouldn't even need these aliasing rules!
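To illustrate the distinction (a standalone sketch, not from either codebase): the intermediate cast is irrelevant, only the type of the lvalue doing the access matters, and character types are the exception:

#include <stdint.h>

void poke(uint32_t *x)
{
    /* The intermediate void* cast changes nothing: the access below is
     * still made through a uint16_t lvalue, which may not alias uint32_t. */
    uint16_t *p16 = (uint16_t *)(void *)x;
    *p16 = 1;                             /* strict-aliasing violation */

    /* Character-type lvalues may alias anything. */
    unsigned char *pc = (unsigned char *)x;
    pc[0] = 1;                            /* fine */
}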
I will likely need a reproduction case in order to properly analyze this impact, and eventually find a solution if there is one.
My understanding of the problem makes it entirely strict-aliasing related, and enabled by inlining. If that's so, it should be possible to reproduce it on a regular x64 system.
For reference, GCC does choose the direct cast on AArch64, but Clang does not.
I wonder if we still need this; compiling ARM32 GCC right now to see if it does proper unaligned loads without the hack.
GCC 10 generates worse code with memcpy for ARMv7 non-NEON, despite making proper unaligned loads. :thinking:
However, it actually does almost as well with the UB-proof byteshift method, more than enough for me to say :+1:
Compiled on device because I would rather compile GCC 10 than hook up to my PC.
arm-linux-androideabi-gcc-10 -O3 -DXXH_FORCE_MEMORY_ACCESS=$n -S -march=armv7-a -mfpu=softvfp xxhash.c -o xxhash.s
# constant
clang -m32 -O3 -march=armv7-a -mfpu=softfp -c xxhsum.c -o xxhsum.o
clang -m32 -march=armv7-a -mfpu=softfp -static xxhash.s xxhsum.o -o xxhsum
XXH_FORCE_MEMORY_ACCESS=0:
Sample of 100 KB...
XXH32 : 102400 -> 25265 it/s ( 2467.3 MB/s)
XXH32 unaligned : 102400 -> 26409 it/s ( 2579.0 MB/s)
XXH64 : 102400 -> 11232 it/s ( 1096.9 MB/s)
XXH64 unaligned : 102400 -> 10177 it/s ( 993.8 MB/s)
XXH3_64b : 102400 -> 13120 it/s ( 1281.2 MB/s)
XXH3_64b unaligned : 102400 -> 12321 it/s ( 1203.2 MB/s)
XXH3_64b w/seed : 102400 -> 13047 it/s ( 1274.1 MB/s)
XXH3_64b w/seed unaligned : 102400 -> 13008 it/s ( 1270.3 MB/s)
XXH3_64b w/secret : 102400 -> 10587 it/s ( 1033.9 MB/s)
XXH3_64b w/secret unaligned : 102400 -> 10408 it/s ( 1016.4 MB/s)
XXH128 : 102400 -> 11510 it/s ( 1124.0 MB/s)
XXH128 unaligned : 102400 -> 11474 it/s ( 1120.5 MB/s)
XXH128 w/seed : 102400 -> 11383 it/s ( 1111.6 MB/s)
XXH128 w/seed unaligned : 102400 -> 11149 it/s ( 1088.7 MB/s)
XXH128 w/secret : 102400 -> 9372 it/s ( 915.3 MB/s)
XXH128 w/secret unaligned : 102400 -> 9253 it/s ( 903.7 MB/s)
XXH_FORCE_MEMORY_ACCESS=1:
Sample of 100 KB...
XXH32 : 102400 -> 25353 it/s ( 2475.9 MB/s)
XXH32 unaligned : 102400 -> 27181 it/s ( 2654.4 MB/s)
XXH64 : 102400 -> 11807 it/s ( 1153.0 MB/s)
XXH64 unaligned : 102400 -> 12265 it/s ( 1197.8 MB/s)
XXH3_64b : 102400 -> 23028 it/s ( 2248.8 MB/s)
XXH3_64b unaligned : 102400 -> 22140 it/s ( 2162.1 MB/s)
XXH3_64b w/seed : 102400 -> 23483 it/s ( 2293.3 MB/s)
XXH3_64b w/seed unaligned : 102400 -> 22078 it/s ( 2156.1 MB/s)
XXH3_64b w/secret : 102400 -> 21520 it/s ( 2101.5 MB/s)
XXH3_64b w/secret unaligned : 102400 -> 21320 it/s ( 2082.1 MB/s)
XXH128 : 102400 -> 22544 it/s ( 2201.6 MB/s)
XXH128 unaligned : 102400 -> 20747 it/s ( 2026.1 MB/s)
XXH128 w/seed : 102400 -> 22945 it/s ( 2240.7 MB/s)
XXH128 w/seed unaligned : 102400 -> 21903 it/s ( 2139.0 MB/s)
XXH128 w/secret : 102400 -> 20876 it/s ( 2038.7 MB/s)
XXH128 w/secret unaligned : 102400 -> 19951 it/s ( 1948.3 MB/s)
XXH_FORCE_MEMORY_ACCESS=3:
Sample of 100 KB...
XXH32 : 102400 -> 24292 it/s ( 2372.3 MB/s)
XXH32 unaligned : 102400 -> 27166 it/s ( 2652.9 MB/s)
XXH64 : 102400 -> 11853 it/s ( 1157.6 MB/s)
XXH64 unaligned : 102400 -> 11636 it/s ( 1136.3 MB/s)
XXH3_64b : 102400 -> 21557 it/s ( 2105.2 MB/s)
XXH3_64b unaligned : 102400 -> 21644 it/s ( 2113.7 MB/s)
XXH3_64b w/seed : 102400 -> 22328 it/s ( 2180.5 MB/s)
XXH3_64b w/seed unaligned : 102400 -> 21162 it/s ( 2066.6 MB/s)
XXH3_64b w/secret : 102400 -> 21337 it/s ( 2083.7 MB/s)
XXH3_64b w/secret unaligned : 102400 -> 20872 it/s ( 2038.3 MB/s)
XXH128 : 102400 -> 22787 it/s ( 2225.3 MB/s)
XXH128 unaligned : 102400 -> 22348 it/s ( 2182.4 MB/s)
XXH128 w/seed : 102400 -> 23463 it/s ( 2291.3 MB/s)
XXH128 w/seed unaligned : 102400 -> 22669 it/s ( 2213.8 MB/s)
XXH128 w/secret : 102400 -> 21969 it/s ( 2145.4 MB/s)
XXH128 w/secret unaligned : 102400 -> 21279 it/s ( 2078.1 MB/s)
aarch64-linux-android-gcc-9 (not compiling GCC 10 again) prefers memcpy. Byteshift is slightly slower than the reinterpret cast, which is very slightly slower than memcpy.
It is within ~150 MB/s, though, so it's a very minor difference.
Damn it GCC, you're so inconsistent. :unamused:
Note that XXH_FORCE_MEMORY_ACCESS=2 bus errors, targeting both ARMv7-a and ARMv6T2.
Neither GCC nor Clang can safely use this. Let's kill it.
I'd rather have an older GCC version run a bit slower than bus error.
How about having XXH_FORCE_MEMORY_ACCESS > 0 use byteshift? Get rid of the UB, since it doesn't seem to have much of a benefit.
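For reference, the byteshift method boils down to roughly this (a sketch, not the exact xxhash.h code):

#include <stdint.h>

/* Byteshift read: no unaligned or type-punned loads, only byte accesses.
 * Reads a 32-bit little-endian value; compilers that support unaligned
 * loads usually recognize the pattern and emit a single instruction. */
static uint32_t readLE32_byteshift(const uint8_t *ptr)
{
    return (uint32_t)ptr[0]
         | ((uint32_t)ptr[1] << 8)
         | ((uint32_t)ptr[2] << 16)
         | ((uint32_t)ptr[3] << 24);
}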
So I'm surprised you needed to set XXH_FORCE_MEMORY_ACCESS == 0, I would expect this value to be already set at 0.
When I look at the code I see:
#ifndef XXH_FORCE_MEMORY_ACCESS /* can be defined externally, on command line for example */
# if !defined(__clang__) && defined(__GNUC__) && defined(__ARM_FEATURE_UNALIGNED) && defined(__ARM_ARCH) && (__ARM_ARCH == 6)
# define XXH_FORCE_MEMORY_ACCESS 2
# elif !defined(__clang__) && ((defined(__INTEL_COMPILER) && !defined(_WIN32)) || \
(defined(__GNUC__) && (defined(__ARM_ARCH) && __ARM_ARCH >= 7)))
# define XXH_FORCE_MEMORY_ACCESS 1
# endif
#endif
So, if I don't define that to 0 myself, we'll do plain old u32* access on armv6, and packed u32 union access on armv7+. Just XXH_FORCE_ALIGN_CHECK 0 is enough to work around my failure, but given that I've been bitten already, I don't want to do anything that invokes UB.
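For illustration, the overrides amount to something like this (a sketch; the macros can be set before including the header or passed with -D on the compiler command line):

/* e.g. -DXXH_FORCE_MEMORY_ACCESS=0 -DXXH_FORCE_ALIGN_CHECK=0,
 * or defined before including the header: */
#define XXH_FORCE_MEMORY_ACCESS 0   /* always use the memcpy() read path */
#define XXH_FORCE_ALIGN_CHECK   0   /* never take the aligned-data shortcut */
#define XXH_INLINE_ALL
#include "xxhash.h"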
A first action is to reproduce the issue. I hope that's possible. That would make it possible to study a fix.
In the meantime, a reasonable work-around would be to not automatically enable XXH_FORCE_MEMORY_ACCESS = 2 on armv6 when inlining is detected. Or even never enable it.
Packed unions should work fine. Their main issue is that this access method is limited to a few compliant compilers, as opposed to memcpy(), which is highly portable.
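For context, the packed-union route (XXH_FORCE_MEMORY_ACCESS==1) relies on a compiler extension and boils down to roughly this (a sketch, not the exact xxhash.h code):

#include <stdint.h>

/* GCC/Clang extension: a packed (and may_alias) union tells the compiler
 * the access may be unaligned and may alias other types, so it emits a
 * safe unaligned load instead of assuming natural alignment. */
typedef union { uint32_t u32; } __attribute__((packed, may_alias)) unalign32;

static uint32_t read32_union(const void *ptr)
{
    return ((const unalign32 *)ptr)->u32;
}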
Note that XXH_FORCE_MEMORY_ACCESS=2 bus errors, targeting both ARMv7-a and ARMv6T2.
That's known. Several architectures generate unsafe assembly code with this memory access mode. For armv7, that's because the compiler may decide to use some faster aligned load instructions, which is obviously incorrect when the input is not aligned. But even x64 can go wrong, as the compiler could decide to use aligned vector load instructions to speed up loading, segfaulting the same way.
This access mode is by far the most dangerous, and is only there to handle pathological cases of bad performance due to poor compiler optimization, when no other method works.
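For completeness, this mode is nothing more than a direct dereference, roughly (a sketch):

#include <stdint.h>

/* XXH_FORCE_MEMORY_ACCESS==2 style read: undefined behavior if ptr is not
 * suitably aligned or aliases an object of a different type. The compiler
 * may emit alignment-assuming (e.g. vector) loads, hence the bus errors. */
static uint32_t read32_direct(const void *ptr)
{
    return *(const uint32_t *)ptr;
}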
Doing it only for the INLINE define is not good enough, because with LTO enabled, you can get your code inlined later anyway. You really just have to not invoke UB.
By the way, do you have an isolated example of the code causing the bug? It might be useful to add to the self-tests / to diagnose.
It's highly dependent on compiler version (gcc 9.3.0 for me here, others didn't see it), opt flags, and surrounding code (you have to memset the struct before setup, don't = { } it, don't print the hash value from inside the key_hash function). Given all that, trying to make a testcase feels like a futile exercise to me.
We may not be able to make it a CI test, but even if it's just one test working on one configuration, it is still useful to study a fix.
Since it's primarily a strict aliasing issue, I see it as being primarily a C compiler issue, (mostly) independent from the underlying hardware architecture (I say "mostly" because the decision to re-order instructions could be a consequence of evaluating a cost function for a given architecture model, so the issue may end up not being visible on a different one). gcc v9.3.0 is a good starting point.
In the original issue at the Mesa bug tracker, Eric suggested using __attribute__((__may_alias__)) (supported in GCC since ~2005), but it would be good to know how pessimistic the code GCC produces is when this is used for data accesses.
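A minimal sketch of what that approach could look like (hypothetical helper, GCC/Clang extension):

#include <stdint.h>

/* A may_alias-qualified type is exempt from strict-aliasing assumptions,
 * similar to character types, so loads through it cannot be reordered
 * past stores made through other pointer types. */
typedef uint32_t __attribute__((__may_alias__)) u32_alias;

static uint32_t read32_may_alias(const void *ptr)
{
    return *(const u32_alias *)ptr;
}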
AFAIK Clang and LLVM are extremely good at recognizing memcpy patterns and optimizing them into highly optimized code while preserving correct aliasing semantics. But I don't think GCC is as good at it.
Quick question:
In the linked code, I see that the issue seems to be located in this function:
static uint32_t
key_hash(const void *_key)
{
const struct fd6_texture_key *key = _key;
return XXH32(key, sizeof(*key), 0);
}
so when the address of key is aligned on a 4-byte boundary (which it probably is), XXH32() on arm will use the "direct" path, ingesting input using the const xxh_u32* type.
What about the following variant?
static uint32_t
key_hash(const void *_key)
{
return XXH32(_key, sizeof(struct fd6_texture_key), 0);
}
Is it still affected by strict aliasing?
Also: is there any gcc warning triggered when using -Wstrict-aliasing=1?
C aliasing rules don't care about your casts there; they care about the types that the eventual memory accesses (pointer dereferences) happen through. No -Wstrict-aliasing=1 warnings in the area of this code (but then, there's a lot of code for the compiler to see through to try to detect a strict aliasing failure here).
Well, char* is a famous exception, so the compiler has to care about pointer casts, if only to track this exception.
Also, memcpy(void*, const void*, size_t) works too; I have never heard of a memcpy() being re-ordered on the grounds that it operates on a different type (void*) than the one used to fill the copied buffer. And that's without using a char* type at its interface. So there must be some way the compiler understands that these operations are interdependent and can't be freely reordered.
Whether it's part of the language or not (a compiler-specific mechanism or direct assembly come to mind), that is the part I don't know. Whichever it is, it would be instructive in order to formulate a fix.
edit: I've been trying to reproduce the issue described above. So far no luck.
Another work-around I was considering:
The reported issue is happening on aarch64, and it's a consequence of following the "aligned memory access" path. The existence of this path is controlled by XXH_FORCE_ALIGN_CHECK.
XXH_FORCE_ALIGN_CHECK is automatically disabled on x86/x64, on the grounds that these platforms have very efficient unaligned memory access, and therefore don't need a specific "aligned memory access" path.
As far as I know, aarch64 cpus are all compatible with unaligned memory access, and there is no instruction difference between loading from aligned memory and from unaligned memory. Which means it's the same situation as x86/x64.
As a consequence, it seems valid to simply add aarch64 to the list of architectures for which XXH_FORCE_ALIGN_CHECK is disabled by default.
That would have avoided the problem described in this issue.
It's a work-around because it circumvents the issue without addressing the root cause (presumed related to strict aliasing). But without a reproduction case, that second step is a lot more difficult.
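Concretely, the default could be extended along these lines (a sketch of the intent; the actual condition in xxhash.h may differ):

#ifndef XXH_FORCE_ALIGN_CHECK   /* can still be overridden externally */
   /* x86, x64 and aarch64 handle unaligned loads efficiently, so the
    * separate "aligned" code path brings no benefit there. */
#  if defined(__i386) || defined(__x86_64__) || defined(__aarch64__) \
   || defined(_M_IX86) || defined(_M_X64) || defined(_M_ARM64)
#    define XXH_FORCE_ALIGN_CHECK 0
#  else
#    define XXH_FORCE_ALIGN_CHECK 1
#  endif
#endif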
The result of the aliasing rules:
An intermediate cast to (char *) does nothing; it's the lvalue that matters. memcpy works because the lvalues that memcpy accesses are char. Some spec citation at https://www.approxion.com/pointers-c-part-iii-strict-aliasing-rule/
Ah, I was wrong about memcpy doing char access. Apparently there's just specific spec trying to make it do what it's supposed to:
If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one.
But that still leaves me a bit confused:
https://stackoverflow.com/questions/25060794/writing-memcpy-conformant-with-strict-aliasing
The specific issue starting this thread (the strict aliasing issue with XXH_INLINE_ALL on aarch64) has been mitigated by making the "direct path" no longer active by default on aarch64 (same as x86 or x64).
I've been trying to find a more general solution, but I haven't. The "only" portable solution is to use memcpy(), which is exactly the option that destroys performance for some families of cpus and compilers. Beyond that, advertised solutions quickly become compiler-specific.
Anyway, I don't plan to spend more time on this topic, and since the situation is fixed for aarch64, I'm going to close this issue.
In Mesa we just started using xxhash (inlined in many callsites), and the compiler in some circumstances produces different hash results that appear to be due to the strict aliasing violations.
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5271