ghost closed this issue 7 months ago
Perhaps the reason is performance?
It is for performance; you can check it with wasmBenchmark. IMO we can avoid it another way.
@ksh8281 do you have a suggestion for another way? This is an issue on arm32, where the compiler thinks the 64-bit value is 4-byte aligned and uses the `ldrd`/`strd` pair instructions. The engine then crashes with an alignment fault.
How about adding specialized versions of `Memory::load` and `Memory::store` for `uint64_t` on arm32? IMHO the specialized versions can use `memcpy`. In my experience, when `memcpy` is used, the compiler assumes the destination address may be unaligned, so there is no crash on arm32: https://github.com/Samsung/escargot/blob/c12763a4df124db7f60a7299cd51d5b86e8986a3/src/codecache/CodeCacheReaderWriter.h#L130 Alternatively, we can divide the 64-bit read/write into two 32-bit reads/writes on arm32: https://github.com/Samsung/escargot/blob/c12763a4df124db7f60a7299cd51d5b86e8986a3/src/runtime/TypedArrayInlines.h#L172
As far as I know, compilers perform `memcpy` optimizations when possible. They understand that you want to move an unaligned 64-bit value and do it in the best way (e.g. two unaligned 32-bit loads). The proposed code also uses `memcpy`, because the called function is just an endian-aware `memcpy`. We could check what happens if we use `memcpy` everywhere.
It already gets specialized, because it's a template. I looked at the generated assembly: there was no call to `memcpy` (or at least I couldn't find one, though I'm double-checking now with Ghidra); it was optimized out, just as expected.
As far as I know, compilers always create new code for every template instantiation, because that tends to be better for speed, even though it's worse for code size.
These are all the references to `Walrus::Memory::load` in release mode on armhf, according to `arm-linux-gnueabihf-objdump -SldrwC out-shell/release/arm/walrus | grep Walrus::Memory::load`:
```
void Walrus::Memory::load<Walrus::SIMDValue<unsigned int, (unsigned char)2> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<unsigned int, (unsigned char)2>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<int, (unsigned char)2> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<int, (unsigned char)2>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<unsigned short, (unsigned char)4> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<unsigned short, (unsigned char)4>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<short, (unsigned char)4> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<short, (unsigned char)4>*) const:
```
As for `store`, there are no references:

```
% arm-linux-gnueabihf-objdump -SldrwC out-shell/release/arm/walrus | grep Walrus::Memory::store
%
```
I think it's pretty clear the compiler knows what it's doing.
Could you measure performance (on x64 and arm32)? I wonder whether performance is the same in both cases. If it is, we should use `memcpy`, since it is safe. Even if you use `endiannessAwareMemcpy`, you may not find a call to `memcpy`: for gcc and clang, `memcpy` is not a general function but a compiler intrinsic, so it may be inlined.
https://godbolt.org/z/e988zo9bW
You can see that the compiler generates different code for the two cases:
```cpp
#include <cinttypes>
#include <cstring>

void store(unsigned char* buffer, uint32_t offset, const uint64_t val)
{
    *(reinterpret_cast<uint64_t*>(&buffer[offset])) = val;
}

void store2(unsigned char* buffer, uint32_t offset, const uint64_t val)
{
    memcpy(&buffer[offset], &val, sizeof(uint64_t));
}
```
```
store(unsigned char*, unsigned int, unsigned long long):
        add     r0, r0, r1
        strd    r2, [r0]
        bx      lr
store2(unsigned char*, unsigned int, unsigned long long):
        add     ip, r0, r1
        str     r2, [r0, r1]    @ unaligned
        str     r3, [ip, #4]    @ unaligned
        bx      lr
```
IMHO there is a performance difference, so I think we can use `memcpy` only for 64-bit values on arm32.
FYI, for 32-bit values both cases generate the same asm code:
```cpp
void store(unsigned char* buffer, uint32_t offset, const uint32_t val)
{
    *(reinterpret_cast<uint32_t*>(&buffer[offset])) = val;
}

void store2(unsigned char* buffer, uint32_t offset, const uint32_t val)
{
    memcpy(&buffer[offset], &val, sizeof(uint32_t));
}
```
```
store(unsigned char*, unsigned int, unsigned int):
        str     r2, [r0, r1]
        bx      lr
store2(unsigned char*, unsigned int, unsigned int):
        str     r2, [r0, r1]    @ unaligned
        bx      lr
```
The `strd`/`ldrd` pair instruction is faster, but it does not support unaligned access. The walrus function needs to support unaligned access; this is what we are trying to explain.
When I tested, the r/w functions were performance-critical for the interpreter. I think we can add another version of the functions that supports unaligned access in the jit (for x86, just calling the original function).
There might be some misunderstanding here. This is an interpreter-related issue; the jit does not use them. The compiler assumes that the 64-bit data is loaded from a 64-bit aligned address, but this is not always true for WebAssembly. On x86 this is not an issue, but on ARM the compiler generates a load/store register pair instruction, which does not support unaligned access for some reason:
The idea is using an 8-byte `memcpy`, which is optimized to the same code as the original on x86, but on arm it uses two 32-bit transfers instead of the specialized instruction.
I think I understood it correctly the first time. In my experiments, SIGBUS only occurred when reading and writing 64-bit values on an ARM machine; reading and writing 32-bit values caused no such problems.
So, my conclusion was: when reading and writing 32-bit values, use the existing method, and when reading and writing 64-bit (`uint64_t`) values, use `memcpy` on arm through template specialization (i.e. implement specialized `Memory::load` and `Memory::store`).
If it has been tested that the compiler produces good code on many platforms even when using only `memcpy`, then it seems safe to use `memcpy`.
If I'm wrong, please let me know.
@zherczeg
> The idea is using an 8-byte `memcpy`, which is optimized to the same code as the original on x86, but on arm it uses two 32-bit transfers instead of the specialized instruction.
This is right in most cases. But when we look at other machines like RISC-V, it seems that compilers still generate rather poor code for `memcpy` (you can see this in the RISC-V 32-bit case). This imperfect optimization of `memcpy` might be a critical issue in the future.
I thought compilers were smart enough for this. I am sorry.
Maybe we can close this until we find a better solution.
What about choosing the method suggested by @ksh8281 in https://github.com/Samsung/walrus/pull/193#issuecomment-1874932438?
> So, my conclusion was: when reading and writing 32-bit values, use the existing method, and when reading and writing 64-bit (`uint64_t`) values, use `memcpy` on arm through template specialization (i.e. implement specialized `Memory::load` and `Memory::store`).
New patch: #226
Unaligned loads and stores used a different code path based on how the address is passed (`offset` vs `offset+addend`): one used `endiannessAwareMemcpy`, the other used casting and C++ assignment. The latter would result in `SIGBUS` errors on ARM due to unaligned loads and stores. The fix was pretty simple: just use `memcpy` in both cases, with the `offset` case being treated as a special case of `offset+addend`.

Possible bikeshedding opportunity: should the bounds check be removed from the `offset` variant? Compilers will likely optimize it away anyways, so I thought it's better to leave it in, just in case someone modifies the implementation and forgets to re-add the check.