ghost closed this issue 7 months ago
Perhaps the reason is performance?
It is for performance; you can check it with wasmBenchmark. IMO we can avoid it another way.
@ksh8281 do you have a suggestion for another way? This is an issue on arm32, where the compiler thinks the 64-bit value is 4-byte aligned and uses the `ldrd`/`strd` pair instructions. The engine then crashes with an alignment fault.
How about adding specialized versions of `Memory::load` and `Memory::store` for `uint64_t` on arm32? IMHO the specialized versions can use `memcpy`. In my experience, when `memcpy` is used, the compiler assumes the destination address may be unaligned, so there is no crash on arm32: https://github.com/Samsung/escargot/blob/c12763a4df124db7f60a7299cd51d5b86e8986a3/src/codecache/CodeCacheReaderWriter.h#L130 Alternatively, we can divide the 64-bit read/write into two 32-bit reads/writes on arm32: https://github.com/Samsung/escargot/blob/c12763a4df124db7f60a7299cd51d5b86e8986a3/src/runtime/TypedArrayInlines.h#L172
As far as I know, compilers perform `memcpy` optimizations when possible. They understand that you want to move an unaligned 64-bit value and do it in the best way (e.g. two unaligned 32-bit loads). The proposed code also uses `memcpy`, because the called function is just an endian-aware `memcpy`. We could check what happens if we use `memcpy` everywhere.
It already gets specialized, because it's a template. I looked at the generated assembly: there was no call to `memcpy` (or at least I couldn't find one, though I'm double-checking now with Ghidra); it was optimized out, just as expected.
As far as I know, compilers always create new code for every template instantiation, because that tends to be better for speed, even though it's worse for code size.
These are all the references to `Walrus::Memory::load` in release mode on armhf, according to `arm-linux-gnueabihf-objdump -SldrwC out-shell/release/arm/walrus | grep Walrus::Memory::load`:
```
void Walrus::Memory::load<Walrus::SIMDValue<unsigned int, (unsigned char)2> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<unsigned int, (unsigned char)2>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<int, (unsigned char)2> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<int, (unsigned char)2>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<unsigned short, (unsigned char)4> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<unsigned short, (unsigned char)4>*) const:
void Walrus::Memory::load<Walrus::SIMDValue<short, (unsigned char)4> >(Walrus::ExecutionState&, unsigned int, unsigned int, Walrus::SIMDValue<short, (unsigned char)4>*) const:
```
As for `store`, there are no references:

```
% arm-linux-gnueabihf-objdump -SldrwC out-shell/release/arm/walrus | grep Walrus::Memory::store
%
```
I think it's pretty clear the compiler knows what it's doing.
Could you measure performance (on x64 and arm32)? I wonder whether performance is the same in both cases. If it is, we should use `memcpy`, since it is safe. Even if you use `endiannessAwareMemcpy`, you may not find a call to `memcpy`: for gcc and clang, `memcpy` is not a general function but a compiler intrinsic, so it may be inlined.
https://godbolt.org/z/e988zo9bW
You can see that the compiler generates different code for the two cases:
```cpp
#include <cinttypes>
#include <cstring>

void store(unsigned char* buffer, uint32_t offset, const uint64_t val)
{
    *(reinterpret_cast<uint64_t*>(&buffer[offset])) = val;
}

void store2(unsigned char* buffer, uint32_t offset, const uint64_t val)
{
    memcpy(&buffer[offset], &val, sizeof(uint64_t));
}
```
```
store(unsigned char*, unsigned int, unsigned long long):
        add     r0, r0, r1
        strd    r2, [r0]
        bx      lr
store2(unsigned char*, unsigned int, unsigned long long):
        add     ip, r0, r1
        str     r2, [r0, r1]    @ unaligned
        str     r3, [ip, #4]    @ unaligned
        bx      lr
```
IMHO there is a performance difference, so I think we can use `memcpy` only for 64-bit values on arm32.
FYI, for 32-bit values both cases generate the same asm code:
```cpp
void store(unsigned char* buffer, uint32_t offset, const uint32_t val)
{
    *(reinterpret_cast<uint32_t*>(&buffer[offset])) = val;
}

void store2(unsigned char* buffer, uint32_t offset, const uint32_t val)
{
    memcpy(&buffer[offset], &val, sizeof(uint32_t));
}
```
```
store(unsigned char*, unsigned int, unsigned int):
        str     r2, [r0, r1]
        bx      lr
store2(unsigned char*, unsigned int, unsigned int):
        str     r2, [r0, r1]    @ unaligned
        bx      lr
```
The `strd`/`ldrd` pair instruction is faster, but it does not support unaligned access. The walrus function needs to support unaligned access; this is what we are trying to explain.
When I tested, the r/w functions were performance-critical for the interpreter. I think we can add another version of the functions that supports unaligned access in the jit (for x86, just calling the original function).
There might be some misunderstanding here. This is an interpreter-related issue; the jit does not use them. The compiler assumes that the 64-bit data is loaded from a 64-bit aligned address, but this is not always true for WebAssembly. On x86 this is not an issue, but on ARM the compiler generates a load/store register pair instruction, which does not support unaligned access for some reason:
The idea is using an 8-byte `memcpy`, which is optimized to the same code as the original on x86, but on arm it uses two 32-bit transfers instead of the specialized instruction.
I think I understood it correctly the first time. In my experiments, SIGBUS only occurred when reading and writing 64-bit values on an ARM machine; reading and writing 32-bit values caused no such problems.
So, my conclusion was: when reading and writing 32-bit values, use the existing method, and when reading and writing 64-bit (`uint64_t`) values, use `memcpy` on arm through template specialization (i.e. implement specialized `Memory::load` and `Memory::store`).
If it has been tested that the compiler produces good code on many platforms even when using only `memcpy`, then it seems safe to use `memcpy`.
If I'm wrong, please let me know.
@zherczeg
> The idea is using an 8-byte `memcpy`, which is optimized to the same code as the original on x86, but on arm it uses two 32-bit transfers instead of the specialized instruction.
This is right in most cases. But when we look at other machines like RISC-V, it seems that compilers still generate rather poor code for `memcpy` (you can see this in the RISC-V 32-bit case). This imperfect optimization of `memcpy` might be a critical issue in the future.
I thought compilers were smart enough for this. I am sorry.
Maybe we can close this until we find a better solution.
What about choosing the method suggested by @ksh8281 in https://github.com/Samsung/walrus/pull/193#issuecomment-1874932438?
> So, my conclusion was: when reading and writing 32-bit values, use the existing method, and when reading and writing 64-bit (`uint64_t`) values, use `memcpy` on arm through template specialization (i.e. implement specialized `Memory::load` and `Memory::store`).
New patch: #226
Unaligned loads and stores used a different code path based on how the address is passed (`offset` vs `offset+addend`): one used `endiannessAwareMemcpy`, the other used casting and C++ assignment. The latter would result in `SIGBUS` errors on ARM due to unaligned loads and stores. The fix was pretty simple: just use `memcpy` in both cases, with the `offset` case being treated as a special case of `offset+addend`.

Possible bikeshedding opportunity: should the bounds check be removed from the `offset` variant? Compilers will likely optimize it away anyways, so I thought it's better to leave it in, just in case someone modifies the implementation and forgets to re-add the check.