arvid-norlander opened this issue 1 year ago
@arvid-norlander For the interpreter, could you try changing WASM_CPU_SUPPORTS_UNALIGNED_ADDR_ACCESS to 0 in core/config.h? Or add add_definitions(-DWASM_CPU_SUPPORTS_UNALIGNED_ADDR_ACCESS=0) to CMakeLists.txt.
For fast-jit, it would really take some effort to modify the code, since we assume x86 supports unaligned memory access and there are many such cases, e.g., *(uintptr_t *)stream = (uintptr_t)stream + 11
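(For illustration, a minimal sketch of the kind of store in question and a memcpy-based equivalent that writes the same bytes without the unaligned cast; the function names are made up, this is not the actual fast-jit code:)

#include <stdint.h>
#include <string.h>

/* Illustrative only: emit a 64-bit immediate into a JIT instruction stream. */
static void emit_imm64_cast(uint8_t *stream, uint64_t imm)
{
    *(uint64_t *)stream = imm;         /* UB if stream is not 8-byte aligned */
}

static void emit_imm64_memcpy(uint8_t *stream, uint64_t imm)
{
    memcpy(stream, &imm, sizeof(imm)); /* writes the same bytes, no alignment UB */
}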
Sure, I could try that. I'll get back to you on that shortly.
While x86 does support unaligned access, I'm not sure what the compiler guarantees with respect to the standard. I believe it is allowed to optimize based on the assumption that you are not performing unaligned access, for example. At least I have seen this happen, where GCC vectorised a loop over a sometimes-unaligned array of uint32_t, which resulted in crashes in release builds.
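(To illustrate the failure mode, a hypothetical example, not WAMR code: a loop over a uint32_t pointer obtained from a possibly misaligned byte pointer may be auto-vectorized under the assumption of 4-byte alignment:)

#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: undefined behaviour if buf is not 4-byte aligned. */
uint32_t sum_words(const uint8_t *buf, size_t n_words)
{
    const uint32_t *p = (const uint32_t *)buf; /* misaligned cast is the UB */
    uint32_t sum = 0;
    for (size_t i = 0; i < n_words; i++)
        sum += p[i]; /* compiler may assume alignment and use aligned SSE loads */
    return sum;
}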
With -DWASM_CPU_SUPPORTS_UNALIGNED_ADDR_ACCESS=0 I get largely the same number of errors for JIT and the interpreter respectively. I also notice that I get errors in the AOT case (which I didn't test before):
$ ./iwasm simple_wasm_x86-64.aot
/home/user/wasm-micro-runtime/core/iwasm/fast-jit/cg/x86-64/jit_codegen_x86_64.cpp:9317:32: runtime error: store to misaligned address 0x55edcf2e07d5 for type 'int32', which requires 4 byte alignment
0x55edcf2e07d5: note: pointer points here
3c 00 0f 85 00 00 00 00 4c 89 e7 48 c7 c6 11 00 00 00 48 b8 80 2f e4 ce ed 55 00 00 ff d0 b8 01
^
/home/user/wasm-micro-runtime/core/iwasm/aot/arch/aot_reloc_x86_64.c:87:22: runtime error: store to misaligned address 0x0000406b14ba for type 'uint64', which requires 8 byte alignment
0x0000406b14ba: note: pointer points here
00 00 48 b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
/home/user/wasm-micro-runtime/core/iwasm/aot/arch/aot_reloc_x86_64.c:182:60: runtime error: store to misaligned address 0x0000406ac107 for type 'int32', which requires 4 byte alignment
0x0000406ac107: note: pointer points here
f2 0f 59 05 00 00 00 00 f2 0f 58 c1 c3 0f 1f 40 00 41 57 41 56 41 54 53 50 4c 8b 7f 10 49 8b 87
^
/home/user/wasm-micro-runtime/core/iwasm/aot/arch/aot_reloc_x86_64.c:252:60: runtime error: store to misaligned address 0x0000406ac1ae for type 'int32', which requires 4 byte alignment
0x0000406ac1ae: note: pointer points here
00 00 00 e8 00 00 00 00 31 c0 5b c3 0f b7 c0 5b c3 66 0f 1f 84 00 00 00 00 00 55 41 57 41 56 41
^
/home/user/wasm-micro-runtime/core/iwasm/aot/arch/aot_reloc_x86_64.c:208:60: runtime error: store to misaligned address 0x0000406ae367 for type 'int32', which requires 4 byte alignment
0x0000406ae367: note: pointer points here
3c ff 24 c5 00 00 00 00 45 85 e4 74 0b 83 7c 24 10 00 0f 88 11 25 00 00 48 8b 5c 24 30 f3 41 0f
^
I have not looked into performance differences (they would likely not show on this simple test program). The main reason I worry is that this might invoke undefined behaviour which the optimiser then exploits. Even if x86 supports this fine, what is or isn't UB depends on the so-called "abstract machine" that the C/C++ standards define, and it is that machine which determines what assumptions the compiler is allowed to make when optimizing.
Based on some searching, it seems my worries might be well founded:
And of course, I have my own experience of this being taken advantage of (with G++), where it auto-vectorized code that then failed at runtime because aligned SSE instructions were used.
Hi, since x86-64/x86-32 supports memory access at unaligned addresses, we currently just do it that way in many places of the runtime implementation. It should be normal behaviour: no issue has been found, and it may help reduce the footprint. For the machine code generated by JIT/AOT, if an SSE instruction is involved, we choose the variant that supports memory access at unaligned addresses.
But if changing to aligned memory access may improve performance, we are also glad to support that, for example by adding controls in the interpreter/fast-jit/aot according to the value of WASM_CPU_SUPPORTS_UNALIGNED_ADDR_ACCESS. I am not sure why you got the same number of errors in the interpreter when the macro is 0/1; could you upload the wasm file?
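(A rough sketch of what such a control could look like at a single access site, assuming the WAMR typedefs and bh_memcpy_s are available; illustrative only, not the actual code:)

/* Hypothetical helper: pick the access strategy based on the existing macro. */
static inline uint32
load_uint32(const uint8 *addr)
{
#if WASM_CPU_SUPPORTS_UNALIGNED_ADDR_ACCESS != 0
    return *(uint32 *)addr; /* direct load, relies on unaligned access support */
#else
    uint32 value;
    bh_memcpy_s(&value, sizeof(uint32), addr, sizeof(uint32)); /* alignment-safe */
    return value;
#endif
}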
Hi, since x86-64/x86-32 supports memory access at unaligned addresses, we currently just do it that way in many places of the runtime implementation. It should be normal behaviour: no issue has been found, and it may help reduce the footprint.
I believe you are missing my point here. It may work in practice now. And if you wrote everything in assembly there would be no issues (since the behaviour you are then developing against is that of the actual x86/x86-64 architecture).
However, C/C++ doesn't actually specify behaviour in terms of how the target platform works (this is a common misconception). Instead the standard specifies what compilers can and cannot do in terms of an "abstract machine" defined in the standard. This is what we as C/C++ developers have to conform to when it comes to undefined behaviour.
And the standard says that unaligned access is undefined in terms of this abstract machine. This means that the compiler may do whatever it wants. It might work as you expect, it might not. It might suddenly stop working with a new compiler version, or with different compiler flags. Or even just due to something else changing elsewhere in the program.
Thus, relying on unaligned access working is brittle at best, and I have seen it fail in practice. There are absolutely no guarantees it won't stop working with the next GCC or clang version due to some new optimization. As far as I know, there are only two safe ways to do unaligned reads in C/C++:
The C/C++ compiler ultimately compiles the source code into assembly, and AFAIK most x86 instructions support unaligned memory access except for a few vectorization-related ones, so a failure seems possible only when the compiler optimizes the related operations into vector instructions that require aligned memory access. But I believe our unaligned memory accesses are simple enough that no vectorized operations are involved, the compiler cannot optimize them into instructions which require aligned access, and lots of cases have been tested. We had better keep the current implementation, and if an issue is found with a new C/C++ compiler, we can track and fix it then.
I agree with @arvid-norlander that relying on undefined behavior is risky, and ideally these issues should be addressed by avoiding that behavior. However, since @wenyongh has stated a differing opinion, I've submitted PR #3436 to suppress these issues for the benefit of embedding applications that build with the undefined behavior sanitizer enabled.
except in performance-critical places, it's simpler to fix them, IMO.
Yes, maybe we can define some common functions in core/shared/utils/bh_common.h by using memcpy:
/* Helpers to load/store integers from/to possibly unaligned buffers */

static inline uint16
bh_get_uint16(const uint8 *buf)
{
    uint16 ret;
    bh_memcpy_s(&ret, sizeof(uint16), buf, sizeof(uint16));
    return ret;
}

static inline uint32
bh_get_uint32(const uint8 *buf)
{
    uint32 ret;
    bh_memcpy_s(&ret, sizeof(uint32), buf, sizeof(uint32));
    return ret;
}

static inline uint64
bh_get_uint64(const uint8 *buf)
{
    uint64 ret;
    bh_memcpy_s(&ret, sizeof(uint64), buf, sizeof(uint64));
    return ret;
}

static inline void
bh_set_int16(uint8 *buf, int16 v)
{
    bh_memcpy_s(buf, sizeof(int16), &v, sizeof(int16));
}

static inline void
bh_set_uint32(uint8 *buf, uint32 v)
{
    bh_memcpy_s(buf, sizeof(uint32), &v, sizeof(uint32));
}

static inline void
bh_set_uint64(uint8 *buf, uint64 v)
{
    bh_memcpy_s(buf, sizeof(uint64), &v, sizeof(uint64));
}
And then call them in code like *(uint64 *)p = (uint64)(uintptr_t)target_sym_map[i].symbol_addr; and value = *(intptr_t *)(target_section_addr + (uint32)reloc_offset);
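(For illustration only, the two call sites above might then become something like the following, assuming a 64-bit target where intptr_t is 8 bytes and p is a uint8 pointer:)

/* hypothetical rewrites of the call sites quoted above */
bh_set_uint64(p, (uint64)(uintptr_t)target_sym_map[i].symbol_addr);

value = (intptr_t)bh_get_uint64(target_section_addr + (uint32)reloc_offset);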
bh_set_int16
why is only bh_set_int16 signed? otherwise sounds reasonable to me.
iirc, clang is smart enough to turn this kind of function into a simple unaligned load/store on x86. (at least if you use an ordinary memcpy, not the bh_ wrapper.)
bh_set_int16
why is only bh_set_int16 signed? otherwise sounds reasonable to me.
Sorry for the typo, it should be bh_set_uint16.
iirc, clang is smart enough to turn this kind of function into a simple unaligned load/store on x86. (at least if you use an ordinary memcpy, not the bh_ wrapper.)
Do you mean we should use memcpy instead of bh_memcpy in these functions?
Do you mean we should use memcpy instead of bh_memcpy in these functions?
if we end up using these in a performance-critical path, maybe. otherwise, i guess it doesn't matter.
Do we know which of the places that currently trigger the undefined behavior sanitizer are also performance critical?
Normally the code in the interpreter or the AOT-generated code is performance critical. I think the several functions you modified in PR #3436 should have little impact on performance, and as @yamt suggested, we can use memcpy instead of bh_memcpy_s in these bh_set_xxx/bh_get_xxx APIs to reduce the performance impact.
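(A minimal sketch of such a plain-memcpy variant; with optimization enabled, GCC and clang typically compile this down to a single unaligned mov on x86-64:)

#include <stdint.h>
#include <string.h>

/* Hypothetical plain-memcpy variant of bh_get_uint32. */
static inline uint32_t
get_uint32_unaligned(const uint8_t *buf)
{
    uint32_t ret;
    memcpy(&ret, buf, sizeof(ret)); /* no alignment requirement, no UB */
    return ret;
}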
Building iwasm (or embedding it into a program) with the GCC undefined behaviour sanitizer when Fast JIT is enabled causes the following:
The interpreter also triggers such issues.
I'm using WAMR-1.2.2. I notice that with the embedded build I can trigger this even without the Fast JIT, as long as the fast interpreter is enabled. With iwasm I need the Fast JIT to be enabled, though then I can trigger it for the interpreter as well.
The build command I used to reproduce this with iwasm on Linux was:
I'm using Ubuntu 22.04, GCC 11.3.0-1ubuntu1~22.04.1
Here is the wasm file I used: simple_wasm.wasm.tar.gz
I do believe that this isn't generally a problem on x86/x86-64 (it is on ARM, though), but our company has a zero-ubsan (as well as asan/tsan) issues policy in CI. And even on x86/x86-64, my understanding is that unaligned access is slower than aligned access (plus, for the aligned SSE/AVX instructions the compiler may emit, it is outright unsupported, and the compiler can potentially auto-vectorize based on the alignment assumed from the type).