Access Violation in .NET application (with `wasmtime-dotnet`) when using current Wasmtime build

kpreisser commented 1 year ago

Hi,

we are using Wasmtime via wasmtime-dotnet in a .NET 6.0 application, where we load a 8 MB WASM file and execute its _start function. However, this causes an access violation on Windows in wasmtime_func_call when using Wasmtime from commit ff5abfd9938c78cd75c6c5d6a41565097128c5d3.

The crash started to happen toegher with issue #5328 (with commit fc62d4ad65892837a1e1af5a9d67c8b0d4310efe), which is why I also reported it there, but it could be that this was just a trigger and the actual root case is a different one. It could also be that once #5328 is fixed, the access violation is no longer reproducible.

I can reproduce the access violation with the following steps on Windows 10 x64 Build 19044, and on Windows Server 2019:

Build the Wasmtime C API from commit ff5abfd9938c78cd75c6c5d6a41565097128c5d3 on Windows for the current host (x86_64-pc-windows-msvc) as release build: cargo build --release --manifest-path crates/c-api/Cargo.toml
Install the .NET SDK 6.0.403 for Windows x64, which includes the .NET 6.0.11 runtime.
Clone https://github.com/kpreisser/wasmtime-dotnet, and switch to branch repro-wasmtime-access-violation, which contains a commit that adds the repro project on top of the current wasmtime-dotnet code. (The code contains a lot of fields/variables and some methods that are not used, but that was as best as I could get it while still reproducing the crash.)
cd into the repo's ReproWasmtimeAccessViolation folder, then run dotnet build ReproWasmtimeAccessViolation.csproj.
cd bin\Debug\net6.0, then copy wasmtime.dll and wasmtime.pdb from the build result into this directory.
Download the WASM file as .Zip file here (note: rename the .mov file to .zip ; I renamed it to be able to upload it here): https://user-images.githubusercontent.com/13289184/203989968-4639d6d5-15c9-4e57-bf01-c2eccd756642.mov, and copy the file dotnet-codabix-wasm-build.wasm also into this directory.
Run ReproWasmtimeAccessViolation.exe.

After some seconds, the program will exit if the access violation occured (and in Windows Error Reporting, there will be an error entry in the Application log ("Application Error"). If it didn't occur, the program will print "wasmtime_func_call completed! This should not be displayed if the Access Violation occured.", and you can use Ctrl+C to exit.

Unfortunately, the crash doesn't alway occur, it might also depend on the hardware or other factors. You may have to try a few times if the crash doesn't occur, or try again at a later time.

When attaching Visual Studio 2022 as debugger (using "native code" mode), it will show the following:

Exception thrown at 0x000002BB0392BD19 in ReproWasmtimeAccessViolation.exe: 0xC0000005: Access violation writing location 0x000000CA214E4E80.

Call Stack:

>   000002bb0392bd19()  Unknown
    000002bb0392b8c4()  Unknown
    000002bb0429bb8c()  Unknown
    000002bb03e5b2b2()  Unknown
    000002bb03e616c0()  Unknown
    000002bb03e5d4a3()  Unknown
    000002bb03a369dc()  Unknown
    000002bb03951aaf()  Unknown
    000002bb0392b8c4()  Unknown
    000002bb0429bb8c()  Unknown
    000002bb03e5b2b2()  Unknown
    000002bb03e616c0()  Unknown
    000002bb03e5d4a3()  Unknown
    000002bb038caa53()  Unknown
    000002bb03a2c701()  Unknown
    000002bb039f78bc()  Unknown
    000002bb0392b8c4()  Unknown
    000002bb0429bb8c()  Unknown
    000002bb03e5b2b2()  Unknown
    000002bb03e616c0()  Unknown
    000002bb03e5d4a3()  Unknown
    000002bb038caa53()  Unknown
    000002bb03a2c701()  Unknown
    000002bb039f78bc()  Unknown
    000002bb0392b8c4()  Unknown
    000002bb0429bb8c()  Unknown
    000002bb03e5b2b2()  Unknown
    000002bb03e5a965()  Unknown
    000002bb03ac8ee0()  Unknown
    000002bb03ac3058()  Unknown
    000002bb04294c00()  Unknown
    000002bb042df4c4()  Unknown
    000002bb03aaa51f()  Unknown
    000002bb0382174a()  Unknown
    000002bb038210d4()  Unknown
    000002bb044cf7bc()  Unknown
    000002bb044cfdbb()  Unknown
    wasmtime.dll!wasmtime_setjmp() C
    wasmtime.dll!_ZN16wasmtime_runtime12traphandlers84_$LT$impl$u20$wasmtime_runtime..traphandlers..call_thread_state..CallThreadState$GT$4with17h74c010e2e721401eE()  Unknown
    wasmtime.dll!_ZN16wasmtime_runtime12traphandlers11catch_traps17h7bef31bead668a92E()    Unknown
    wasmtime.dll!_ZN8wasmtime4func4Func18call_unchecked_raw17hcb0b0b7982c6fc66E.llvm.11183561978299839151()    Unknown
    wasmtime.dll!_ZN8wasmtime4func4Func9call_impl17hcb50041b4478fa06E.llvm.11183561978299839151()  Unknown
    wasmtime.dll!wasmtime_func_call()  Unknown
    00007ffd5615ce96()  Unknown
    00007ffd5615cc68()  Unknown
    00007ffd5615c826()  Unknown
    00007ffd5615c66b()  Unknown
    00007ffd5615a56b()  Unknown
    00007ffd5615a0ac()  Unknown
    00007ffd55f3fd86()  Unknown
    00007ffd55f3ef8b()  Unknown
    System.Private.CoreLib.dll!00007ffd3609274f()   Unknown
    coreclr.dll!00007ffdb5aaaac3()  Unknown
    coreclr.dll!00007ffdb59a6d9c()  Unknown
    coreclr.dll!00007ffdb5a8c413()  Unknown
    coreclr.dll!00007ffdb59836f5()  Unknown
    coreclr.dll!00007ffdb59835fa()  Unknown
    coreclr.dll!00007ffdb5983419()  Unknown
    kernel32.dll!00007ffdeafe74b4() Unknown
    ntdll.dll!00007ffdeb2c26a1()    Unknown

Additional notes:

On my system, the crash reproduces in about 50% of the cases.
I can't seem to reproduce the crash when using Wasmtime debug build.

Environments where I can reproduce the crash:

Environment 1:
- Hardware: Apple Mac mini (2018), with Parallels Desktop to run the Windows VM
- CPU: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz 3.19 GHz
- VM (Parallels Desktop on macOS): Windows 10 Pro x64 Build 19044 (in a VM)
Environment 2:
- Hardware: Intel NUC8i7BEH
- CPU: Intel Core i7-8559U
- VM (VMware Player 16 on Window 10 Build 19044): Windows Server 2019 Build 17763

Thank you!

alexcrichton commented 1 year ago

Thanks for this! I've read over the source and I don't have a Windows machine on-hand to reproduce with but I've been trying to debug by correlating what's happening in the compiled artifact using the addresses from the backtrace you have. Unfortunately though the image I'm compiling locally seems to have functions at different offsets so I can't get things to line up.

Reading over the code though it looks like this has threading involved, possible GC interactions with .NET, and additionally the reproduction you mention is nondeterministic. @peterhuene would you be able to take a look at this? I've encountered similar-ish issues with the Go bindings historically and it was an often an error in the bindings and how the bindings interacted with the GC so I'm curious if anything jumps out at you here.

kpreisser commented 1 year ago

Hi @alexcrichton thanks for your reply!

That's a good point about the GC, which also applies to the repro project. I modified the code of the repro project to ensure the Wasmtime objects (Config, Engine, Module, Linker, Store) are kept alive while the wasm function is executed (like it is also the case in our real application), so that there shouldn't be any possibility for the GC to prematurely call the finalizer of those objects (or their referenced SafeHandles) when it detects that the references are no longer in use. (Sometimes, the .NET runtime can even consider the this pointer within an instance method (while it's still executing) to no longer be alive when this isn't used in that method.)

The access violation still reproduces for me with these changes. I also tried putting break points in the various Handle.ReleaseHandle() methods in Wasmtime.Dotnet for the above mentioned objects (which call wasm_config_delete, wasm_engine_delete, wasmtime_module_delete, wasmtime_linker_delete, and wasmtime_store_delete), and it seems they are not called when the access violation occurs; only when it doesn't occur they are getting called as expected.

Thanks!

peterhuene commented 1 year ago

@kpreisser thanks for the report and the additional details. I'll investigate this immediately and follow up when I have new information to share.

peterhuene commented 1 year ago

The AV is a result of a wasm frame overflowing the stack, where the frame size exceeds the guard page size (thus it tries to write to unmapped memory). It is passing the overflow check in the prologue.

I haven't looked too closely at the cranelift issue, but I suspect it's related (digging into it now).

alexcrichton commented 1 year ago

Oh! I noticed the stack size increase which now that I think about it I think that may be larger than the default 1MB stack size on Windows?

The stack safety in Wasmtime relies on the embedder ensuring there's at least that much wasm stack available so if the new thread doesn't have 2MB+ of wasm stack then I think this could happen.

peterhuene commented 1 year ago

It repros with the default wasm stack size too (that is indeed a misconfiguration without also increasing the stack size when creating the thread).

alexcrichton commented 1 year ago

Ah ok makes sense!

This is perhaps a case where Cranelift could help out where we could insert stack probes in addition to the stack check we have today where the probes could guarantee a segfault to deterministically exit. I don't think we'll want to catch this as a stack overflow trap but reporting this with a better error message I think could be useful.

peterhuene commented 1 year ago

I stand corrected again. It is still reproing with the default wasm stack size; the AV is inconsistent (probably getting some mapped writable memory beyond the guard page on some runs).

I'm digging into why it's not triggering the overflow trap.

jameysharp commented 1 year ago

Cranelift has an enable_probestack option, along with a couple choices for probestack_strategy and a knob for probestack_size_log2. Wasmtime enforces that these options are disabled though, in check_compatible_with_shared_flag.

kpreisser commented 1 year ago

Thank you for the investigation and updates!

Regarding the stack size, note that the access violation also occurs if I set the thread stack size to 8 MiB (8388608), e.g. in the Thread constructor, or by using editbin /stack:xyz to set the default thread stack size (the editbin command is actually what we do in our real application, where the access violation also occurs). So I would think the 2 MiB stack size in Wasmtime should be fine when the thread's stack size is 8 MiB? (I have updated the repro project to set the thread's stack size to 8 MiB; I previously forgot this there.)

Thanks!

bjorn3 commented 1 year ago

On Windows we need stackprobes anyways to grow the stack. As I understand it on Windows only a small part is committed by default and directly below it is a guard page which if accessed will cause the stack to grow a bit. If you access even a single page below the guard page you will get an AV even though it is still in the part of the memory reserved for the stack. cg_clif also had access violations on windows in some cases which it fixed by enabling inline stack probes. cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/1261#issuecomment-1219307630

peterhuene commented 1 year ago

@bjorn3 thanks for the reminder as I've willfully purged my Windows knowledge these past few years!

This is indeed what is happening here; the frame is larger than the commit size in the PE header (4KiB), so it's not triggering a fault on the guard page for Windows to commit more stack pages.

We should definitely be forcing a probe on Windows for any frame larger than 4 KiB at a minimum.

peterhuene commented 1 year ago

Also, I will note that the default thread stack size for a 64-bit .NET core apphost is 1.5 MiB and at the fault there was still plenty of wasm stack space left (hence why it succeeded the prologue overflow check), it just didn't probe to get Windows to commit more pages.

kpreisser commented 1 year ago

Thank you! When I modify the repro to allocate (and immediately free) 6 MiB on the stack in the thread when it is started, e.g. with the following code (note that .NET by default initializes stack memory to zero):

        private void RunWasmInitializerThread()
        {
            void AllocStack()
            {
                Span<byte> bytes = stackalloc byte[6 * 1024 * 1024];
            }
            AllocStack();
            // ...

then I'm indeed no longer able to reproduce the access violation.

peterhuene commented 1 year ago

FYI, I believe function 311 in the module to be the faulting function as it has thousands of locals (lol: wasm-tools print prints to column 150221 for the locals declaration)!

peterhuene commented 1 year ago

Here's a simplified wat that reproduces the problem. It has a function with a frame size of 80192 bytes and that consistently gets wasmtime repro.wat to crash:

C:\Users\User\src\wasmtime>target\release\wasmtime.exe repro.wat

C:\Users\User\src\wasmtime>echo %ERRORLEVEL%
-1073741819

By eliding the host compatibility check in the engine and enabling stack probing:

C:\Users\User\src\wasmtime>target\release\wasmtime.exe --cranelift-enable enable_probestack --cranelift-set probestack_strategy=inline repro.wat
warning: using `--invoke` with a function that returns values is experimental and may break in the future
0

I think we can confidently say that it's due to a missing probe for these large frame allocations.

kpreisser commented 1 year ago

The stack safety in Wasmtime relies on the embedder ensuring there's at least that much wasm stack available so if the new thread doesn't have 2MB+ of wasm stack then I think this could happen.

If I understand this correctly, this could be a problem in our application (maybe also Linux) since we don't know how many remaining space is available in the stack when a wasm function is called (because we might call them from events (callbacks) raised e.g. by external plugins or user script code, where an arbitrary number of frames may already be on the stack), so we can't ensure that there is at least the specified max wasm stack size available in the current thread's stack.

Does this mean this would generally require stack probing to be enabled in Cranelift also on non-Windows OSes to be on the safe side (to ensure large wasm frames don't accidentally access memory that doesn't belong to the current thread's stack)? Thanks!

alexcrichton commented 1 year ago

The bug here for Wasmtime is that on Windows we need to always enable stack probes since it's part of how the system works and we can't escape from that. Stack probes are optional for other hosts but can be useful for detecting situations where not enough native stack was allocated by the host.

For designing an embedding what you'll generally want to do is determine some watermark for how much stack host code will consume (both before wasm and as part of a host call called by wasm) and how much stack wasm can consume. You'll then want to make sure that threads executing wasm have access to the sum of those two numbers. If you get the numbers wrong then the application should be guaranteed to safely abort with stack probes enabled, but without stack probes wasm runs a risk of jumping over the guard page.

kpreisser commented 1 year ago

The bug here for Wasmtime is that on Windows we need to always enable stack probes since it's part of how the system works and we can't escape from that.

👍

For designing an embedding what you'll generally want to do is determine some watermark for how much stack host code will consume (both before wasm and as part of a host call called by wasm) and how much stack wasm can consume.

Ok, thank you! However, as described we don't really have a way to always guarantee the host doesn't use more than a specific amount of stack, because we don't have complete control over the host code/stack. I gather this means that we will need to enable stack probing also on other OSes in our application, to ensure wasm frames don't have a risk of jumping over the guard page, so that the application is safely aborted in such a case. Is there a way to enable stack probing in the Wasmtime C API? Thanks!

alexcrichton commented 1 year ago

Not yet, no, and you can't do that today due to the issue @jameysharp mentioned above. @peterhuene were you going to work on a patch to enable that?

Personally I think it's reasonable to unconditionally enable this for all supported platforms since stack probes shouldn't have much of a perf impact anyway. This is only implemented for x86_64 though and other platforms don't have inline stack probes implemented yet and outlined probing support would need more changes in Wasmtime we may not yet be ready for.

peterhuene commented 1 year ago

Yes, I'll create a patch to enable it unconditionally (for x86_64).

peterhuene commented 1 year ago

I've put #5350 up and verified that it addresses the crash you're seeing with your module.

kpreisser commented 1 year ago

Thanks a lot for fixing this!

Regarding the stack probing, for us it would be nice to also have inline probing for AArch64 (#4846). For example, on Windows Arm64 (when compiling Wasmtime for aarch64-pc-windows-msvc from #5350), the repro .wat still fails with the access violation; and also on Linux AArch64, this will help us to ensure we don't get a security problem when a wasm frame jumps over the guard page due to the stack being nearly exhausted.

alexcrichton commented 1 year ago

Indeed! I've opened https://github.com/bytecodealliance/wasmtime/pull/5353 to add support for aarch64

bytecodealliance / wasmtime

Access Violation in .NET application (with `wasmtime-dotnet`) when using current Wasmtime build #5340