Closed kpreisser closed 1 year ago
Thanks for this! I've read over the source and I don't have a Windows machine on-hand to reproduce with but I've been trying to debug by correlating what's happening in the compiled artifact using the addresses from the backtrace you have. Unfortunately though the image I'm compiling locally seems to have functions at different offsets so I can't get things to line up.
Reading over the code though it looks like this has threading involved, possible GC interactions with .NET, and additionally the reproduction you mention is nondeterministic. @peterhuene would you be able to take a look at this? I've encountered similar-ish issues with the Go bindings historically and it was an often an error in the bindings and how the bindings interacted with the GC so I'm curious if anything jumps out at you here.
Hi @alexcrichton thanks for your reply!
That's a good point about the GC, which also applies to the repro project. I modified the code of the repro project to ensure the Wasmtime objects (Config
, Engine
, Module
, Linker
, Store
) are kept alive while the wasm function is executed (like it is also the case in our real application), so that there shouldn't be any possibility for the GC to prematurely call the finalizer of those objects (or their referenced SafeHandle
s) when it detects that the references are no longer in use.
(Sometimes, the .NET runtime can even consider the this
pointer within an instance method (while it's still executing) to no longer be alive when this
isn't used in that method.)
The access violation still reproduces for me with these changes. I also tried putting break points in the various Handle.ReleaseHandle()
methods in Wasmtime.Dotnet
for the above mentioned objects (which call wasm_config_delete
, wasm_engine_delete
, wasmtime_module_delete
, wasmtime_linker_delete
, and wasmtime_store_delete
), and it seems they are not called when the access violation occurs; only when it doesn't occur they are getting called as expected.
Thanks!
@kpreisser thanks for the report and the additional details. I'll investigate this immediately and follow up when I have new information to share.
The AV is a result of a wasm frame overflowing the stack, where the frame size exceeds the guard page size (thus it tries to write to unmapped memory). It is passing the overflow check in the prologue.
I haven't looked too closely at the cranelift issue, but I suspect it's related (digging into it now).
Oh! I noticed the stack size increase which now that I think about it I think that may be larger than the default 1MB stack size on Windows?
The stack safety in Wasmtime relies on the embedder ensuring there's at least that much wasm stack available so if the new thread doesn't have 2MB+ of wasm stack then I think this could happen.
It repros with the default wasm stack size too (that is indeed a misconfiguration without also increasing the stack size when creating the thread).
Ah ok makes sense!
This is perhaps a case where Cranelift could help out where we could insert stack probes in addition to the stack check we have today where the probes could guarantee a segfault to deterministically exit. I don't think we'll want to catch this as a stack overflow trap but reporting this with a better error message I think could be useful.
I stand corrected again. It is still reproing with the default wasm stack size; the AV is inconsistent (probably getting some mapped writable memory beyond the guard page on some runs).
I'm digging into why it's not triggering the overflow trap.
Cranelift has an enable_probestack
option, along with a couple choices for probestack_strategy
and a knob for probestack_size_log2
. Wasmtime enforces that these options are disabled though, in check_compatible_with_shared_flag
.
Thank you for the investigation and updates!
Regarding the stack size, note that the access violation also occurs if I set the thread stack size to 8 MiB (8388608
), e.g. in the Thread
constructor, or by using editbin /stack:xyz
to set the default thread stack size (the editbin
command is actually what we do in our real application, where the access violation also occurs).
So I would think the 2 MiB stack size in Wasmtime should be fine when the thread's stack size is 8 MiB? (I have updated the repro project to set the thread's stack size to 8 MiB; I previously forgot this there.)
Thanks!
On Windows we need stackprobes anyways to grow the stack. As I understand it on Windows only a small part is committed by default and directly below it is a guard page which if accessed will cause the stack to grow a bit. If you access even a single page below the guard page you will get an AV even though it is still in the part of the memory reserved for the stack. cg_clif also had access violations on windows in some cases which it fixed by enabling inline stack probes. cc https://github.com/bjorn3/rustc_codegen_cranelift/issues/1261#issuecomment-1219307630
@bjorn3 thanks for the reminder as I've willfully purged my Windows knowledge these past few years!
This is indeed what is happening here; the frame is larger than the commit size in the PE header (4KiB), so it's not triggering a fault on the guard page for Windows to commit more stack pages.
We should definitely be forcing a probe on Windows for any frame larger than 4 KiB at a minimum.
Also, I will note that the default thread stack size for a 64-bit .NET core apphost is 1.5 MiB and at the fault there was still plenty of wasm stack space left (hence why it succeeded the prologue overflow check), it just didn't probe to get Windows to commit more pages.
Thank you! When I modify the repro to allocate (and immediately free) 6 MiB on the stack in the thread when it is started, e.g. with the following code (note that .NET by default initializes stack memory to zero):
private void RunWasmInitializerThread()
{
void AllocStack()
{
Span<byte> bytes = stackalloc byte[6 * 1024 * 1024];
}
AllocStack();
// ...
then I'm indeed no longer able to reproduce the access violation.
FYI, I believe function 311 in the module to be the faulting function as it has thousands of locals (lol: wasm-tools print
prints to column 150221 for the locals declaration)!
Here's a simplified wat that reproduces the problem. It has a function with a frame size of 80192 bytes and that consistently gets wasmtime repro.wat
to crash:
C:\Users\User\src\wasmtime>target\release\wasmtime.exe repro.wat
C:\Users\User\src\wasmtime>echo %ERRORLEVEL%
-1073741819
By eliding the host compatibility check in the engine and enabling stack probing:
C:\Users\User\src\wasmtime>target\release\wasmtime.exe --cranelift-enable enable_probestack --cranelift-set probestack_strategy=inline repro.wat
warning: using `--invoke` with a function that returns values is experimental and may break in the future
0
I think we can confidently say that it's due to a missing probe for these large frame allocations.
The stack safety in Wasmtime relies on the embedder ensuring there's at least that much wasm stack available so if the new thread doesn't have 2MB+ of wasm stack then I think this could happen.
If I understand this correctly, this could be a problem in our application (maybe also Linux) since we don't know how many remaining space is available in the stack when a wasm function is called (because we might call them from events (callbacks) raised e.g. by external plugins or user script code, where an arbitrary number of frames may already be on the stack), so we can't ensure that there is at least the specified max wasm stack size available in the current thread's stack.
Does this mean this would generally require stack probing to be enabled in Cranelift also on non-Windows OSes to be on the safe side (to ensure large wasm frames don't accidentally access memory that doesn't belong to the current thread's stack)? Thanks!
The bug here for Wasmtime is that on Windows we need to always enable stack probes since it's part of how the system works and we can't escape from that. Stack probes are optional for other hosts but can be useful for detecting situations where not enough native stack was allocated by the host.
For designing an embedding what you'll generally want to do is determine some watermark for how much stack host code will consume (both before wasm and as part of a host call called by wasm) and how much stack wasm can consume. You'll then want to make sure that threads executing wasm have access to the sum of those two numbers. If you get the numbers wrong then the application should be guaranteed to safely abort with stack probes enabled, but without stack probes wasm runs a risk of jumping over the guard page.
The bug here for Wasmtime is that on Windows we need to always enable stack probes since it's part of how the system works and we can't escape from that.
👍
For designing an embedding what you'll generally want to do is determine some watermark for how much stack host code will consume (both before wasm and as part of a host call called by wasm) and how much stack wasm can consume.
Ok, thank you! However, as described we don't really have a way to always guarantee the host doesn't use more than a specific amount of stack, because we don't have complete control over the host code/stack. I gather this means that we will need to enable stack probing also on other OSes in our application, to ensure wasm frames don't have a risk of jumping over the guard page, so that the application is safely aborted in such a case. Is there a way to enable stack probing in the Wasmtime C API? Thanks!
Not yet, no, and you can't do that today due to the issue @jameysharp mentioned above. @peterhuene were you going to work on a patch to enable that?
Personally I think it's reasonable to unconditionally enable this for all supported platforms since stack probes shouldn't have much of a perf impact anyway. This is only implemented for x86_64 though and other platforms don't have inline stack probes implemented yet and outlined probing support would need more changes in Wasmtime we may not yet be ready for.
Yes, I'll create a patch to enable it unconditionally (for x86_64).
I've put #5350 up and verified that it addresses the crash you're seeing with your module.
Thanks a lot for fixing this!
Regarding the stack probing, for us it would be nice to also have inline probing for AArch64 (#4846).
For example, on Windows Arm64 (when compiling Wasmtime for aarch64-pc-windows-msvc
from #5350), the repro .wat
still fails with the access violation; and also on Linux AArch64, this will help us to ensure we don't get a security problem when a wasm frame jumps over the guard page due to the stack being nearly exhausted.
Indeed! I've opened https://github.com/bytecodealliance/wasmtime/pull/5353 to add support for aarch64
Hi,
we are using
Wasmtime
viawasmtime-dotnet
in a .NET 6.0 application, where we load a 8 MB WASM file and execute its_start
function. However, this causes an access violation on Windows inwasmtime_func_call
when using Wasmtime from commit ff5abfd9938c78cd75c6c5d6a41565097128c5d3.The crash started to happen toegher with issue #5328 (with commit fc62d4ad65892837a1e1af5a9d67c8b0d4310efe), which is why I also reported it there, but it could be that this was just a trigger and the actual root case is a different one. It could also be that once #5328 is fixed, the access violation is no longer reproducible.
I can reproduce the access violation with the following steps on Windows 10 x64 Build 19044, and on Windows Server 2019:
x86_64-pc-windows-msvc
) as release build:cargo build --release --manifest-path crates/c-api/Cargo.toml
repro-wasmtime-access-violation
, which contains a commit that adds the repro project on top of the currentwasmtime-dotnet
code. (The code contains a lot of fields/variables and some methods that are not used, but that was as best as I could get it while still reproducing the crash.)cd
into the repo'sReproWasmtimeAccessViolation
folder, then rundotnet build ReproWasmtimeAccessViolation.csproj
.cd bin\Debug\net6.0
, then copywasmtime.dll
andwasmtime.pdb
from the build result into this directory.dotnet-codabix-wasm-build.wasm
also into this directory.ReproWasmtimeAccessViolation.exe
.After some seconds, the program will exit if the access violation occured (and in Windows Error Reporting, there will be an error entry in the Application log ("Application Error"). If it didn't occur, the program will print
"wasmtime_func_call completed! This should not be displayed if the Access Violation occured."
, and you can useCtrl
+C
to exit.Unfortunately, the crash doesn't alway occur, it might also depend on the hardware or other factors. You may have to try a few times if the crash doesn't occur, or try again at a later time.
When attaching Visual Studio 2022 as debugger (using "native code" mode), it will show the following:
Call Stack:
Additional notes:
Environments where I can reproduce the crash:
Thank you!