dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

[Mono][Wasm][Interp] Assertion fail in Marshal.GetFunctionPointerForDelegate #101736

Open Nemo-G opened 6 months ago

Nemo-G commented 6 months ago

Description

I was testing a project using Marshal.GetFunctionPointerForDelegate in both Debug and Release mode. The code fragment causing problem is:

               // public delegate void ForwardingFunc(IntPtr systemPtr, IntPtr state);

#if DIRECT_CALL 
// Result in assertion failure in release mode
                defeatGc = (ulong)GCHandle.ToIntPtr(GCHandle.Alloc(managedFn));
                result = (ulong)Marshal.GetFunctionPointerForDelegate(managedFn);
#else
// Mashal with wrapper func
// Result in NULL orig_method in interp_create_method_pointer
               ForwardingFunc wrapper = (IntPtr system, IntPtr state) =>
                {
                    try
                    {
                        managedFn(system, state);
                    }
                    catch (Exception ex)
                    {
                        Debug.LogException(ex);
                    }
                };
                defeatGc = (ulong)GCHandle.ToIntPtr(GCHandle.Alloc(wrapper));
                result = (ulong)Marshal.GetFunctionPointerForDelegate(wrapper);
#endif

I first tested with wrapper func in Debug mode. It failed to find the native2managed function because orig_method is NULL in https://github.com/dotnet/runtime/blob/ca4f0fe37455882baa00c75b1ef30a7ff1494457/src/mono/mono/mini/interp/interp.c#L3327 (I can understand somehow because both m2n or n2m function in pinvoke-table.h was generated at compile time. Lambda function may not contain necessary info. But is it a bug? )

So I removed the wrapper and marshal managedFn directly. I got no problem in Debug mode. However, I met assertion failure in Release mode: https://github.com/dotnet/runtime/blob/ca4f0fe37455882baa00c75b1ef30a7ff1494457/src/mono/mono/mini/interp/transform.c#L9007 (local_ref_count == 0 at the time) (Normally a fatal error won't give me exact mono trace so that I changed the related runtime code by setting mono_error_set_xxx to continue the program. This helped me to get the managed stacktraces that caused the problem.)

And after that I did some tests by comparing what made the difference between Debug and Release. Eventually I located the interp optimization options because I found the code path in Debug and Release mode was totally different. With "-jiterp,-bblocks" disabled, everything seems fine.

As I'm not an expert in mono, I do not quite understand the optimization logic which makes the debugging full of struggle. Is there any documents regarding the interp optimization logic? Help is needed!

Reproduction Steps

Just by called the managed code(after mono runtime is initialized of course)

Expected behavior

No error

Actual behavior

Assertion failed and program was terminated

Regression?

I've only tested in .NET 8. Not sure if the behaviour is the same in .NET9

Known Workarounds

Disable part of interp_opts. The minimum combination is to disable INTERP_OPT_JITERPRETER and INTERP_OPT_BBLOCKS. It works fine in debug mode which I checked the source code that no interp_opts is enabled.

Configuration

.NET8 Wasm Tested in Chrome/Edge (v8) I haven't tested against Webkit, but I guess this may not be related to browser.

Other information

No response

Nemo-G commented 6 months ago

Add some extra info: Compile dotnet runtime under single thread. Build without SIMD enabled

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @brzvlad, @kotlarmilos See info in area-owners.md if you want to be subscribed.

Nemo-G commented 6 months ago

I'm going to provide more info here. I located the method to marshal and turned on MONO_VERBOSE_METHOD. And I hacked in interp_local_deadce like image

Here's the IL diff containing jiterp&bblocks ON vs OFF. https://www.diffchecker.com/4MXLTBTw/

It seems certain instructions was taken as dead but should be kept? And the mis-removal would cause problem in jiterp optimization?

In fact, with the hacking above. The program would work with only JITERP disabled

@BrzVlad I completely understand that you might be swamped. But would you take a look and provide some insights when you have time? Thanks in advance!

BrzVlad commented 6 months ago

Thank you for taking the time to look into it. From the verbose log, it looks like the ldloca is generated with the incorrect source. The prefix1 instruction from inlined method is most likely a ldarga. All roads point to this being a known issue: https://github.com/dotnet/runtime/pull/97650.

Are you sure you are using the latest net8 release ?

Nemo-G commented 6 months ago

@BrzVlad Thanks for the inputs! We are indeed not working on the latest net8 but a fork with Unity specific changes. I will apply the patch and get you with the results later.

Nemo-G commented 6 months ago

With the fix I can no longer see the BBLOCK optimization errors. Thanks again for the kind help! @BrzVlad

But still the program is not running correctly unless I disable INTERP_OPT_JITERPRETER. I got memory access OOB with callstack image

I checked the method value in mono_runtime_invoke and found the related method is

    static class AutomaticWorldBootstrap
    {
        [RuntimeInitializeOnLoadMethod(RuntimeInitializeLoadType.BeforeSceneLoad)]
        static void Initialize()
        {
            DefaultWorldInitialization.Initialize("Default World", false);
        }
    }

// attr definition
    public enum RuntimeInitializeLoadType
    {
        AfterSceneLoad  = 0,
        BeforeSceneLoad,
        AfterAssembliesLoaded,
        BeforeSplashScreen,
        SubsystemRegistration
    }

    [Scripting.RequiredByNativeCode]
    [System.AttributeUsage(AttributeTargets.Method, AllowMultiple = false)]
    public class RuntimeInitializeOnLoadMethodAttribute : Scripting.PreserveAttribute
    {
        public RuntimeInitializeOnLoadMethodAttribute() { this.loadType = RuntimeInitializeLoadType.AfterSceneLoad; }
        public RuntimeInitializeOnLoadMethodAttribute(RuntimeInitializeLoadType loadType) { this.loadType = loadType; }

        public RuntimeInitializeLoadType loadType
        {
            get { return m_LoadType; } private set { m_LoadType = value; }
        }

        RuntimeInitializeLoadType m_LoadType;
    }

Although the stack evaluation failed in IsUserCattrProvider, I can't make sure if it is the consequence or the root cause. What info can I provide regarding the jiterpreter logic in order to help further debug?

Nemo-G commented 6 months ago

d62ff0ba2487e15b0d62ecadb7471019682b61d0 This is the only commit my branch doesn't have but I think it is not related.

And here is the mono verbose and jiterp stats log before MOOB

MONO_WASM: // jitted 471 bytes; 0 traces (0.0%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 2, fused: 1; back-branches emitted: 0, failed: 1 (0.0%); 
MONO_WASM: // time: 0ms generating, 0ms compiling wasm.
MONO_WASM: // jitted 942 bytes; 1 traces (50.0%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 4, fused: 2; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 3ms generating, 1ms compiling wasm.
dotnet.native.8.0.0.pi5hm4ztm8.js:8 Input Manager initialize...

dotnet.native.8.0.0.pi5hm4ztm8.js:8 UnloadTime: 0.700000 ms

MONO_WASM: // jitted 1267 bytes; 2 traces (66.7%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 4, fused: 2; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 3ms generating, 3ms compiling wasm.
MONO_WASM: // jitted 1573 bytes; 3 traces (75.0%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 5, fused: 2; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 4ms generating, 4ms compiling wasm.
MONO_WASM: // jitted 1974 bytes; 4 traces (80.0%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 5, fused: 4; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 4ms generating, 5ms compiling wasm.
MONO_WASM: // jitted 2321 bytes; 5 traces (83.3%) (0 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 7, fused: 5; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 5ms generating, 7ms compiling wasm.
MONO_WASM: // jitted 2679 bytes; 6 traces (85.7%) (1 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 7, fused: 5; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 5ms generating, 8ms compiling wasm.
MONO_WASM: // jitted 3935 bytes; 7 traces (87.5%) (1 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 8, fused: 17; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 5ms generating, 9ms compiling wasm.
MONO_WASM: // jitted 4543 bytes; 8 traces (88.9%) (1 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 8, fused: 20; back-branches emitted: 0, failed: 2 (0.0%); 
MONO_WASM: // time: 6ms generating, 10ms compiling wasm.
MONO_WASM: // jitted 5000 bytes; 9 traces (90.0%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 10, fused: 21; back-branches emitted: 0, failed: 3 (0.0%); 
MONO_WASM: // time: 7ms generating, 11ms compiling wasm.
MONO_WASM: // jitted 5474 bytes; 10 traces (90.9%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 14, fused: 21; back-branches emitted: 0, failed: 3 (0.0%); 
MONO_WASM: // time: 7ms generating, 13ms compiling wasm.
MONO_WASM: // jitted 6292 bytes; 11 traces (91.7%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 14, fused: 26; back-branches emitted: 0, failed: 3 (0.0%); 
MONO_WASM: // time: 7ms generating, 14ms compiling wasm.
MONO_WASM: // jitted 6705 bytes; 12 traces (92.3%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 14, fused: 27; back-branches emitted: 1, failed: 3 (25.0%); 
MONO_WASM: // time: 8ms generating, 15ms compiling wasm.
MONO_WASM: // jitted 7195 bytes; 13 traces (92.9%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 15, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 9ms generating, 16ms compiling wasm.
MONO_WASM: // jitted 7597 bytes; 14 traces (93.3%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 18, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 9ms generating, 18ms compiling wasm.
MONO_WASM: // jitted 7977 bytes; 15 traces (93.8%) (2 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 21, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 9ms generating, 19ms compiling wasm.
MONO_WASM: // jitted 9076 bytes; 16 traces (88.9%) (3 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 30, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 10ms generating, 20ms compiling wasm.
MONO_WASM: // jitted 11112 bytes; 17 traces (89.5%) (3 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 52, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 10ms generating, 21ms compiling wasm.
MONO_WASM: // jitted 11362 bytes; 18 traces (90.0%) (4 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 55, fused: 28; back-branches emitted: 2, failed: 4 (33.3%); 
MONO_WASM: // time: 11ms generating, 22ms compiling wasm.
MONO_WASM: // jitted 11985 bytes; 19 traces (90.5%) (4 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 60, fused: 29; back-branches emitted: 3, failed: 4 (42.9%); 
MONO_WASM: // time: 11ms generating, 24ms compiling wasm.
Method Unity.Entities.AutomaticWorldBootstrap:Initialize (), optimized 0, original code:
IL_0000: ldstr     "Default World"
IL_0005: ldc.i4.0  
IL_0006: call      0x0a0002a0
IL_000b: pop       
IL_000c: ret       

IL_0000 ldstr     , sp 0,                
IL_0005 ldc.i4.0  , sp 1, O  String      
IL_0006 call      , sp 2, I4             
IL_000b pop       , sp 1, O  World       
IL_000c ret       , sp 0,                
Unoptimized IR:
BB0:
IL_ffffffff: tier_enter_method [nil <- nil],
BB1:
IL_0000: ldstr          [0 <- nil], 0
IL_0005: ldc.i4.0       [1 <- nil],
IL_0006: call           [2 <- c:], Unity.Entities.DefaultWorldInitialization:Initialize (string,bool)
IL_000c: ret.void       [nil <- nil],
Runtime method: Unity.Entities.AutomaticWorldBootstrap:Initialize () 0x6984cc
Locals size 0
Calculated stack height: 2, stated height: 8
IR_0000: tier_enter_method [nil <- nil],
IR_0001: ldstr          [0 <- nil], 0
IR_0004: ldc.i4.0       [8 <- nil],
IR_0006: call           [0 <- 0], Unity.Entities.DefaultWorldInitialization:Initialize (string,bool)
IR_000a: ret.void       [nil <- nil],

SEQ POINT MAP FOR Initialize: 
MONO_WASM: // jitted 12518 bytes; 20 traces (90.9%) (4 rejected); 0 jit_calls; 0 interp_entries
MONO_WASM: // cknulls eliminated: 60, fused: 30; back-branches emitted: 4, failed: 6 (40.0%); 
MONO_WASM: // time: 11ms generating, 25ms compiling wasm.
dotnet.native.8.0.0.pi5hm4ztm8.js:8 [Test] LoadingSystem.OnCreate invoked

dotnet.native.8.0.0.pi5hm4ztm8.js:8 RuntimeError: memory access out of bounds
BrzVlad commented 6 months ago

There have been some serious changes in this area in .net9. It would be worthwhile to try it out. Maybe @kg can provide some help with diagnosing this latest failure

kg commented 6 months ago

As a starting point, you could use MONO_VERBOSE_METHOD to dump IsUserCattrProvider. The jiterpreter should automatically engage its diagnostics for the method and we can examine what it's doing to see if it makes sense.

Nemo-G commented 6 months ago

@BrzVlad Thanks for the suggestion. In fact we started our project investigation with net9 alpha. The only reason we decided to use net8 is that it is officially released at the time. When do you think that net9 can be considered stable? We will definitely come back to integrate with net9 by then.

localhost-1715442485731.log @kg Here's the log with IsUserCattrProvider enabled.

kg commented 6 months ago

Is there a reason why it says dotnet.runtime.8.0.0.90q4mwuy84.js? Are you not able to use the latest version of 8.0? Will still investigate, but you may be missing bug fixes.

kg commented 6 months ago

Also, are you running any third party WASM code in your environment? This crash appears to indicate memory corruption, specifically the reserved low memory region in the WASM heap that's supposed to be all zeroes. If the low memory region is corrupted when the jiterpreter is enabled, that can turn NullReferenceExceptions into 'memory access out of bounds' errors. I refer to this area as the "zero page".

image

If the zero page is being corrupted, you can disable zero page optimizations by turning off the jiterpreter-zero-page-optimization option. See https://github.com/dotnet/runtime/blob/main/docs/design/mono/jiterpreter.md#configuring-the-jiterpreter for information on how to configure the jiterpreter.

You could also disable null check elimination, but in this case I don't think that's related.

If I found the correct implementation of IsUserCattrProvider, this would potentially occur if the obj passed into it is not a Type. The first isinst will produce null, and the second isinst will be type-checking null (which is safe with zero page optimizations as long as the heap isn't corrupted).

You could also examine the bottom of WASM memory in the devtools debugger when this crash happens. It should be all zeroes.

Nemo-G commented 6 months ago

@kg Yes there are third party natives, I'm working on Unity :) My team started to add a lot Unity specific changes on top of NET8 to make it work on WebGL platform without AOT (to keep a lower memory on mobile devices). In fact this issue comes from one of our clients. As we need a stable version and not sure if any recurssion issue may be introduced when applying the latest change of NET8, we just let the net8 stay on a specific version unless there are problems. And only the wasm changes are necessary for us.

I will try the jiterp options first and try to merge the latest NET8 later.

Nemo-G commented 6 months ago

Confirmed that there is no MOOB when jiterpreter-zero-page-optimization is turned off.

kg commented 6 months ago

Confirmed that there is no MOOB when jiterpreter-zero-page-optimization is turned off.

For the interim it should be safe to just run with that option turned off, and the performance penalty is pretty small (something like 1-5% for jiterpreter traces that do specific operations).

Nemo-G commented 6 months ago

For the interim it should be safe to just run with that option turned off, and the performance penalty is pretty small (something like 1-5% for jiterpreter traces that do specific operations).

Will try that temporarily for this specific project. Thanks for the help! But from your analysis, it seems there is an NPE which corrupts the 'zeroPage'. Instead of disabling the zero page opts, is there an 'official way' to solve this problem? I mean in what directions that I can try to analyze further? If there are other solutions, I definitely do not want to disable zero page opts by default for every clients.

kg commented 6 months ago

I do not know of any good way to detect zero page corruption other than frequently verifying the zero page after invoking third party code. The jiterpreter does some basic checks of the zero page in order to identify this kind of corruption, and it will shut off the optimizations if it detects it. But it can't always detect it.

Also, I believe we pass --low-memory-unused to wasm-opt when optimizing the runtime, so the runtime itself may end up containing similar zero page optimizations in its C. I'm not sure when or how wasm-opt exploits this property. I think clang may also have some optimizations of this sort in it. So you may potentially experience related crashes in other code.

Nemo-G commented 6 months ago

You are correct. I just checked the build binlog and found that -O3 --low-memory-unused --zero-filled-memory... were passed to wasm-opt.

As for the 3rd party codes, except for Unity engine native modules, 3rd party native libs are allowed to be involved in the build process. If this crash can be triggered by any 3rd party native(wasm), it seems disabling the zero page optimization would be the only way to ensure the game won't crash.