dotnet / runtimelab

This repo is for experimentation and exploring new ideas that may or may not make it into the main dotnet/runtime repo.
MIT License
1.43k stars 200 forks source link

Occasional SEHException in ObjWriter #1316

Closed kant2002 closed 10 months ago

kant2002 commented 3 years ago

I'm occasionally hit exceptions in ObjWriter. By occasion I mean maybe after update to nightlies, or after I do some modifications in application.

EXEC : error : External component has thrown an exception. 
  System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
     at ILCompiler.DependencyAnalysis.ObjectWriter.FinishObjWriter(IntPtr objWriter)
     at ILCompiler.DependencyAnalysis.ObjectWriter.Dispose(Boolean bDisposing) in ILCompiler.Compiler.dll:token 0x6000dd5+0x12
     at ILCompiler.DependencyAnalysis.ObjectWriter.EmitObject(String objectFilePath, IEnumerable`1 nodes, NodeFactory factory, ObjectWritingOptions options, IObjectDumper dumper) in ILCompiler.Compiler.dll:token 0x6000dda+0x367
     at ILCompiler.RyuJitCompilation.CompileInternal(String outputFile, ObjectDumper dumper) in ILCompiler.RyuJit.dll:token 0x600022b+0x35
     at ILCompiler.Compilation.ILCompiler.ICompilation.Compile(String outputFile, ObjectDumper dumper) in ILCompiler.Compiler.dll:token 0x6000322+0x9
     at ILCompiler.Program.Run(String[] args) in ilc.dll:token 0x600010d+0xc7e
     at ILCompiler.Program.Main(String[] args) in ilc.dll:token 0x6000113+0x5

or

 Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
  Repeat 2 times:
  --------------------------------
     at ILCompiler.DependencyAnalysis.ObjectWriter.FinishObjWriter(IntPtr)
  --------------------------------
     at ILCompiler.DependencyAnalysis.ObjectWriter.Dispose(Boolean)
     at ILCompiler.DependencyAnalysis.ObjectWriter.EmitObject(System.String, System.Collections.Generic.IEnumerable`1<ILCompiler.DependencyAnalysisFramework.DependencyNode>, ILCompiler.DependencyAnalysis.NodeFactory, ILCompiler.DependencyAnalysis.ObjectWritingOptions, ILCompiler.DependencyAnalysis.IObjectDumper)
     at ILCompiler.RyuJitCompilation.CompileInternal(System.String, ILCompiler.ObjectDumper)
     at ILCompiler.Compilation.ILCompiler.ICompilation.Compile(System.String, ILCompiler.ObjectDumper)
     at ILCompiler.Program.Run(System.String[])
     at ILCompiler.Program.Main(System.String[])

My problem when I hit this error, I have no idea what's going on, and how to intercept SEH to at least understand what method is offending codegen. What is worrying me is that this two errors appear on same inputs. And prior that, I have an issue and on next day it solve itself.

Questions:

jkotas commented 3 years ago

The way to debug crash like this are dumps: Enable full dump collection for ilc.exe in registry https://docs.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps , then look into the crash dump using windbg

kant2002 commented 3 years ago

Sigh. And thanks!

AntonLapounov commented 3 years ago

I would run ILC in a loop overnight waiting for a crash.

kant2002 commented 3 years ago

Given that I hit that again (it start working today) I think that's a good idea.

kant2002 commented 3 years ago

The way to debug crash like this are dumps:

Does not work. Seems to be it is intercepted in C# and does not trigger crash dump collection. Strangely on application it start working third time. Seems to be ancient gods still works. Any ideas what I can do?

jkotas commented 3 years ago

Seems to be it is intercepted in C# and does not trigger crash dump collection.

AccessViolationException should trigger the dump as expected. I think it is only a problem for SEHException on Windows. You can wrap the call FinishObjWriter with exception filter that calls FailFast that triggers the dump capture, like this:

try
{
    FinishObjWriter(_nativeObjectWriter);
}
catch (Exception e) when (UnexpectedExceptionFilter(e))
{
}
...
static bool UnexpectedExceptionFilter(Exception e)
{
    Environment.FailFast("Fatal error", e);
    return false;
}

If it works, you can submit a PR that adds this code permanently. It is not the first time we are trying to chase down an intermittent crash in FinishObjWriter.

kant2002 commented 2 years ago

I seems to be able catch this in WinDBG

Output in console

LLVM ERROR: out of memory
Allocation failed

Last lines in WinDBG which related

(5d04.3d60): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0x7 FAST_FAIL_FATAL_APP_EXIT 
ChangeEngineState
*** WARNING: Unable to verify checksum for ...\nuget\packages\runtime.win-x64.microsoft.dotnet.ilcompiler\7.0.0-alpha.1.21572.3\tools\objwriter.DLL
objwriter!SwitchSection+0x13b04d:
00007ffa`7b67256d cd29            int     29h

Seems to be I did not use debug builds, or something other stupid mistakes which I make. image

kant2002 commented 2 years ago

Is there location where I can grab PDB files, so I can take a look at the stack trace more closely.

jkotas commented 2 years ago

Sigh, we do not publish the PDB packages due to this workaround: https://github.com/dotnet/runtimelab/blob/feature/NativeAOT/src/installer/pkg/projects/Microsoft.DotNet.ILCompiler/Microsoft.DotNet.ILCompiler.pkgproj#L12

We will need to fix this workaround at some point. In the meantime, you can use locally built objwriter to investigate this crash.

kant2002 commented 2 years ago

I close Visual Studio and Edge and run under WinDBG with Debug ObjWriter.dll

ILC stuck within ObjWriter::Finish

Technical parameters which indicate size of data and performance characteristics of the code.

MCContext::Sections = { size= 186_791 }
MCContext::Symbols = { size=1_652_570 }

Inside MCContext::reset() COFFAllocator.DestroyAll(); waiting for more then 10 minutes to cleanup. I suspect this is infinite loop in (because it's sitting here)

  iterator iplist_impl::erase(iterator first, iterator last) {
    while (first != last)
      first = erase(first);
    return last;
  }

Also StringTableBuilder::multikeySort slow on 2M strings, like couple minuts to sort these strings.

Data from managed side: ILCompiler.CompilerTypeSystemContext

_arrayTypes.Count == 4_926
_validType.Count = 113_380

Questions:

  1. Is number of sections and symblols look extremely large, or just large?
  2. Can we issue warning from compiler before start compilation to binary, something like that You have too many types generated, you may experience slow compilation times
kant2002 commented 2 years ago

That's not infinite loop. After maybe 30 minutes, or even an hour, COFFAllocator.DestroyAll() do it's work. Second place where I wait long amount of time is CVContext.reset(); (more then hour as of now), all inside MCContext::reset().

kant2002 commented 2 years ago

Let me check with release. Maybe SEHException which I was seeing was OOM all the time.

kant2002 commented 2 years ago

I still did not able to test with Release. But I manage finish application under Debug mode. Same MCContext::reset() two more problematic friends.

COFFUniquingMap.clear(); // 2 hours in debug.
NextID.clear(); // at least 3 hours, then after that finish overnight.

In short cleanup take at least 7 hours.

Below list of slow to deallocate types declarations. I record that in case this should go to LLVM and perf tested there.

std::map<COFFSectionKey, MCSectionCOFF *> COFFUniquingMap

https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/include/llvm/MC/MCContext.h#L327

StringMap NextID

https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/include/llvm/MC/MCContext.h#L145

SpecificBumpPtrAllocator COFFAllocator

https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/include/llvm/MC/MCContext.h#L115

_std::uniqueptr CVContext

https://github.com/llvm/llvm-project/blob/e356027016c6365b3d8924f54c33e2c63d931492/llvm/include/llvm/MC/MCContext.h#L107

kant2002 commented 2 years ago

In addition to the fact that obj file produced has size 530Mb, when run linker it produce following error

VideoMonitoring.obj : error LNK2001: unresolved external symbol RhpGetThreadStaticBaseForType

Is this some mismatch between 7.0.0-alpha.1.21572.3 version and version with Custom marshalers (which was created a bit earlier)?

jkotas commented 2 years ago

RhpGetThreadStaticBaseForType

Yes, this was added recently. If you are seeing this link error, it means you have mismatched bits.

kant2002 commented 2 years ago

Okay. Seems to be in my case issue happens when I hit OOM. I reproduce issue with Release version of ObjWriter.dll If I run under WinDBG whole process is very slow. Couple hours at least. If I run without debugger, it's 4 minutes.

Still because it take 9Gb to run, I do not hit OOM consistently that's why I did not reflect that. Not sure if this is actionable. ot sure, if there other issue hidden somewhere.

Is there something which can be done about it?

kant2002 commented 2 years ago

I'm not sure that this is related, but seems pretty close.

I was trying to compile https://github.com/space-wizards/space-station-14/tree/master/Content.Server and ILC eat up to 25Gb after that crash.

MSBUILD : error MSB4166: Child node "4" exited prematurely. Shutting down. Diagnostic information may be found in files in "C:\Users\kant\AppData\Local\Temp\" and will be named MSBuild_*.failure.txt. This location can be changed by setting the MSBUILDDEBUGPATH environment variable to a different directory.

ILC produce a lot of AOT warnings, and use probably all what's forbidden due to dynamic generation, but I still see that jumping memory usage up to 25Gb is somewhat unexpected for end-user.

from what I see previously that's from types name strings. Maybe introducing something like name mangling, or some other way to address that?

filipnavara commented 10 months ago

Perhaps we can close this now with the new ObjWriter landing for .NET 9?

kant2002 commented 10 months ago

I think yes. I probably would like to take a look how new ObjWriter behave under 16Gb of RAM because it may be quite possible we change SEH to OOM

filipnavara commented 10 months ago

The memory usage should be better, but obviously any testing is welcome, and there is still room for improvement.

jkotas commented 10 months ago

Let's close this and open new issues if necessary.