Closed radical closed 9 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | radical |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
log - full stack trace: https://gist.github.com/radical/8ba41143e077b8b6eab371237c1756aa
@jcouv, thanks for quickly disabling the test. Are you looking into the rootcause of the failure?
@jcouv My attempts at disabling the tests didn't quite work. Could you please take a look?
Thanks for the ping. I'm taking a look. I wasn't able to repro yet.
@radical I'm trying to repro this issue, but am unable so far.
Basically, I copied the deeply nested code from RngchkStress2.cs
and put in a compiler unittest. Then I call the API (SemanticModel.GetOperation
) reported in the overflow stack trace. (btw, thanks for that stack trace)
I'm able to do that in debug mode, which normally consumes a lot more stack. But I get no overflow.
Looking at the code involved in the stack trace (CSharpOperationFactory
, which is a big visitor), I see that code has not changed since January, so it feels unlikely that we've increased the stack consumption there.
Some thoughts so far:
CSharpOperationFactory
has changed, but that also feels unlikely (I looked at the history of "Operation" tests, but saw nothing significant)#error version
in the problematic file would help confirm the version that is used (it will be output as an error message)#error version
before the change would help confirm that (I expect it will be the same as after the change)This was hit on https://github.com/dotnet/runtime/pull/87522 .
This PR upgraded the .NET SDK used to build the repo to .NET 8 Preview 5.
This intermittent build error is likely caused by tiered PGO instrumentation that was enabled in .NET 8 Preview 5 by default. The PGO instrumentation may cause program to use different amount of stack space from run to run.
Thanks @jkotas, that makes sense. Unassigned myself and marked the issue as "untriaged" so that it can be assigned to a new owner.
cc @jakobbotsch (I think this will go away with new PGO collection/update)
I think we should just simplify the test if it is expected that Roslyn will rely on stack for deeply nested input programs. Note that the failures are on osx-x64 which, IIRC, have threads with notoriously small stack limits.
It is certainly expected from our side that new JIT changes may cause generated native code to use more stack. In this case it seems realistic that it is related to tiered PGO being enabled in preview 5. I would guess the source is new inlining as a result of having PGO data available.
FWIW, the test has 193 nested for loops in it, and the stack trace posted above contains CreateBoundForStatementOperation
179 times. So let's just remove 50 levels of nesting.
I also hit this on HugeArray1.cs in https://github.com/dotnet/runtime/pull/89858
/Users/runner/work/1/s/src/tests/JIT/jit64/opt/cse/HugeArray1.cs(10749,63): error CS8078: An expression is too long or complex to compile [/Users/runner/work/1/s/src/tests/JIT/jit64/opt/cse/HugeArray1.csproj]
I suspect there is something wrong with RuntimeHelpers.EnsureSufficientExecutionStack()
function - when I make it no-op I am able to reproduce Stack Overflow with only 200kb of stack (new Thread(DoWork, 200*1024)
) and PGO doesn't affect it (crashes on both PGO=1 and PGO=0)
So, on Apple it seems we only get 364kb of stack (it's m_CacheStackBase - m_CacheStackLimit) + EnsureSufficientExecutionStack
(that Roslyn actively uses) fails if there is less than 128kb of stack left so we only have 364-128=236kb of stack available before RuntimeHelpers.EnsureSufficientExecutionStack
will start throwing exceptions (and Roslyn will start complaining that a program is too complext to compile).
Main thread still has 8mb of stack. Perhaps, Roslyn should spawn threads with custom size of stack? (at least on Apple) cc @jkotas @jaredpar @jakobbotsch
Perhaps, Roslyn should spawn threads with custom size of stack?
It won't fix the problem with compilation of deeply nested statements failing intermittently due to non-determinism of PGO and tiered compilation. It will just move the threshold where the compilation becomes unreliable.
I changed exception type in EnsureSufficientExecutionStack
to BadImageException
just to crash Roslyn and get a call stack. Unfortunately, I wasn't able to install sos - lldb complains on dylib this file does not represent a loadable dylib
(although. I've checked that it's a correct arm64 binary) 🙁 The problem looks to be https://developer.apple.com/forums/thread/729765 that @hoyosjs already filed
@EgorBo do you have a copy of the dump?
Perhaps, Roslyn should spawn threads with custom size of stack?
It won't fix the problem with compilation of deeply nested statements failing intermittently due to non-determinism of PGO and tiered compilation. It will just move the threshold where the compilation becomes unreliable.
Sure, I just want to make sure that none of methods in the trace have unusual stack consumption due to PGO. What I'm saying that 236kb of stack just sounds too low for a code compilation task that involves recursions
@EgorBo do you have a copy of the dump?
I guess I can just follow your advice and compile lldb from sources locally to get sos working
Yes, I was able to load SOS with locally built LLDB (from https://github.com/llvm/llvm-project)
Here is the managed stack with stack pointers - the biggest delta between two pointers for managed calls is 544 bytes so nothing unusual, PGO also only contributes a few kb on top of it.
https://gist.github.com/EgorBo/ddf3d254d840bd61174b43d3527929fb
Summary:
The problem looks to be macOS specific, when Roslyn compiles one of these tests it consumes too much of stack and fails (it doesn't crash with the unrecoverable StackOverflow because it relies on EnsureSufficientExecutionStack()
API).
The typical default stack size for 64bit OSes looks like this: | OS | Main thread | Secondary threads |
---|---|---|---|
Windows | 1.5Mb* | 1.5Mb* | |
Linux | 8Mb | 8Mb | |
macOS | 8Mb | 512Kb |
(* - overriden by .NET)
These thresholds may vary (esp on linux) or can be configured OS-wise. There is also ability to se stack size directly via new Thread(..., stackSize)
or globally via DOTNET_DefaultStackSize
and DEFAULT_STACK_SIZE
)
So, for macOS we only have 512Kb of stack, 128Kb are reseved to be able to show traces for SO (we only get 384kb here with m_CacheStackBase - m_CacheStackLimit), then EnsureSufficientExecutionStack()
requires at least 128kb of stack available. Then, VM's functions (e.g. to kick off a new thread) may occupy several kbs on the call stack so we're left with only ~230kb of stack for Roslyn to do the job. It looks like PGO increases stack consumption for some functions but nothing unusual (none of functions consume more than 544 bytes of stack).
Possible actions:
1) Disable these tests on macOS only (some of them are currently disabled globally)
2) Use e.g. DOTNET_DefaultStackSize=0x100000
(1Mb) for macOS in those tests (tricky for tests groups?)
3) Simplify them (not sure how do that for HugeArray1.cs and hugeSimpleExpr1.cs)
4) Roslyn might want to increase stack size for macOS too - ~230kb sounds too low for such a heavy task
5) Propose an API RuntimeHelpers.EnsureSufficientExecutionStack(long size)
to override the default, a bit too conservative, 128kb limit (or request more if a user uses large stackallocs)
6) Override the default stack size on macOS to some fixed number (at least for the secondary threads)
I personally would like RuntimeHelpers.EnsureSufficientExecutionStack(long size)
. 128k is quite a lot, but it's probably not a good idea to change it. This way I would be able to check something more meaningful than "is there enough stack that I could run basically any main function in a sensible program which doesn't have a stackalloc bug". Ideally I would actually be able to check "is there enough room for my actual function, which I can guess probably doesn't quite need 128kB of stack".
From my experimentation and checking the PE file header, the stack size on Windows seems to be 1.5MB for .NET executables on both main and secondary threads, but for standard Windows executables it's just 1MB by default (for all threads I believe, but I've only tested main for native). Is there a reason to not just increase the secondary thread default on macOS to 1.5/1MB, since we're already not following the default on Windows? Note that the memory should only be reserved until it's used, so this shouldn't even realistically increase memory usage even.
Or we could standardise across these platforms to some specific number (e.g. 2/4/8MB), at least for the secondary threads (you can change it for primary thread on windows in the optional part of the PE header - which is how we set it presumably as the number matches, but I have no idea if similar options exist for macOS and Linux).
Ideally I would actually be able to check "is there enough room for my actual function, which I can guess probably doesn't quite need 128kB of stack".
The 128kB limit is there for exception handling, garbage collection and JIT to run on the thread in addition to your function.
Is there a reason to not just increase the secondary thread default on macOS to 1.5/1MB,
IIRC, it was discussed before but I am not able to find the paper trail. @janvorli Do you remember?
I think .Net should just guarantee some minimum stack size (like 2MB) on all platforms for consistency.
Moving to 9.0.0 as, I guess, there is nothing we can do for .NET 8.0 here?
Roslyn might want to increase stack size for macOS too - ~230kb sounds too low for such a heavy task
Disagree with this direction. What makes Roslyn special compared to any other application that uses the thread pool? Other than it's blocking a runtime test?
Disagree with this direction. What makes Roslyn special compared to any other application that uses the thread pool? Other than it's blocking a runtime test?
Roslyn is a bit special in this regard since it has unbounded recursion that is dependent on user provided input.
Roslyn is a bit special in this regard since it has unbounded recursion that is dependent on user provided input.
Guessing Roslyn is not alone here.
Guessing Roslyn is not alone here.
Agreed, but I would expect this to be a much smaller class of applications than applications that use the thread pool. In any case the to idea make the stack limits consistent across platforms seems good to me.
I would like to understand what exactly has changed since we are seeing this while building a number of different tests, even some innocent looking ones. I hit this while building https://github.com/dotnet/runtime/blob/261e3a40cd9faa240da82aacd8d821469703e339/src/tests/JIT/Methodical/VT/etc/ctor_recurse.cs locally today, and I have seen it on https://github.com/dotnet/runtime/blob/1aded3f65f6fd13a60eb43c191f4ea8aa27bf5dc/src/tests/JIT/Regression/JitBlue/Runtime_85602/Runtime_85602.cs in CI recently as well. These do not look super complex, so it's a weird coincidence if nothing major has changed.
Historically we've seen this due to a combination of factors:
Taken together this can lead to the compiler blowing its stack where it previously did not. This typically plays out in a given .NET release by the following sequence of events:
Runtime makes change that improves inlining. That change gets merged into a .NET SDK. The darc PR to merge that SDK into dotnet/roslyn gets blocked because it breaks the EndToEnd tests we have that verify our nesting limits.
Pretty sure we've seen this play out in runtime before but it's been a while.
These do not look super complex, so it's a weird coincidence if nothing major has changed.
I agree that neither of those look very complex. Those hitting the limits would, in my mind, be cause for investigation as to what changed.
FWIW the compiler actually got significantly better in .NET 8 in terms of level of nesting we can handle. In most cases it improved by 4X compared to .NET 7. So this should actually be going the other direction. Always possible you hit a nesting path we don't test though.
Minimal repro (it blows up even on Windows after certain iteration - looks like Tier0 codegen is fine):
Prerequsites:
RngchkStress2.cs
<PackageReference Include="Microsoft.CodeAnalysis" Version="4.7.0-2.final" />
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;
class Program
{
const string repo = @"C:/prj/runtime";
const string config = "windows.x64.Release";
const string coreRoot = repo + $@"/artifacts/tests/coreclr/{config}/Tests/Core_Root/";
const string sourceFilePath = repo + @"/src/tests/JIT/jit64/opt/rngchk/RngchkStress2.cs";
static async Task Main()
{
string sourceCode = File.ReadAllText(sourceFilePath);
for (int i = 0; i < 100; i++)
{
const int stackSize = 384 * 1024;
var thread = new Thread(() => CompileToDll(sourceCode, "Output.dll"), stackSize);
thread.Start();
await Task.Delay(100);
thread.Join();
}
Console.WriteLine("Started compilation on background thread...");
Console.ReadLine();
}
private static void CompileToDll(string sourceCode, string outputDllPath)
{
SyntaxTree syntaxTree = CSharpSyntaxTree.ParseText(sourceCode);
string assemblyName = Path.GetRandomFileName();
CSharpCompilation compilation = CSharpCompilation.Create(
assemblyName,
syntaxTrees: new[] { syntaxTree },
references: new[] {
MetadataReference.CreateFromFile(coreRoot + "System.Private.CoreLib.dll"),
MetadataReference.CreateFromFile(coreRoot + "System.Runtime.dll"),
MetadataReference.CreateFromFile(coreRoot + "System.Console.dll")
},
options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));
EmitResult result = compilation.Emit(outputDllPath);
if (result.Success)
{
Console.WriteLine($"DLL was successfully created at: {outputDllPath}");
}
else
{
Console.WriteLine("Compilation failed!");
foreach (Diagnostic diagnostic in result.Diagnostics)
if (diagnostic.Severity == DiagnosticSeverity.Error)
Console.WriteLine(diagnostic.GetMessage());
}
}
}
One thing I noticed that If I spawn a new Thread
with stackSize=512 on Windows then I get (int64_t)m_CacheStackBase - (int64_t)m_CacheStackLimit;
= 512kb so exactly what is requested. On macOS at the same time I get 384kb
here for secondary threads when I don't specify any limit
@EgorBo PAL doesn't attempt to override the default stack size except for MUSL Linuxes where it is so low that it is unusable for .NET (tens of kilobytes). But if the size is specified when creating the thread, it should be honored. I am not sure if I read your comment correctly. Did you mean that when you set stack size to 512kB on macOS, you only get 384 kB?
@EgorBo PAL doesn't attempt to override the default stack size except for MUSL Linuxes where it is so low that it is unusable for .NET (tens of kilobytes). But if the size is specified when creating the thread, it should be honored. I am not sure if I read your comment correctly. Did you mean that when you set stack size to 512kB on macOS, you only get 384 kB?
I haven't tested explicit stack size on macOS (can try once I boot to it) but the default one seems to be 384kb. I only tested the explicit on Windows and it gave me what I requested - 512kb.
I was measuring the subraction of these two on both platforms.
According to Apple's doc, the default stack size for the main thread is 8MB and for the secondary threads it is 512kB. https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Multithreading/CreatingThreads/CreatingThreads.html
However, the m_CacheStackLimit
on macOS for secondary threads is not necessarily reliable. We get it from the pthread_get_stacksize_np
and there are reports of it giving various results based on the macOS version.
According to Apple's doc, the default stack size for the main thread is 8MB and for the secondary threads it is 512kB.
Note we already don't follow the default on Windows (we went from 1MB -> 1.5MB for all threads), I see no reason to not increase it for secondary threads on macOS also since 512kB is clearly unacceptably low when we expect 128kB to be used for stack traces, and another 128kB to be available for EnsureSufficientExecutionStack
.
Would a PR to change the stack size on macOS be taken? We could also make all desktop platforms have consistent stack size if that's desired too.
@hamarb123 I would be happy to accept a PR that unifies default secondary stack sizes on 64 bit platforms. Even for Windows where by an accidental change the default stack size was changed from 4MB to 1.5MB about 8 years ago (see https://github.com/dotnet/runtime/issues/96347). @jkotas I wonder if we should move all the 64 bit targets to use 4MB default stack size or be more conservative. It seems that the 1.5MB that we were using before was sufficient in most of the cases (based on the low number of reports of stack overflow issues), so maybe jumping to 4MB might be unnecessarily large step.
Lowering the stack limit on Windows would almost certainly lead to customer dissatisfaction. The C# compiler is naturally recursive and it's limits in certain areas are tied to the default thread stack size. There are a non-trivial number of customers out there who run right on the edge of what we can support today (and we have extensive tests to make sure we don't regress this experience because of previous incidents). Any reduction in stack size would end up breaking our ability to build those customers code bases.
@jaredpar I was definitely not suggesting to lower the current limit, we now have it set to 1.5MB, my point was whether we should move it to 4MB or just say 2MB.
@janvorli
Gotcha
I wonder if we should move all the 64 bit targets to use 4MB default stack size or be more conservative
I was reading the second half of that and interpreting "be more conservative" as potentially considering a lower stack size. Definitely happy with any change that makes Mac bigger.
@jkotas I wonder if we should move all the 64 bit targets to use 4MB default stack size or be more conservative.
If macOS is the only one giving us trouble, I would fix macOS to match Windows and leave it at that.
The max stack size is a perf trade-off. The larger max stack size, the more (unused) memory is potentially committed to stacks. Growing max stack size is potentially replacing stack overflow with out of memory.
Some algorithms (like our own async - look for TryEnsureSufficientExecutionStack
) try to use the stack as much as possible and gracefully cut over when the current thread stack runs out. These algorithms will make sure that the whole available stack is committed.
Good point about the async stuff committing the stack space.
@jkotas I wonder if we should move all the 64 bit targets to use 4MB default stack size or be more conservative.
If macOS is the only one giving us trouble, I would fix macOS to match Windows and leave it at that.
The max stack size is a perf trade-off. The larger max stack size, the more (unused) memory is potentially committed to stacks. Growing max stack size is potentially replacing stack overflow with out of memory.
Some algorithms (like our own async - look for
TryEnsureSufficientExecutionStack
) try to use the stack as much as possible and gracefully cut over when the current thread stack runs out. These algorithms will make sure that the whole available stack is committed.
Do we have these sorts of problems on Linux today, which uses 8MB for every thread? It's also worth mentioning, that we are pushing people to use the stack more and more every .NET version (which I think is a good thing), e.g., with stackalloc
first, recommendations to use structs instead of classes, inline arrays & collection expressions, upcoming params ReadOnlySpan<...>
- all these features are slowly consuming more and more stack space for the same code.
I propose that we change Windows back to 4MB (which it used to be), or we could do 2MB for secondary threads if we think 4 is too high, and change macOS to match it:
OS | Main thread | Secondary threads |
---|---|---|
Windows | 1.5MB -> 2 or 4MB | 1.5MB -> 2 or 4MB |
Linux | 8MB | 8MB |
macOS | 8MB | 512kB -> 2 or 4MB |
Some algorithms (like our own async - look for TryEnsureSufficientExecutionStack) try to use the stack as much as possible and gracefully cut over when the current thread stack runs out. These algorithms will make sure that the whole available stack is committed.
Do we have these sorts of problems on Linux today, which uses 8MB for every thread
FWIW The C# compiler uses TryEnsureSufficientExecutionStack
pretty liberally across all threads. It hasn't caused any issues that I'm aware of. At the same time the compiler is a known memory hog.
Failing on
osx-x64 Release AllSubsets_Mono_Minijit_RuntimeTests minijit
:Failing build:
This was hit on https://github.com/dotnet/runtime/pull/87522 .
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=315357 Result validation: :white_check_mark: Known issue matched with the provided build.
Report
Summary