dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.2k stars 4.72k forks source link

In certain cases, IEnumerator::MoveNext() throws System.InvalidProgramException #9182

Closed myarichuk closed 4 years ago

myarichuk commented 7 years ago

Reproduction of this exception is in the following gist: https://gist.github.com/myarichuk/db2a290ec8ae5ea0faf2698b0c5b02c2

In this case, InvalidProgramException exception is thrown when the anonymous object in in the Map_0() function has more than 43 fields.

The code in the gist runs fine in a .Net 4.7 project.

0xd4d commented 7 years ago

It only crashes if it's run as a 64-bit process. It doesn't matter if it's .NET Core 2.0 or .NET Framework 4.7.1.

mikedn commented 7 years ago

The exception originates in this JIT code: https://github.com/dotnet/coreclr/blob/15e1472aab7ed42dafde0331ed6cece465a0763d/src/jit/emit.cpp#L567-L582

I haven't checked yet the IL but it looks like a method contains a large number of locals (56129). Or the JIT generates a ton of locals when importing it...

AndyAyersMS commented 7 years ago

Maybe related to dotnet/runtime#8420?

mikedn commented 7 years ago

Maybe related to 12480?

Yup, it looks very similar. The IL method has only 4 locals but the IL code itself contains an unfortunate pattern - load a bunch of stuff onto the stack and then branch. The net result - an avalanche of spill temps.

mikedn commented 7 years ago

Some numbers: lva count = 56129 bb count = 2194 il size = 68941 dynamic call sites = 858

I suppose it would be nice for the JIT to handle such a monster but the code is as horrible as it gets. Hopefully the sample code is just that, sample code, and not actual production code.

myarichuk commented 7 years ago

It is not a production code, but it is related to production code: The sample code in the gist came from an index compiled by RavenDB 4.0 - it is part of an issue investigation here - http://issues.hibernatingrhinos.com/issue/RavenDB-9040 (In RavenDB indexes are defined as LINQ queries and compiled at the server-side.)

Since RavenDB allows users to define indexes with the syntax of LINQ queries, it is definitely possible that users will write similar code to what can be seen in the repro gist.

(Index compiler code can be seen here: https://github.com/ravendb/ravendb/blob/v4.0/src/Raven.Server/Documents/Indexes/Static/IndexCompiler.cs)

myarichuk commented 7 years ago

@0xd4d you are right - since I tested that on AnyCPU, haven't noticed that it depends on 64/32 bits and not on runtime.

ayende commented 7 years ago

@mikedn the code is something that went through multiple translation stages to get where it did. The actual originating code isn't too hot either, though much simpler.

mikedn commented 7 years ago

the code is something that went through multiple translation stages to get where it did. The actual originating code isn't too hot either, though much simpler.

Presumably moving doc.NumVals.ContainsKey("Value001") && doc.NumVals["Value001"] != null ? (doc.NumVals["Value001"]).Value1 : ((double?)null), to a double? GetValue(dynamic values, string name) isn't an option?

It only crashes if it's run as a 64-bit process. It doesn't matter if it's .NET Core 2.0 or .NET Framework 4.7.1.

Yep, works on x86 and requires over 200 kbytes of stack :)

ayende commented 7 years ago

The original code looked something like this:

   select new
                    {
                        doc.Id,
                        Value001 = doc.NumVals.ContainsKey("Value001") && doc.NumVals["Value001"] != null ? doc.NumVals["Value001"].Value1 : (double?)null,
                        Value002 = doc.NumVals.ContainsKey("Value002") && doc.NumVals["Value002"] != null ? doc.NumVals["Value002"].Value1 : (double?)null,
}               

And yes, (double?)doc.NumVals["Value002"]?.Value1 should actually work here. But that is code sent by customer.

AndyAyersMS commented 7 years ago

Also it may be that disabling CSC's optimizer here (assuming it is on) will lessen CSC's appetite for pending things on the eval stack across branches.

ayende commented 7 years ago

How do I do that?

mikedn commented 7 years ago

Also it may be that disabling CSC's optimizer here (assuming it is on) will lessen CSC's appetite for pending things on the eval stack across branches.

Tried, no effect. It's quite possible that the CSC optimizer isn't involved in this, that's just how CSC generates dynamic callsites.

jkotas commented 7 years ago

Also related to https://github.com/dotnet/coreclr/issues/14103

AndyAyersMS commented 7 years ago

I looked at fixing dotnet/runtime#8420 in the jit by re-using spill temps across multiple joins, but it quickly turns into solving a general pre-import dataflow problem over the entire flow graph where we'd first need to scan the IL to figure out the stack behavior of each block. Today the jit only "solves" a constrained version of this for spill cliques.

Wonder how hard it would be to fix this in the encoder, maybe via some kind of escape bits for local addresses that index into a separate table with bigger entries.

mikedn commented 7 years ago

Perhaps it would better if managed compilers refrain from generating this kind of IL. Issues like this and dotnet/runtime#9130 seem to indicate that it's just too much trouble to deal with this aspect of the IL stack. And it's quite possible that managed compilers think that they're doing the JIT a favor by keeping things on the stack instead of spilling to locals...

AndyAyersMS commented 6 years ago

We now produce a somewhat clearer error message when the jit hits an internal limitation (see dotnet/coreclr#16320). So hopefully anyone who hits this realizes that there is something overly complex about their method and can find workarounds.

Moving this issue to future as we should still look into ways to reduce the number of cases where the jit runs into internal limitations.

AndyAyersMS commented 6 years ago

This may have been fixed by dotnet/coreclr#17780.

@briansull can you take a look?

briansull commented 6 years ago

Fixed with dotnet/coreclr#17793

ppekrol commented 6 years ago

Hi, we are using .NET Core 2.1.2 and this issue still occurs. Failing test here: https://github.com/ravendb/ravendb/blob/v4.0/test/SlowTests/Issues/RavenDB_10621.cs

maximburyak commented 6 years ago

Hi, Running 2.1.400-preview-009063, we have another repro of this (similiar) issue, see: https://gist.github.com/maximburyak/aa75e379e3315af6980bb6ec8fd8dce1

Note that replacing "this0.s." with an empty string allows the code to run successfully.