Closed: BruceForstall closed this issue 1 year ago
Tagging subscribers to this area: @dotnet/area-extensions-logging. See info in area-owners.md if you want to be subscribed.
| Author: | BruceForstall |
|---|---|
| Assignees: | - |
| Labels: | `JitStress`, `area-Extensions-Logging` |
| Milestone: | - |
@BruceForstall
This is an out-of-memory situation when running under a stress environment. It would be easier to handle as an infra issue: use a machine configuration with more memory, or have the jit-stress pipeline serialize or exclude such tests. I don't think there is anything wrong with the test itself.
This issue has been marked `needs-author-action` and may be missing some important information.
@BruceForstall
> This is an out-of-memory situation when running under a stress environment. It would be easier to handle as an infra issue: use a machine configuration with more memory, or have the jit-stress pipeline serialize or exclude such tests. I don't think there is anything wrong with the test itself.
I don't recall seeing this before. I'm not sure if something changed in the machine environment, stress modes, CLR VM/libraries, or the test itself. I wonder whether the test uses significantly more memory under stress or not.
We have had a recent trivial change in this area, but it shouldn't cause any more memory allocations in general. I am not sure if something was updated in the reflection or interop layer that causes more allocations.
CC @steveharter @jkoritzinsky
We haven't changed anything in interop that would use significantly more memory, especially only in JitStress.
These tests have failed in multiple runs in the "runtime-coreclr libraries-jitstress" pipeline as well:
including in the "jitminopts" configuration, which disables JIT optimization.
They have also failed in multiple runs of the "runtime-coreclr libraries-pgo" pipeline:
https://dev.azure.com/dnceng-public/public/_build?definitionId=145&_a=summary
Failures started in the last 2 days.
Presumably these are the changes:
https://dev.azure.com/dnceng-public/public/_traceability/runview/changes?currentRunId=318154
@stephentoub Could https://github.com/dotnet/runtime/pull/87904 cause this?
> @stephentoub Could https://github.com/dotnet/runtime/pull/87904 cause this?
I don't see how it could.
The out of memory is happening while creating and building the project for source generation in SourceGenerators.Tests.RoslynTestUtils.CreateTestProject, which is not really exercising any logging code at that point. Did we have any recent changes to the compiler version we are using in the runtime repo?
Well, there is this: https://github.com/dotnet/runtime/pull/87522. I don't know how to read that w.r.t. your question about compiler changes.
@jaredpar Maybe Roslyn related?
This is the SRM layer allocating in response to items in the PE file. That code comes from runtime 😄
This code really hasn't changed recently to my knowledge. It's fairly stable. That particular code path is getting its size parameters from code within SRM, so it's not a place where we're passing incorrect sizes down from Roslyn. My expectation is that if you get a dump here, we would see a fairly expected size being passed to AllocHGlobal.
This test (5 tests in the assembly, actually) reliably fails for me with out of memory with NO JitStress or tiering configuration variables set, using Release libraries and a Checked runtime. With a Release runtime, it passes (and quickly).
I changed the label to the VM area as it looks like the issue happens when switching to the checked runtime. @BruceForstall, if you can repro this reliably, it would be good to create a full dump when the issue happens. The other guess is that something is allocating a lot of native memory, causing other allocations to fail.
> The other guess is that something is allocating a lot of native memory, causing other allocations to fail.
It looks like each test (?) allocates about 17MB (loading a PE file from a stream?) and AFAICT, the memory doesn't get deallocated.
It would be good for someone who knows how to debug this managed test to investigate.
@mangod9 This is assigned to area-VM-coreclr. It's creating a lot of noise in the CI.
I notice that https://github.com/dotnet/runtime/pull/85738 updated the SRM version in runtime last month. Perhaps the test should be disabled till this can be investigated? Don't believe this belongs in the VM area.
@jaredpar It smells to me like this is a memory leak in the Roslyn source generators tests, or related / called components.
The test doesn't fail with a Release runtime, but it does eat up memory until about 1.1GB before the test completes. With a Checked runtime, it eats up memory until about 1.5GB before tests start to fail. It's certainly possible that Checked runtimes started using more memory, changed fragmentation, etc.
This stack trace allocates a lot of memory:
> System.Reflection.Metadata.dll!System.Reflection.PortableExecutable.PEReader.PEReader(System.IO.Stream peStream, System.Reflection.PortableExecutable.PEStreamOptions options, int size) Line 191 C#
> System.Reflection.Metadata.dll!System.Reflection.PortableExecutable.PEReader.PEReader(System.IO.Stream peStream, System.Reflection.PortableExecutable.PEStreamOptions options) Line 130 C#
> Microsoft.CodeAnalysis.dll!Microsoft.CodeAnalysis.ModuleMetadata.CreateFromStream(System.IO.Stream peStream, System.Reflection.PortableExecutable.PEStreamOptions options) Line 133 C#
> Microsoft.CodeAnalysis.dll!Microsoft.CodeAnalysis.MetadataReference.CreateFromFile(System.IO.Stream peStream, string path, Microsoft.CodeAnalysis.MetadataReferenceProperties properties, Microsoft.CodeAnalysis.DocumentationProvider documentation) Line 76 C#
> Microsoft.CodeAnalysis.dll!Microsoft.CodeAnalysis.MetadataReference.CreateFromFile(string path, Microsoft.CodeAnalysis.MetadataReferenceProperties properties, Microsoft.CodeAnalysis.DocumentationProvider documentation) Line 71 C#
> Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll!SourceGenerators.Tests.RoslynTestUtils.CreateTestProject(System.Collections.Generic.IEnumerable<System.Reflection.Assembly> references, bool includeBaseReferences, Microsoft.CodeAnalysis.CSharp.LanguageVersion langVersion) Line 41 C#
> Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll!SourceGenerators.Tests.RoslynTestUtils.RunGenerator(Microsoft.CodeAnalysis.IIncrementalGenerator generator, System.Collections.Generic.IEnumerable<System.Reflection.Assembly> references, System.Collections.Generic.IEnumerable<string> sources, bool includeBaseReferences, Microsoft.CodeAnalysis.CSharp.LanguageVersion langVersion, System.Threading.CancellationToken cancellationToken) Line 159 C#
> Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll!Microsoft.Extensions.Logging.Generators.Tests.LoggerMessageGeneratorParserTests.RunGenerator(string code, bool wrap, bool inNamespace, bool includeBaseReferences, bool includeLoggingReferences, System.Threading.CancellationToken cancellationToken) Line 947 C#
> Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll!Microsoft.Extensions.Logging.Generators.Tests.LoggerMessageGeneratorParserTests.DoubleLogLevel_FirstOneSetAsMethodParameter_SecondOneInMessageTemplate_Supported() Line 317 C#
In particular, it frequently allocates 17MB. I never saw the Dispose method called.
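For context, a minimal standalone sketch of the pattern that stack trace boils down to; the file path below is a placeholder, and the prefetch option is my assumption about what the CreateFromFile path ends up using, based on the native-heap allocations described above:

```csharp
using System.IO;
using System.Reflection.PortableExecutable;

// Placeholder path; any assembly of roughly this size would show the same behavior.
FileStream peStream = File.OpenRead("SomeAssembly.dll");

// Assumption: prefetching copies the entire PE image out of the stream up front,
// so the reader holds roughly image-sized memory until Dispose runs.
var reader = new PEReader(peStream, PEStreamOptions.PrefetchEntireImage);

// In the trace above the PEReader is owned by the ModuleMetadata created behind
// MetadataReference.CreateFromFile, and nothing ever disposes it, so each call
// keeps an image-sized block alive for the rest of the run.
reader.Dispose();
peStream.Dispose();
```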
Can the Microsoft.CodeAnalysis.dll or Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll owner please investigate?
cc @jkotas
> Can the Microsoft.CodeAnalysis.dll or Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests.dll owner please investigate?
Believe the Tests owner should investigate, as they're driving the Roslyn behavior here.
> Believe the Tests owner should investigate, as they're driving the Roslyn behavior here.
@tarekgh Based on recent changes to src\libraries\Microsoft.Extensions.Logging.Abstractions\tests\Microsoft.Extensions.Logging.Generators.Tests, that would be you?
I'll try to take a look but will not be able to do so soon as I am looking at some higher priority stuff and I am off this week. If it is blocking, I suggest disabling the test in the failing scenario for now till we have a chance to look more.
The only way I can think of to disable the test is to disable it completely for x86. The number of tests that fail varies, which makes sense as we'll run out of memory at different times.
@ericstj Can you please get someone to disable this test on x86, ASAP?
(btw, I should point out that by "this test" I mean the entire Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests assembly -- all the tests in this assembly. Because it's not just one or two, but up to 9 or so that I've seen fail)
@BruceForstall will try to disable them tomorrow if this can wait until tomorrow.
Disabling seems like the right quick fix. Probably we need to dispose this AdHocWorkspace:
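For illustration, a hedged sketch of that kind of change; the workspace usage below is hypothetical and simplified, not the actual RoslynTestUtils code:

```csharp
using Microsoft.CodeAnalysis;

// AdhocWorkspace (a Workspace) is IDisposable; scoping it to the test run
// releases whatever it keeps rooted instead of leaving that to finalization.
using (var workspace = new AdhocWorkspace())
{
    Project project = workspace.AddProject("generator-test", LanguageNames.CSharp);
    // ... add documents and metadata references, run the generator, assert ...
}
```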
> In particular, it frequently allocates 17MB. I never saw the Dispose method called.
Missing Dispose is not an unusual perf issue with Roslyn and System.Reflection.Metadata APIs. The documentation at https://learn.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.metadatareference.createfromfile highlights that disposing is important for managing memory footprint.
This needs to be fixed in the test.
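As a hedged sketch of what the docs recommend (the path below is illustrative; the real fix has to fit how the test helper builds its reference list): create the metadata explicitly so the test owns its lifetime, instead of calling MetadataReference.CreateFromFile and never releasing the PE image behind it.

```csharp
using Microsoft.CodeAnalysis;

// Illustrative input; stands in for whichever reference assemblies the test loads.
string path = typeof(object).Assembly.Location;

// AssemblyMetadata owns the loaded PE image; disposing it releases that memory.
// The returned reference must only be used while the metadata is still alive.
using (AssemblyMetadata metadata = AssemblyMetadata.CreateFromFile(path))
{
    MetadataReference reference = metadata.GetReference();
    // ... build the Compilation and run the generator inside this scope ...
}
```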
Honestly these tests should move to the Roslyn testing SDK instead of setting things up manually. That will solve these sorts of problems and help make these tests easier to maintain.
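For a rough idea of what that could look like, here is a sketch against the Microsoft.CodeAnalysis.CSharp.SourceGenerators.Testing.XUnit package; the test source is a placeholder, and the verifier choice and SDK version (it must be one that supports incremental generators) are assumptions that would need to be checked against what the repo already references:

```csharp
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Testing;
using Microsoft.CodeAnalysis.Testing.Verifiers;
using Microsoft.Extensions.Logging.Generators;
using Xunit;

public class LoggerMessageGeneratorSdkTests
{
    [Fact]
    public async Task RunsGeneratorOnTestSource()
    {
        // The testing SDK owns the workspace, the reference set, and their
        // lifetimes, so the test no longer hand-rolls AdhocWorkspace or
        // MetadataReference.CreateFromFile.
        var test = new CSharpSourceGeneratorTest<LoggerMessageGenerator, XUnitVerifier>
        {
            TestCode = "partial class C { }", // placeholder for the real generator input
        };

        await test.RunAsync();
    }
}
```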
I'll try to take a look later today.
Also, Microsoft.Extensions.Logging.Generators.Tests.LoggerMessageGeneratorParserTests.NotPartial and others also fail, depending on the run.
All on Windows/x86. JitStress is not required. It does seem to require a Checked runtime, and does not fail with a Release runtime.
E.g.,
https://dev.azure.com/dnceng-public/public/_build/results?buildId=317311&view=ms.vss-test-web.build-test-results-tab&runId=6505130&resultId=137246&paneView=debug
It's not clear when this started happening. The last "runtime-coreclr libraries-jitstress" pipeline run without this failure was https://dev.azure.com/dnceng-public/public/_build/results?buildId=316592&view=results. However, I can repro the failure locally even with this git hash.
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=317311 Result validation: :white_check_mark: Known issue matched with the provided build.