ahsonkhan closed this issue 4 years ago.
@ahsonkhan, it should have uploaded a dump from the remote process: "Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp" Were you able to grab that? It looks like you restarted the legs.
Previously this was blocked by https://github.com/dotnet/core-eng/issues/7950 (https://github.com/dotnet/corefx/issues/41528) .. if dumps work now that's awesome.
@danmosemsft, the one with the dump is Windows.
@stephentoub ah, I just saw the RedHat above.
It looks like https://github.com/dotnet/core-eng/issues/7950 was recently closed, so I would hope to get a dump from that one too.
> so I would hope to get a dump from that one too.
The dump that's relevant here is from the RemoteExecutor.Invoke child process; on Windows we now explicitly P/Invoke to MiniDumpWriteDump in order to save out a dump. That doesn't happen on Linux. https://github.com/dotnet/arcade/pull/4085
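For context, a minimal sketch of that approach (this is not the actual Microsoft.DotNet.RemoteExecutor code; the class name, dump type, and error handling are my choices):

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

static class MiniDump
{
    [DllImport("DbgHelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Writes a full-memory dump of the (child) process to the given path.
    public static void Write(Process process, string path)
    {
        using (FileStream fs = File.Create(path))
        {
            if (!MiniDumpWriteDump(process.Handle, (uint)process.Id, fs.SafeFileHandle,
                                   2 /* MiniDumpWithFullMemory */,
                                   IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
            {
                throw new Win32Exception(Marshal.GetLastWin32Error());
            }
        }
    }
}
```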
Should we have an issue to find a way to do it on Linux? Presumably deploying and invoking createdump?
Sure :)
> Were you able to grab that? It looks like you restarted the legs.
No, sorry. Will remember to do that next time.
Updated the link above (edited the post as the link was wrong).
0:000> k
# Child-SP RetAddr Call Site
00 00000005`6597c310 00007ffc`becda9b5 coreclr!memset_repmovs+0x9 [d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 66]
01 00000005`6597c320 00007ffc`becda417 coreclr!WKS::gc_heap::adjust_limit_clr+0x155 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12087]
02 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_segment_end_p+0x134 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12864]
03 (Inline Function) --------`-------- coreclr!WKS::gc_heap::soh_try_fit+0x1ad [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13002]
04 00000005`6597c390 00007ffc`becda148 coreclr!WKS::gc_heap::allocate_small+0x247 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13067]
05 (Inline Function) --------`-------- coreclr!WKS::gc_heap::try_allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13939]
06 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369]
07 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x116 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400]
08 00000005`6597c490 00007ffc`bec54ee7 coreclr!WKS::GCHeap::Alloc+0x148 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35826]
09 00000005`6597c4f0 00007ffc`bec54732 coreclr!Alloc+0x16f [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 251]
0a 00000005`6597c550 00007ffc`bec6603c coreclr!AllocateSzArray+0xee [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 543]
0b (Inline Function) --------`-------- coreclr!AllocateSzArray+0x19 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 421]
0c 00000005`6597c5f0 00007ffc`bec64d55 coreclr!AllocateObjectArray+0x48 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 964]
0d 00000005`6597c630 00007ffc`bec54020 coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::CheckGrowth+0xc9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 248]
0e (Inline Function) --------`-------- coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::Add+0x9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 438]
0f 00000005`6597c660 00007ffc`bec5399e coreclr!CrossLoaderAllocatorHash<InliningInfoTrackerHashTraits>::Add+0x314 [f:\workspace\_work\1\s\src\vm\crossloaderallocatorhash.inl @ 389]
10 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInliningDontTakeLock+0x5f3 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 664]
11 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInlining+0xa46 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 648]
12 (Inline Function) --------`-------- coreclr!Module::AddInlining+0xa6e [f:\workspace\_work\1\s\src\vm\ceeload.cpp @ 142]
13 00000005`6597c740 00007ffc`d4649e8a coreclr!CEEInfo::reportInliningDecision+0x110e [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 8319]
14 00000005`6597cd10 00007ffc`d45c6b13 clrjit!InlineResult::Report+0xde [f:\workspace\_work\1\s\src\jit\inline.cpp @ 749]
15 (Inline Function) --------`-------- clrjit!InlineResult::{dtor}+0x9 [f:\workspace\_work\1\s\src\jit\inline.h @ 420]
16 00000005`6597cd50 00007ffc`d45c5041 clrjit!Compiler::fgInline+0x273 [f:\workspace\_work\1\s\src\jit\flowgraph.cpp @ 22027]
17 00000005`6597ced0 00007ffc`d4622d38 clrjit!Compiler::fgMorph+0x131 [f:\workspace\_work\1\s\src\jit\morph.cpp @ 16632]
18 00000005`6597d1b0 00007ffc`d4631726 clrjit!Compiler::compCompile+0x368 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 4560]
19 00000005`6597d2f0 00007ffc`d462fd7d clrjit!Compiler::compCompileHelper+0x296 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6099]
1a 00000005`6597d3a0 00007ffc`d462fa33 clrjit!Compiler::compCompile+0x21d [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 5427]
1b 00000005`6597d450 00007ffc`d462ecc2 clrjit!jitNativeCode+0x273 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6727]
1c 00000005`6597d610 00007ffc`bec79699 clrjit!CILJit::compileMethod+0x92 [f:\workspace\_work\1\s\src\jit\ee_il_dll.cpp @ 319]
1d (Inline Function) --------`-------- coreclr!invokeCompileMethodHelper+0x63 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12464]
1e 00000005`6597d680 00007ffc`bec7952c coreclr!invokeCompileMethod+0xdd [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12529]
1f 00000005`6597d720 00007ffc`bec78dbf coreclr!CallCompileMethodWithSEHWrapper+0x50 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12583]
20 00000005`6597d7c0 00007ffc`bec77d91 coreclr!UnsafeJitFunction+0xe1f [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 13070]
21 00000005`6597dbc0 00007ffc`bec77889 coreclr!MethodDesc::JitCompileCodeLocked+0x2e1 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 974]
22 00000005`6597ddb0 00007ffc`bec771f4 coreclr!MethodDesc::JitCompileCodeLockedEventWrapper+0x33d [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 830]
23 00000005`6597def0 00007ffc`becbbdca coreclr!MethodDesc::JitCompileCode+0x194 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 769]
24 00000005`6597df90 00007ffc`bed595a5 coreclr!MethodDesc::PrepareILBasedCode+0xfa [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 398]
25 (Inline Function) --------`-------- coreclr!MethodDesc::PrepareCode+0xc [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 330]
26 00000005`6597dfc0 00007ffc`becb9d06 coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x121 [f:\workspace\_work\1\s\src\vm\codeversion.cpp @ 1800]
27 00000005`6597e140 00007ffc`becb9815 coreclr!MethodDesc::DoPrestub+0x2b6 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1972]
28 00000005`6597e210 00007ffc`bed4db95 coreclr!PreStubWorker+0x455 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1798]
29 00000005`6597e420 00007ffc`5f264362 coreclr!ThePreStub+0x55 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm @ 22]
2a 00000005`6597e4d0 000001c6`00010fa8 0x00007ffc`5f264362
2b 00000005`6597e4d8 00000005`6597e518 0x000001c6`00010fa8
2c 00000005`6597e4e0 000001c6`0000cc90 0x00000005`6597e518
2d 00000005`6597e4e8 00000000`ffffffff 0x000001c6`0000cc90
2e 00000005`6597e4f0 00000000`0000003e 0xffffffff
2f 00000005`6597e4f8 00000000`00000003 0x3e
30 00000005`6597e500 00000000`00000000 0x3
cc @jkotas
And for learning purposes, to find the uploaded files (i.e. after a retry) you can use the Helix API directly: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files
This will list all uploaded files, in this case I searched for the ".dmp" file and added it to the url: "/4708.wrrjax4w.c3b.dmp".
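A hedged sketch of scripting that lookup (the job and work item IDs are the ones from this thread):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ListHelixFiles
{
    static async Task Main()
    {
        // Lists the files uploaded for a work item; append a file name
        // (e.g. the ".dmp") to this URL to download it.
        const string url = "https://helix.dot.net/api/2019-06-17/jobs/" +
            "421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/" +
            "System.Text.RegularExpressions.Tests/files";
        using (var http = new HttpClient())
        {
            Console.WriteLine(await http.GetStringAsync(url));
        }
    }
}
```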
Thanks, @ViktorHofer.
This is happening early on in the invocation of the remote process, as we're looking up the method to invoke:
coreclr!memset_repmovs+0x9
This seems to be stuck on a page-in request in the OS. I do not see anything wrong on the runtime side.
The GC is trying to clear memory from 0x000001c600011398 to 0x000001c6000139d8. The clearing is stuck at address 0x000001c600013000 (i.e. at a 4 KB page boundary: the address is 0x1000-aligned).
We need to get more dumps for this to see the pattern.
From just looking at this dump, my bet would be on infrastructure problem: The physical machine is over-subscribed and super slow.
> we're looking up the method to invoke
Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke.
Right, understood. I was pasting the image from VS to show the managed frames that were obscured in the windbg stack trace as well as the arguments to those methods.
Odd that we've had three hangs (in RemoteExecutor, in the Regex cache tests). I'd have thought a slow machine would give a more distributed set of hangs.
They could each have a different root cause. Do you happen to have dumps for the other two hangs?
We don't have more dumps yet, and we only produce them on Windows for these kinds of failures.
This failed again just now on Windows, here's the dump: https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp
We are in the middle of JITing System.Text.RegularExpressions.Regex.InitDefaultMatchTimeout, called from the Regex static constructor. The thread is not blocked for any obvious reason.
So the common theme so far: the process is stuck very early during startup, in the middle of JIT compilation, with no obvious blockage on the runtime side.
Keep more dumps coming...
Another hit in https://github.com/dotnet/coreclr/pull/27375, in the System.Transactions tests. The process is stuck early during startup inside System.Transactions.TransactionsEtwProvider. This hit was on a checked build of the runtime, which allows us to see what happened in the past using the stress log.
The full stress log is here: log.txt
The interesting bits of the stress log are:
THREAD TIMESTAMP FACILITY MESSAGE
ID (sec from start)
--------------------------------------------------------------------------------------
173c 153.830905200 : `CLASSLOADER` MethodTableBuilder: finished method table for module 00007FFD947D4020 token 20006be = 00007FFD94C9C678 (System.Globalization.UnicodeCategory)
173c 152.449501000 : `CLASSLOADER` MethodTableBuilder: finished method table for module 00007FFD947D4020 token 200037c = 00007FFD94AB3200 (System.Runtime.Intrinsics.Vector128`1)
1108 110.031347100 : `GCROOTS` Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 110.031328100 : `GCROOTS` Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 110.031286600 : `GCROOTS` Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 110.031261900 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 3, (requested generation = 2, collect_classes = 9) ==========
1108 110.031259000 : `GC` (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 110.031258700 : `GC` (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 110.031258500 : `GC` (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 110.031258300 : `GC` (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 110.031258000 : `GC` (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 110.031257700 : `GC` Start of GC stack
1108 61.804306200 : `GCROOTS` Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 61.804288500 : `GCROOTS` Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 61.804249300 : `GCROOTS` Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 61.804225600 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 2, (requested generation = 1, collect_classes = 9) ==========
1108 61.804222700 : `GC` (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 61.804222500 : `GC` (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 61.804222300 : `GC` (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 61.804222000 : `GC` (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 61.804221800 : `GC` (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 61.804221600 : `GC` Start of GC stack
1108 13.291604900 : `GCROOTS` Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 13.291578000 : `GCROOTS` Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 13.291504100 : `GCROOTS` Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 13.291459400 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 1, (requested generation = 1, collect_classes = 9) ==========
1108 13.291456300 : `GC` (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 13.291456000 : `GC` (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 13.291455800 : `GC` (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 13.291455500 : `GC` (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 13.291455300 : `GC` (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 13.291455100 : `GC` Start of GC stack
1108 13.291396100 : `GC` GC is triggered
173c 0.335993200 : `CLASSLOADER` MethodTableBuilder: finished method table for module 00007FFD947D4020 token 2000175 = 00007FFD94AB2D30 (System.SpanHelpers)
This confirms my theory that the machine is oversubscribed when this hang/timeout happens: per the stress log timestamps above, the process made almost no progress over roughly 150 seconds. We need to find out what else is going on on the machine that is taking all the resources.
Another one on RHEL7. No dump because of https://github.com/dotnet/arcade/issues/4153, but for the record: netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_culture_and_options
Repro with the more verbose logging from https://github.com/dotnet/coreclr/pull/27554: console.0680a4aa (1).txt
The machine memory load is 94%, which is way too much. The top processes consuming memory:
Process | Private Memory (bytes) |
---|---|
svchost | 8,848,990,208 |
dotnet | 3,262,791,680 |
MsMpEng | 136,298,496 |
ServerManager | 99,528,704 |
WindowsAzureTelemetryService | 55,070,720 |
WindowsAzureGuestAgent | 47,337,472 |
WaAppAgent | 46,862,336 |
python | 24,686,592 |
Looks like a svchost process got out of control.
I assume that the Helix machines are recycled without rebooting. Is there a way to find out what ran on the Helix machine before? I suspect that there is a misbehaving test workitem that leaves the machine in a bad state.
@dotnet/dnceng Could you please help here? We want to find out what is causing svchost memory consumption to go through the roof on a Helix machine. Would it be possible to log the machine memory consumption before and after each workitem, and identify jobs that make the working set of any svchost process go above 1 GB?
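To illustrate the kind of logging meant here, a minimal sketch (the 1 GB threshold and the output format are my assumptions, loosely modeled on the instrumentation output quoted later in this thread):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class MemorySnapshot
{
    static void Main()
    {
        const long threshold = 1L * 1024 * 1024 * 1024; // 1 GB

        // Report every process whose private memory exceeds the threshold,
        // largest first, as "Process: <pid> <name> PrivateMemory: <bytes>".
        foreach (Process p in Process.GetProcesses()
                     .Where(proc => proc.PrivateMemorySize64 > threshold)
                     .OrderByDescending(proc => proc.PrivateMemorySize64))
        {
            Console.WriteLine($"Process: {p.Id} {p.ProcessName} PrivateMemory: {p.PrivateMemorySize64}");
        }
    }
}
```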
@jkotas correct, and we do make an earnest effort to kill any processes started as part of a run. For any given work item it's trivial to query every work item that ran on that machine and I will look into this specific one shortly.
@jkotas that machine ran precisely two work items between being created and the RegEx test.
Logs: ConsoleUri https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.Runtime.Handles.Tests/console.ba534629.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.ComponentModel.Composition.Registration.Tests/console.fad9ebf4.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl
Kusto query:
WorkItems
| where QueueName == "windows.10.amd64.open"
| where MachineName == "b000MOS"
| order by Finished asc
so ... it's possible those other two work items spawned unnoticed, somehow-unkillable processes (I'll check their harness logs after my next meeting) that consumed all 7 GB of RAM on this machine, but since there are only two it seems unlikely.
Harness logs were pretty boring too: LogUri https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.ComponentModel.Composition.Registration.Tests/ba80c381-36ae-4daa-a1c2-4dd7a7bc3770.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.Runtime.Handles.Tests/e2a27f45-aa3e-4cae-a895-a2ffdcd4bbf2.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl
I have done an experiment to capture the memory utilization on a Helix machine when running the RegularExpressions tests. The top processes consuming memory:
Process | Private Memory (bytes) |
---|---|
dotnet | 3,271,024,640.00 |
ServerManager | 97,714,176.00 |
WindowsAzureTelemetryService | 53,587,968.00 |
WaAppAgent | 47,771,648.00 |
WindowsAzureGuestAgent | 47,554,560.00 |
explorer | 35,155,968.00 |
python | 26,943,488.00 |
svchost | 18,296,832.00 |
So everything is very similar to the failure case, except for the 9 GB svchost.
@jkotas I'm having a hard time inferring whether there's a problem here (that list seems normal to me based on what I know is running there... the Azure stuff is non-negotiable, and Server Manager running on startup is a default Windows Server behavior). Can you clarify if you think there's an infra-related problem?
I am adding extra instrumentation to figure out which service is eating the memory: https://github.com/dotnet/arcade/pull/4253. It will take a few more peel-the-onion iterations to find the root cause. My bet is that the root cause will be infrastructure related.
I'd be surprised, but I'm following this thread eagerly and will get an issue created to update the machines if you find something. We don't actually install any Windows services on that queue so I'd be surprised to see that be the root cause.
@jkotas it seems like this test run hit a timeout in the remote process and dumped the logging you added:
System.Tests.SetEnvironmentVariable.EnvironmentVariableValueTooLarge_Throws [FAIL]
Half-way through waiting for remote process.
Memory load: 99
Process: 0 Idle PrivateMemory: 0
Process: 4 System PrivateMemory: 131072
Process: 296 smss PrivateMemory: 409600
Process: 384 csrss PrivateMemory: 1937408
Process: 464 wininit PrivateMemory: 962560
Process: 472 csrss PrivateMemory: 1974272
Process: 524 winlogon PrivateMemory: 1953792
Process: 596 services PrivateMemory: 3325952
Process: 604 lsass PrivateMemory: 5713920
Process: 688 svchost PrivateMemory: 5955584
Process: 744 svchost PrivateMemory: 4235264
Process: 864 dwm PrivateMemory: 17993728
Process: 892 svchost PrivateMemory: 4411392
Process: 900 svchost PrivateMemory: 5726212096
Process: 964 svchost PrivateMemory: 15978496
Process: 992 svchost PrivateMemory: 11792384
Process: 376 svchost PrivateMemory: 11730944
Process: 348 svchost PrivateMemory: 7675904
Process: 1092 svchost PrivateMemory: 3018752
Process: 1308 svchost PrivateMemory: 7344128
Process: 1316 VSSVC PrivateMemory: 1572864
Process: 1532 svchost PrivateMemory: 1712128
Process: 1960 spoolsv PrivateMemory: 5685248
Process: 2008 svchost PrivateMemory: 3768320
Process: 2024 svchost PrivateMemory: 7680000
Process: 1264 svchost PrivateMemory: 2101248
Process: 1608 MsMpEng PrivateMemory: 115113984
Process: 320 svchost PrivateMemory: 4354048
Process: 1804 WaAppAgent PrivateMemory: 46907392
Process: 1668 WindowsAzureGuestAgent PrivateMemory: 43810816
Process: 2020 WindowsAzureTelemetryService PrivateMemory: 57057280
Process: 2064 svchost PrivateMemory: 7237632
Process: 2092 sqlwriter PrivateMemory: 1744896
Process: 3164 RuntimeBroker PrivateMemory: 8122368
Process: 3224 rundll32 PrivateMemory: 3051520
Process: 3244 svchost PrivateMemory: 3911680
Process: 3268 taskhostw PrivateMemory: 4669440
Process: 3364 sihost PrivateMemory: 4149248
Process: 3572 explorer PrivateMemory: 22745088
Process: 3660 svchost PrivateMemory: 2891776
Process: 860 rundll32 PrivateMemory: 3244032
Process: 1708 ShellExperienceHost PrivateMemory: 16338944
Process: 4212 ServerManager PrivateMemory: 97263616
Process: 4276 SearchUI PrivateMemory: 15568896
Process: 4904 WmiPrvSE PrivateMemory: 1957888
Process: 5716 dhandler PrivateMemory: 14733312
Process: 5816 WaSecAgentProv PrivateMemory: 688128
Process: 5880 cmd PrivateMemory: 1765376
Process: 5888 conhost PrivateMemory: 5382144
Process: 5952 conhost PrivateMemory: 1171456
Process: 4672 msdtc PrivateMemory: 2732032
Process: 4240 python PrivateMemory: 5480448
Process: 4792 python PrivateMemory: 14544896
Process: 3600 python PrivateMemory: 27873280
Process: 4164 WindowsAzureNetAgent PrivateMemory: 1929216
Process: 4200 VFPlugin PrivateMemory: 2756608
Process: 4364 conhost PrivateMemory: 1355776
Process: 3108 MpCmdRun PrivateMemory: 2871296
Process: 2180 taskhostw PrivateMemory: 1159168
Process: 3296 cmd PrivateMemory: 2207744
Process: 6072 dotnet PrivateMemory: 194486272
Process: 4788 dotnet PrivateMemory: 2191015936
Process: 3104 dotnet PrivateMemory: 2084864
Image Name PID Services
========================= ======== ============================================
svchost.exe 688 BrokerInfrastructure, DcomLaunch, LSM,
PlugPlay, Power, SystemEventsBroker
svchost.exe 744 RpcEptMapper, RpcSs
svchost.exe 892 TermService
svchost.exe 900 Appinfo, CertPropSvc, DsmSvc, gpsvc,
iphlpsvc, lfsvc, ProfSvc, sacsvr, Schedule,
SENS, SessionEnv, ShellHWDetection, Themes,
UserManager, Winmgmt, WpnService, wuauserv
svchost.exe 964 Dhcp, EventLog, lmhosts, TimeBrokerSvc,
vmictimesync
svchost.exe 992 NcbService, PcaSvc, TrkWks, UALSVC,
UmRdpService, vmickvpexchange,
vmicshutdown, vmicvss, WdiSystemHost
svchost.exe 376 BFE, CoreMessagingRegistrar, DPS, MpsSvc,
pla
svchost.exe 348 CDPSvc, EventSystem, FontCache, netprofm,
nsi, W32Time, WinHttpAutoProxySvc
svchost.exe 1092 vmicheartbeat, vmicrdv
svchost.exe 1308 CryptSvc, Dnscache, LanmanWorkstation,
NlaSvc, WinRM
svchost.exe 1532 Wcmsvc
svchost.exe 2008 AppHostSvc
svchost.exe 2024 DiagTrack
svchost.exe 1264 LanmanServer
svchost.exe 320 W3SVC, WAS
svchost.exe 2064 StateRepository, tiledatamodelsvc
svchost.exe 3244 CDPUserSvc_549b4, OneSyncSvc_549b4
svchost.exe 3660 AppReadiness
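(For reference, this image-name/PID/services mapping matches the output format of `tasklist /svc`, which is presumably how it was captured.)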
Timed out at 3/20/2020 9:03:40 PM after 60000ms waiting for remote process.
Wrote mini dump to: C:\h\w\B4F709DF\w\A5E608BA\uploads\4788.3sdz2mzz.mdg.dmp
Process ID: 4788
Handle: 1736
Name: dotnet
MainModule: C:\h\w\B4F709DF\p\dotnet.exe
StartTime: 3/20/2020 9:01:47 PM
TotalProcessorTime: 00:00:04.8750000
Stack Trace:
/_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(219,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
/_/src/libraries/System.Runtime.Extensions/tests/System/Environment.SetEnvironmentVariable.cs(118,0): at System.Tests.SetEnvironmentVariable.EnvironmentVariableValueTooLarge_Throws()
https://dev.azure.com/dnceng/public/_build/results?buildId=567555&view=logs
and this is the dump: https://helix.dot.net/api/2019-06-17/jobs/a67b3e67-8ba0-47bd-b6b3-f5c69172fed6/workitems/System.Runtime.Extensions.Tests/files/4788.3sdz2mzz.mdg.dmp
So the top processes consuming memory are:
PID | Process | Private Memory (bytes) |
---|---|---|
900 | svchost | 5726212096 |
4788 | dotnet | 2191015936 |
6072 | dotnet | 194486272 |
1608 | MsMpEng | 115113984 |
4212 | ServerManager | 97263616 |
2020 | WindowsAzureTelemetryService | 57057280 |
PID 900 is running these services (per the service mapping above): Appinfo, CertPropSvc, DsmSvc, gpsvc, iphlpsvc, lfsvc, ProfSvc, sacsvr, Schedule, SENS, SessionEnv, ShellHWDetection, Themes, UserManager, Winmgmt, WpnService, wuauserv.
So one of these is misbehaving.
I took a snapshot in https://github.com/dotnet/runtime/pull/33905 of what this service process contains under normal circumstances:
AeLookupSvc, Appinfo, BITS, CertPropSvc,
gpsvc, iphlpsvc, LanmanServer, ProfSvc,
sacsvr, Schedule, SENS, SessionEnv,
ShellHWDetection, Themes, Winmgmt, wuauserv
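A quick way to compute the difference (a hedged sketch; the two service lists are copied verbatim from the snapshots above):

```csharp
using System;
using System.Linq;

class ServiceDiff
{
    static void Main()
    {
        // Services hosted by svchost PID 900 in the failing dump.
        string[] failing = { "Appinfo", "CertPropSvc", "DsmSvc", "gpsvc", "iphlpsvc",
            "lfsvc", "ProfSvc", "sacsvr", "Schedule", "SENS", "SessionEnv",
            "ShellHWDetection", "Themes", "UserManager", "Winmgmt", "WpnService",
            "wuauserv" };

        // Services hosted by the same process under normal circumstances.
        string[] normal = { "AeLookupSvc", "Appinfo", "BITS", "CertPropSvc", "gpsvc",
            "iphlpsvc", "LanmanServer", "ProfSvc", "sacsvr", "Schedule", "SENS",
            "SessionEnv", "ShellHWDetection", "Themes", "Winmgmt", "wuauserv" };

        // Prints DsmSvc, lfsvc, UserManager, WpnService.
        foreach (string s in failing.Except(normal, StringComparer.OrdinalIgnoreCase))
            Console.WriteLine(s);
    }
}
```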
So the following services, which are not running under normal circumstances, are the likely culprits: DsmSvc (Device Setup Manager), lfsvc (Geolocation Service), UserManager, and WpnService.
@MattGal Would it be possible to find out how many CI test machines have the Device Setup Manager and Geolocation Service services running, and what is starting these services on those machines?
Geolocation Service in particular is not something I would expect to be running on CI test machines.
@dotnet/dnceng can you check to see why Device Setup Manager and Geolocation service are running on these machines?
I will have FR take a look at this today. (Issue dotnet/core-eng#9444)
> @dotnet/dnceng can you check to see why Device Setup Manager and Geolocation service are running on these machines?
Couple notes here:
System.Xml.Linq.xNodeBuilder.Tests
System.ComponentModel.Composition.Registration.Tests
System.Globalization.Calendars.Tests
System.Net.Http.Functional.Tests
System.Net.Primitives.Pal.Tests
System.Reflection.CoreCLR.Tests
In short, this is a pretty default Windows 10 machine running on a D2_v3 Azure VM (that's only 8 GB of RAM, which was never much for a Win10 server). I doubt there's much we can do in terms of investigation, but we could explicitly disable these services if we truly think they're memory hogs.
These services are best guess at what's causing the problem. We are not sure whether they are really the problem.
Does Helix have something like a "health check" of a machine that runs before a workitem starts executing? Can the health check look at the available memory before the workitem starts, and recycle the machine when it sees that the machine is in a bad state?
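A hedged sketch of such a check (the 90% threshold is my choice, and I'm assuming the "Memory load" number in the logs above corresponds to GlobalMemoryStatusEx's dwMemoryLoad):

```csharp
using System;
using System.Runtime.InteropServices;

static class HealthCheck
{
    [StructLayout(LayoutKind.Sequential)]
    private struct MEMORYSTATUSEX
    {
        public uint dwLength;
        public uint dwMemoryLoad; // percent of physical memory in use
        public ulong ullTotalPhys, ullAvailPhys, ullTotalPageFile, ullAvailPageFile,
                     ullTotalVirtual, ullAvailVirtual, ullAvailExtendedVirtual;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GlobalMemoryStatusEx(ref MEMORYSTATUSEX buffer);

    // Returns true when the machine should be recycled rather than given work.
    public static bool MachineNeedsRecycle()
    {
        var status = new MEMORYSTATUSEX { dwLength = (uint)Marshal.SizeOf<MEMORYSTATUSEX>() };
        return GlobalMemoryStatusEx(ref status) && status.dwMemoryLoad > 90;
    }
}
```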
From https://github.com/dotnet/corefx/pull/41753
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)
https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug
https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open
https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug
https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console
cc @ViktorHofer