dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610

Closed: ahsonkhan closed this issue 4 years ago

ahsonkhan commented 5 years ago

From https://github.com/dotnet/corefx/pull/41753

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================

C:\helix\work\workitem>"C:\helix\work\correlation\dotnet.exe" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonwindowstests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries [FAIL]
      Timed out at 10/18/2019 11:24:33 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp
        Process ID: 4708
        Handle: 912

      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(71,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 118.055s
----- end Fri 10/18/2019 11:25:06.32 ----- exit code 1 ----------------------------------------------------------

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec ~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw [FAIL]
      Timed out at 10/18/2019 11:07:51 AM after 60000ms waiting for remote process.
        Process ID: 2283
        Handle: 892

      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(106,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 77.472s
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
----- end Fri Oct 18 11:07:57 UTC 2019 ----- exit code 1 ----------------------------------------------------------

cc @ViktorHofer

stephentoub commented 5 years ago

@ahsonkhan, it should have uploaded a dump from the remote process: "Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp" Were you able to grab that? It looks like you restarted the legs.

danmoseley commented 5 years ago

Previously this was blocked by https://github.com/dotnet/core-eng/issues/7950 (https://github.com/dotnet/corefx/issues/41528)... if dumps work now, that's awesome.

stephentoub commented 5 years ago

@danmosemsft, the one with the dump is Windows.

danmoseley commented 5 years ago

@stephentoub ah, I just saw the RedHat above.

danmoseley commented 5 years ago

It looks like https://github.com/dotnet/core-eng/issues/7950 was recently closed, so I would hope to get a dump from that one too.

stephentoub commented 5 years ago

so I would hope to get a dump from that one too.

The dump that's relevant here is from the RemoteExecutor.Invoke child process; on Windows we now explicitly P/Invoke to MiniDumpWriteDump in order to save out a dump. That doesn't happen on Linux. https://github.com/dotnet/arcade/pull/4085
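
For illustration, a minimal sketch of that approach (not the actual RemoteExecutor code; the dump type, path handling, and error handling here are assumptions):

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class MiniDump
{
    // MINIDUMP_TYPE.MiniDumpWithFullMemory
    private const int MiniDumpWithFullMemory = 0x00000002;

    [DllImport("dbghelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeFileHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Write a minidump of the (still running) child process to 'path'.
    public static void Write(Process child, string path)
    {
        using FileStream fs = File.Create(path);
        if (!MiniDumpWriteDump(child.Handle, (uint)child.Id, fs.SafeFileHandle,
                               MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}
```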

danmoseley commented 5 years ago

Should we have an issue to find a way to do it on Linux? Presumably deploying and invoking createdump?
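
Roughly, such a hook might look like the sketch below; the createdump location next to the runtime binaries and the pid-only invocation are assumptions for illustration:

```csharp
using System.Diagnostics;
using System.IO;

static class LinuxDump
{
    // Best-effort dump of a hung child process on Linux by invoking the createdump
    // tool that ships with the runtime.
    public static void TryWrite(int childPid)
    {
        // Assumption: createdump sits next to the runtime binaries (e.g. libcoreclr.so).
        string runtimeDir = Path.GetDirectoryName(typeof(object).Assembly.Location)!;
        string createDump = Path.Combine(runtimeDir, "createdump");
        if (!File.Exists(createDump))
            return;

        using Process? p = Process.Start(new ProcessStartInfo(createDump, childPid.ToString()));
        p?.WaitForExit();
    }
}
```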

stephentoub commented 5 years ago

Sure :)

danmoseley commented 5 years ago

https://github.com/dotnet/arcade/issues/4153

ahsonkhan commented 5 years ago

Were you able to grab that? It looks like you restarted the legs.

No, sorry. Will remember to do that next time.

ViktorHofer commented 5 years ago

Here's the dump: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files/4708.wrrjax4w.c3b.dmp

ViktorHofer commented 5 years ago

Updated the link above (edited the post above, as the original link was wrong).

0:000> k
 # Child-SP          RetAddr           Call Site
00 00000005`6597c310 00007ffc`becda9b5 coreclr!memset_repmovs+0x9 [d:\agent\_work\2\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 66] 
01 00000005`6597c320 00007ffc`becda417 coreclr!WKS::gc_heap::adjust_limit_clr+0x155 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12087] 
02 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_segment_end_p+0x134 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 12864] 
03 (Inline Function) --------`-------- coreclr!WKS::gc_heap::soh_try_fit+0x1ad [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13002] 
04 00000005`6597c390 00007ffc`becda148 coreclr!WKS::gc_heap::allocate_small+0x247 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13067] 
05 (Inline Function) --------`-------- coreclr!WKS::gc_heap::try_allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13939] 
06 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0xcb [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369] 
07 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x116 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400] 
08 00000005`6597c490 00007ffc`bec54ee7 coreclr!WKS::GCHeap::Alloc+0x148 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35826] 
09 00000005`6597c4f0 00007ffc`bec54732 coreclr!Alloc+0x16f [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 251] 
0a 00000005`6597c550 00007ffc`bec6603c coreclr!AllocateSzArray+0xee [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 543] 
0b (Inline Function) --------`-------- coreclr!AllocateSzArray+0x19 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 421] 
0c 00000005`6597c5f0 00007ffc`bec64d55 coreclr!AllocateObjectArray+0x48 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 964] 
0d 00000005`6597c630 00007ffc`bec54020 coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::CheckGrowth+0xc9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 248] 
0e (Inline Function) --------`-------- coreclr!GCHeapHash<KeyToValuesGCHeapHashTraits<InliningInfoTrackerHashTraits> >::Add+0x9 [f:\workspace\_work\1\s\src\vm\gcheaphashtable.inl @ 438] 
0f 00000005`6597c660 00007ffc`bec5399e coreclr!CrossLoaderAllocatorHash<InliningInfoTrackerHashTraits>::Add+0x314 [f:\workspace\_work\1\s\src\vm\crossloaderallocatorhash.inl @ 389] 
10 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInliningDontTakeLock+0x5f3 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 664] 
11 (Inline Function) --------`-------- coreclr!JITInlineTrackingMap::AddInlining+0xa46 [f:\workspace\_work\1\s\src\vm\inlinetracking.cpp @ 648] 
12 (Inline Function) --------`-------- coreclr!Module::AddInlining+0xa6e [f:\workspace\_work\1\s\src\vm\ceeload.cpp @ 142] 
13 00000005`6597c740 00007ffc`d4649e8a coreclr!CEEInfo::reportInliningDecision+0x110e [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 8319] 
14 00000005`6597cd10 00007ffc`d45c6b13 clrjit!InlineResult::Report+0xde [f:\workspace\_work\1\s\src\jit\inline.cpp @ 749] 
15 (Inline Function) --------`-------- clrjit!InlineResult::{dtor}+0x9 [f:\workspace\_work\1\s\src\jit\inline.h @ 420] 
16 00000005`6597cd50 00007ffc`d45c5041 clrjit!Compiler::fgInline+0x273 [f:\workspace\_work\1\s\src\jit\flowgraph.cpp @ 22027] 
17 00000005`6597ced0 00007ffc`d4622d38 clrjit!Compiler::fgMorph+0x131 [f:\workspace\_work\1\s\src\jit\morph.cpp @ 16632] 
18 00000005`6597d1b0 00007ffc`d4631726 clrjit!Compiler::compCompile+0x368 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 4560] 
19 00000005`6597d2f0 00007ffc`d462fd7d clrjit!Compiler::compCompileHelper+0x296 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6099] 
1a 00000005`6597d3a0 00007ffc`d462fa33 clrjit!Compiler::compCompile+0x21d [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 5427] 
1b 00000005`6597d450 00007ffc`d462ecc2 clrjit!jitNativeCode+0x273 [f:\workspace\_work\1\s\src\jit\compiler.cpp @ 6727] 
1c 00000005`6597d610 00007ffc`bec79699 clrjit!CILJit::compileMethod+0x92 [f:\workspace\_work\1\s\src\jit\ee_il_dll.cpp @ 319] 
1d (Inline Function) --------`-------- coreclr!invokeCompileMethodHelper+0x63 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12464] 
1e 00000005`6597d680 00007ffc`bec7952c coreclr!invokeCompileMethod+0xdd [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12529] 
1f 00000005`6597d720 00007ffc`bec78dbf coreclr!CallCompileMethodWithSEHWrapper+0x50 [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 12583] 
20 00000005`6597d7c0 00007ffc`bec77d91 coreclr!UnsafeJitFunction+0xe1f [f:\workspace\_work\1\s\src\vm\jitinterface.cpp @ 13070] 
21 00000005`6597dbc0 00007ffc`bec77889 coreclr!MethodDesc::JitCompileCodeLocked+0x2e1 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 974] 
22 00000005`6597ddb0 00007ffc`bec771f4 coreclr!MethodDesc::JitCompileCodeLockedEventWrapper+0x33d [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 830] 
23 00000005`6597def0 00007ffc`becbbdca coreclr!MethodDesc::JitCompileCode+0x194 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 769] 
24 00000005`6597df90 00007ffc`bed595a5 coreclr!MethodDesc::PrepareILBasedCode+0xfa [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 398] 
25 (Inline Function) --------`-------- coreclr!MethodDesc::PrepareCode+0xc [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 330] 
26 00000005`6597dfc0 00007ffc`becb9d06 coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x121 [f:\workspace\_work\1\s\src\vm\codeversion.cpp @ 1800] 
27 00000005`6597e140 00007ffc`becb9815 coreclr!MethodDesc::DoPrestub+0x2b6 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1972] 
28 00000005`6597e210 00007ffc`bed4db95 coreclr!PreStubWorker+0x455 [f:\workspace\_work\1\s\src\vm\prestub.cpp @ 1798] 
29 00000005`6597e420 00007ffc`5f264362 coreclr!ThePreStub+0x55 [F:\workspace\_work\1\s\src\vm\amd64\ThePreStubAMD64.asm @ 22] 
2a 00000005`6597e4d0 000001c6`00010fa8 0x00007ffc`5f264362
2b 00000005`6597e4d8 00000005`6597e518 0x000001c6`00010fa8
2c 00000005`6597e4e0 000001c6`0000cc90 0x00000005`6597e518
2d 00000005`6597e4e8 00000000`ffffffff 0x000001c6`0000cc90
2e 00000005`6597e4f0 00000000`0000003e 0xffffffff
2f 00000005`6597e4f8 00000000`00000003 0x3e
30 00000005`6597e500 00000000`00000000 0x3

ViktorHofer commented 5 years ago

cc @jkotas

ViktorHofer commented 5 years ago

And for learning purposes, to find the uploaded files (i.e. after a retry) you can use the Helix API directly: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files

This will list all uploaded files, in this case I searched for the ".dmp" file and added it to the url: "/4708.wrrjax4w.c3b.dmp".
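
A small sketch of driving that API from code; the URLs are the ones above, and since the shape of the listing response isn't assumed here, it is just printed raw:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ListHelixFiles
{
    static async Task Main()
    {
        const string filesUrl =
            "https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb" +
            "/workitems/System.Text.RegularExpressions.Tests/files";

        using var http = new HttpClient();

        // List every file uploaded by the work item.
        Console.WriteLine(await http.GetStringAsync(filesUrl));

        // Append a file name from the listing to download it, e.g. the minidump.
        byte[] dump = await http.GetByteArrayAsync(filesUrl + "/4708.wrrjax4w.c3b.dmp");
        await File.WriteAllBytesAsync("4708.wrrjax4w.c3b.dmp", dump);
    }
}
```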

stephentoub commented 5 years ago

Thanks, @ViktorHofer.

This is happening early on in the invocation of the remote process, as we're looking up the method to invoke: [VS screenshot showing the managed frames and their arguments]

jkotas commented 5 years ago

coreclr!memset_repmovs+0x9

This seems to be stuck on a page-in request in the OS. I do not see anything wrong on the runtime side.

The GC is trying to clear memory from 0x000001c600011398 to 0x000001c6000139d8. The clearing is stuck at address 0x000001c600013000 (i.e., a page boundary).

We need to get more dumps for this to see the pattern.

jkotas commented 5 years ago

From just looking at this dump, my bet would be on infrastructure problem: The physical machine is over-subscribed and super slow.

jkotas commented 5 years ago

we're looking up the method to invoke

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of a GC that happened to be triggered while we're looking up the method to invoke.

stephentoub commented 5 years ago

Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of a GC that happened to be triggered while we're looking up the method to invoke.

Right, understood. I was pasting the image from VS to show the managed frames that were obscured in the windbg stack trace as well as the arguments to those methods.

danmoseley commented 5 years ago

Odd that we've had 3 hangs (in the remote executor, in Regex cache tests). I'd have thought a slow machine might give a more distributed set of hangs...

jkotas commented 5 years ago

They could each have a different root cause. Do you happen to have dumps for the other 2 hangs?

ViktorHofer commented 5 years ago

We don't have more dumps yet, and we only produce them on Windows for these kinds of failures.

safern commented 5 years ago

This failed again just now on Windows, here's the dump: https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp

jkotas commented 5 years ago

https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp

We are in the middle of JITing System.Text.RegularExpressions.Regex.InitDefaultMatchTimeout, called from the Regex static constructor. The thread is not blocked for any obvious reason.

So the common theme is: the process gets stuck very early during startup, with nothing obviously wrong on the runtime side.

Keep more dumps coming...

jkotas commented 5 years ago

Another hit in https://github.com/dotnet/coreclr/pull/27375, in the System.Transactions tests. The process is stuck early during startup inside System.Transactions.TransactionsEtwProvider. This hit was on a checked build of the runtime, which allows us to see what happened in the past using the stress log.

The full stress log is here: log.txt

jkotas commented 5 years ago

The interesting bits of the stress log are:

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
173c 153.830905200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 20006be = 00007FFD94C9C678 (System.Globalization.UnicodeCategory) 
173c 152.449501000 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 200037c = 00007FFD94AB3200 (System.Runtime.Intrinsics.Vector128`1) 

1108 110.031347100 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108 110.031328100 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108 110.031286600 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108 110.031261900 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 3, (requested generation = 2, collect_classes = 9) ==========

1108 110.031259000 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108 110.031258700 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108 110.031258500 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108 110.031258300 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108 110.031258000 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108 110.031257700 : `GC`                 Start of GC stack 

1108  61.804306200 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  61.804288500 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  61.804249300 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  61.804225600 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 2, (requested generation = 1, collect_classes = 9) ==========

1108  61.804222700 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  61.804222500 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  61.804222300 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  61.804222000 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  61.804221800 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  61.804221600 : `GC`                 Start of GC stack 

1108  13.291604900 : `GCROOTS`            Scanning Frameless method 00007FFD94A7A918 (System.AppContext.Setup(Char**, Char**, Int32)) ControlPC = 00007FFE16EC2706
1108  13.291578000 : `GCROOTS`            Scanning Frameless method 00007FFD949F64C8 (System.String.Ctor(Char*)) ControlPC = 00007FFE16E93EC9
1108  13.291504100 : `GCROOTS`            Scanning ExplicitFrame 00000082F1F7E3E8 AssocMethod = 00007FFD94AB2468 (System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)) frameVTable = 00007FFDF467BEB0 (coreclr!PrestubMethodFrame::`vftable')
1108  13.291459400 : `GC`GCALLOC`GCROOTS` } =========== BEGINGC 1, (requested generation = 1, collect_classes = 9) ==========

1108  13.291456300 : `GC`                 (GC stack)00007FFDF3E0A07C (coreclr!FinalizerThread::FinalizerThreadWorker+0x15c)
1108  13.291456000 : `GC`                 (GC stack)00007FFDF3E0A84F (coreclr!FinalizerThread::WaitForFinalizerEvent+0x10f)
1108  13.291455800 : `GC`                 (GC stack)00007FFDF42A2546 (coreclr!WKS::GCHeap::GarbageCollect+0x106)
1108  13.291455500 : `GC`                 (GC stack)00007FFDF42A2829 (coreclr!WKS::GCHeap::GarbageCollectGeneration+0x295)
1108  13.291455300 : `GC`                 (GC stack)00007FFDF42B644F (coreclr!WKS::gc_heap::garbage_collect+0x213)
1108  13.291455100 : `GC`                 Start of GC stack 
1108  13.291396100 : `GC`                 GC is triggered 

173c   0.335993200 : `CLASSLOADER`        MethodTableBuilder: finished method table for module 00007FFD947D4020 token 2000175 = 00007FFD94AB2D30 (System.SpanHelpers) 

jkotas commented 5 years ago

This confirms my theory that the machine is oversubscribed when this hang/timeout happens. We need to find out what else is going on on the machine that is consuming all the resources.

jkotas commented 5 years ago

https://github.com/dotnet/arcade/pull/4182

danmoseley commented 5 years ago

Another one on RHEL7. No dump because of https://github.com/dotnet/arcade/issues/4153, but for the record: netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_culture_and_options

https://dev.azure.com/dnceng/public/_build/results?buildId=398811&view=ms.vss-test-web.build-test-results-tab&runId=12445240&resultId=175274&paneView=debug

jkotas commented 5 years ago

Repro with the more verbose logging from https://github.com/dotnet/coreclr/pull/27554: console.0680a4aa (1).txt

jkotas commented 5 years ago

The machine memory load is 94%, which is way too much. The top processes consuming memory:

Process Private Memory
svchost 8,848,990,208
dotnet 3,262,791,680
MsMpEng 136,298,496
ServerManager 99,528,704
WindowsAzureTelemetryService 55,070,720
WindowsAzureGuestAgent 47,337,472
WaAppAgent 46,862,336
python 24,686,592

Looks like a svchost process got out of control.

jkotas commented 5 years ago

I assume that the Helix machines are recycled without rebooting. Is there a way to find out what ran on the Helix machine before? I suspect that there is a misbehaving test workitem that leaves the machine in a bad state.

@dotnet/dnceng Could you please help here? We want to find out what is causing svchost memory consumption to go through the roof on a Helix machine. Would it be possible to log the machine memory consumption before and after the workitem and identify jobs that make the working set of any svchost process go above 1 GB?
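
A minimal sketch of the kind of check being asked for (not what Helix actually runs); it just enumerates svchost processes and reports any whose working set is above 1 GB, with the threshold and output format as illustrative assumptions:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class SvchostCheck
{
    static void Main()
    {
        const long OneGB = 1L << 30;

        // Flag any svchost instance whose working set exceeds the 1 GB threshold.
        foreach (Process p in Process.GetProcessesByName("svchost")
                                     .Where(p => p.WorkingSet64 > OneGB))
        {
            Console.WriteLine($"svchost PID {p.Id}: working set {p.WorkingSet64:N0} bytes");
        }
    }
}
```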

MattGal commented 5 years ago

@jkotas correct, and we do make an earnest effort to kill any processes started as part of a run. For any given work item it's trivial to query every work item that ran on that machine and I will look into this specific one shortly.

MattGal commented 5 years ago

@jkotas that machine ran precisely two work items between being created and the RegEx test.

Logs: ConsoleUri https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.Runtime.Handles.Tests/console.ba534629.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.ComponentModel.Composition.Registration.Tests/console.fad9ebf4.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl

Kusto query:

WorkItems
| where QueueName == "windows.10.amd64.open"
| where MachineName  == "b000MOS"
| order by Finished asc

So... it's possible those other two work items spawned unnoticed, somehow-unkillable processes (I'll check their harness logs after my next meeting) that consumed all 7 GB of RAM on this machine, but since there are only two, it seems unlikely.

MattGal commented 5 years ago

Harness logs were pretty boring too: LogUri https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.ComponentModel.Composition.Registration.Tests/ba80c381-36ae-4daa-a1c2-4dd7a7bc3770.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-coreclr-refs-pull-27554-merge-75efd54249d343ebae/System.Runtime.Handles.Tests/e2a27f45-aa3e-4cae-a895-a2ffdcd4bbf2.log?sv=2018-03-28&sr=c&sig=fgt2qYutHNBZIKtCdkhSs%2FupTo8ZDEuZTT0lOTCizl4%3D&se=2019-11-09T18%3A13%3A05Z&sp=rl

jkotas commented 5 years ago

I have done an experiment to capture the memory utilization on Helix machine when running RegularExpressions tests. The top processes consuming memory:

Process Private Memory
dotnet 3,271,024,640.00
ServerManager 97,714,176.00
WindowsAzureTelemetryService 53,587,968.00
WaAppAgent 47,771,648.00
WindowsAzureGuestAgent 47,554,560.00
explorer 35,155,968.00
python 26,943,488.00
svchost 18,296,832.00

So everything is very similar to the failure case, except the 9GB svchost.

MattGal commented 5 years ago

@jkotas I'm having a hard time inferring whether there's a problem here (that list seems normal to me based on what I know is running there... the Azure stuff is non-negotiable, and Server Manager running on startup is default Windows Server behavior). Can you clarify if you think there's an infra-related problem?

jkotas commented 5 years ago

I am adding extra instrumentation to figure out which service is eating the memory: https://github.com/dotnet/arcade/pull/4253

It will require a few more peeling-the-onion iterations to find the root cause. My bet is that the root cause will be infrastructure related.

MattGal commented 5 years ago

I am adding extra instrumentation to figure out which service is eating the memory: dotnet/arcade#4253

It will require a few more peeling-the-onion iterations to find the root cause. My bet is that the root cause will be infrastructure related.

I'd be surprised, but I'm following this thread eagerly and will get an issue created to update the machines if you find something. We don't actually install any Windows services on that queue so I'd be surprised to see that be the root cause.

safern commented 4 years ago

@jkotas it seems like this test run hit a timeout in the remote process, and it dumped the logging you added:

    System.Tests.SetEnvironmentVariable.EnvironmentVariableValueTooLarge_Throws [FAIL]
      Half-way through waiting for remote process.
      Memory load: 99
      Process: 0 Idle PrivateMemory: 0
      Process: 4 System PrivateMemory: 131072
      Process: 296 smss PrivateMemory: 409600
      Process: 384 csrss PrivateMemory: 1937408
      Process: 464 wininit PrivateMemory: 962560
      Process: 472 csrss PrivateMemory: 1974272
      Process: 524 winlogon PrivateMemory: 1953792
      Process: 596 services PrivateMemory: 3325952
      Process: 604 lsass PrivateMemory: 5713920
      Process: 688 svchost PrivateMemory: 5955584
      Process: 744 svchost PrivateMemory: 4235264
      Process: 864 dwm PrivateMemory: 17993728
      Process: 892 svchost PrivateMemory: 4411392
      Process: 900 svchost PrivateMemory: 5726212096
      Process: 964 svchost PrivateMemory: 15978496
      Process: 992 svchost PrivateMemory: 11792384
      Process: 376 svchost PrivateMemory: 11730944
      Process: 348 svchost PrivateMemory: 7675904
      Process: 1092 svchost PrivateMemory: 3018752
      Process: 1308 svchost PrivateMemory: 7344128
      Process: 1316 VSSVC PrivateMemory: 1572864
      Process: 1532 svchost PrivateMemory: 1712128
      Process: 1960 spoolsv PrivateMemory: 5685248
      Process: 2008 svchost PrivateMemory: 3768320
      Process: 2024 svchost PrivateMemory: 7680000
      Process: 1264 svchost PrivateMemory: 2101248
      Process: 1608 MsMpEng PrivateMemory: 115113984
      Process: 320 svchost PrivateMemory: 4354048
      Process: 1804 WaAppAgent PrivateMemory: 46907392
      Process: 1668 WindowsAzureGuestAgent PrivateMemory: 43810816
      Process: 2020 WindowsAzureTelemetryService PrivateMemory: 57057280
      Process: 2064 svchost PrivateMemory: 7237632
      Process: 2092 sqlwriter PrivateMemory: 1744896
      Process: 3164 RuntimeBroker PrivateMemory: 8122368
      Process: 3224 rundll32 PrivateMemory: 3051520
      Process: 3244 svchost PrivateMemory: 3911680
      Process: 3268 taskhostw PrivateMemory: 4669440
      Process: 3364 sihost PrivateMemory: 4149248
      Process: 3572 explorer PrivateMemory: 22745088
      Process: 3660 svchost PrivateMemory: 2891776
      Process: 860 rundll32 PrivateMemory: 3244032
      Process: 1708 ShellExperienceHost PrivateMemory: 16338944
      Process: 4212 ServerManager PrivateMemory: 97263616
      Process: 4276 SearchUI PrivateMemory: 15568896
      Process: 4904 WmiPrvSE PrivateMemory: 1957888
      Process: 5716 dhandler PrivateMemory: 14733312
      Process: 5816 WaSecAgentProv PrivateMemory: 688128
      Process: 5880 cmd PrivateMemory: 1765376
      Process: 5888 conhost PrivateMemory: 5382144
      Process: 5952 conhost PrivateMemory: 1171456
      Process: 4672 msdtc PrivateMemory: 2732032
      Process: 4240 python PrivateMemory: 5480448
      Process: 4792 python PrivateMemory: 14544896
      Process: 3600 python PrivateMemory: 27873280
      Process: 4164 WindowsAzureNetAgent PrivateMemory: 1929216
      Process: 4200 VFPlugin PrivateMemory: 2756608
      Process: 4364 conhost PrivateMemory: 1355776
      Process: 3108 MpCmdRun PrivateMemory: 2871296
      Process: 2180 taskhostw PrivateMemory: 1159168
      Process: 3296 cmd PrivateMemory: 2207744
      Process: 6072 dotnet PrivateMemory: 194486272
      Process: 4788 dotnet PrivateMemory: 2191015936
      Process: 3104 dotnet PrivateMemory: 2084864

      Image Name                     PID Services                                    
      ========================= ======== ============================================
      svchost.exe                    688 BrokerInfrastructure, DcomLaunch, LSM,      
                                         PlugPlay, Power, SystemEventsBroker         
      svchost.exe                    744 RpcEptMapper, RpcSs                         
      svchost.exe                    892 TermService                                 
      svchost.exe                    900 Appinfo, CertPropSvc, DsmSvc, gpsvc,        
                                         iphlpsvc, lfsvc, ProfSvc, sacsvr, Schedule, 
                                         SENS, SessionEnv, ShellHWDetection, Themes, 
                                         UserManager, Winmgmt, WpnService, wuauserv  
      svchost.exe                    964 Dhcp, EventLog, lmhosts, TimeBrokerSvc,     
                                         vmictimesync                                
      svchost.exe                    992 NcbService, PcaSvc, TrkWks, UALSVC,         
                                         UmRdpService, vmickvpexchange,              
                                         vmicshutdown, vmicvss, WdiSystemHost        
      svchost.exe                    376 BFE, CoreMessagingRegistrar, DPS, MpsSvc,   
                                         pla                                         
      svchost.exe                    348 CDPSvc, EventSystem, FontCache, netprofm,   
                                         nsi, W32Time, WinHttpAutoProxySvc           
      svchost.exe                   1092 vmicheartbeat, vmicrdv                      
      svchost.exe                   1308 CryptSvc, Dnscache, LanmanWorkstation,      
                                         NlaSvc, WinRM                               
      svchost.exe                   1532 Wcmsvc                                      
      svchost.exe                   2008 AppHostSvc                                  
      svchost.exe                   2024 DiagTrack                                   
      svchost.exe                   1264 LanmanServer                                
      svchost.exe                    320 W3SVC, WAS                                  
      svchost.exe                   2064 StateRepository, tiledatamodelsvc           
      svchost.exe                   3244 CDPUserSvc_549b4, OneSyncSvc_549b4          
      svchost.exe                   3660 AppReadiness                                
      Timed out at 3/20/2020 9:03:40 PM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\B4F709DF\w\A5E608BA\uploads\4788.3sdz2mzz.mdg.dmp
        Process ID: 4788
        Handle: 1736
        Name: dotnet
        MainModule: C:\h\w\B4F709DF\p\dotnet.exe
        StartTime: 3/20/2020 9:01:47 PM
        TotalProcessorTime: 00:00:04.8750000

      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(219,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/libraries/System.Runtime.Extensions/tests/System/Environment.SetEnvironmentVariable.cs(118,0): at System.Tests.SetEnvironmentVariable.EnvironmentVariableValueTooLarge_Throws()

https://dev.azure.com/dnceng/public/_build/results?buildId=567555&view=logs

and this is the dump: https://helix.dot.net/api/2019-06-17/jobs/a67b3e67-8ba0-47bd-b6b3-f5c69172fed6/workitems/System.Runtime.Extensions.Tests/files/4788.3sdz2mzz.mdg.dmp

jkotas commented 4 years ago

So the top processes consuming memory are:

PID Process Private Memory
900 svchost 5726212096
4788 dotnet 2191015936
6072 dotnet 194486272
1608 MsMpEng 115113984
4212 ServerManager 97263616
2020 WindowsAzureTelemetryService 57057280

jkotas commented 4 years ago

PID 900 is running these services: Appinfo, CertPropSvc, DsmSvc, gpsvc, iphlpsvc, lfsvc, ProfSvc, sacsvr, Schedule, SENS, SessionEnv, ShellHWDetection, Themes, UserManager, Winmgmt, WpnService, wuauserv.

So one of these is misbehaving.

jkotas commented 4 years ago

I took a snapshot in https://github.com/dotnet/runtime/pull/33905 of what this service process contains under normal circumstances:

AeLookupSvc, Appinfo, BITS, CertPropSvc,    
gpsvc, iphlpsvc, LanmanServer, ProfSvc,     
sacsvr, Schedule, SENS, SessionEnv,         
ShellHWDetection, Themes, Winmgmt, wuauserv  

So the likely culprits are the services that are not running under normal circumstances: DsmSvc (Device Setup Manager), lfsvc (Geolocation Service), UserManager, and WpnService.
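
The comparison amounts to a set difference between the two service lists quoted in this thread; a tiny sketch:

```csharp
using System;
using System.Linq;

class ServiceDiff
{
    static void Main()
    {
        // Services hosted by svchost PID 900 in the failing snapshot
        // (from the tasklist output quoted earlier in this thread).
        string[] failing =
        {
            "Appinfo", "CertPropSvc", "DsmSvc", "gpsvc", "iphlpsvc", "lfsvc", "ProfSvc",
            "sacsvr", "Schedule", "SENS", "SessionEnv", "ShellHWDetection", "Themes",
            "UserManager", "Winmgmt", "WpnService", "wuauserv"
        };

        // The same service group under normal circumstances (dotnet/runtime#33905 snapshot).
        string[] healthy =
        {
            "AeLookupSvc", "Appinfo", "BITS", "CertPropSvc", "gpsvc", "iphlpsvc",
            "LanmanServer", "ProfSvc", "sacsvr", "Schedule", "SENS", "SessionEnv",
            "ShellHWDetection", "Themes", "Winmgmt", "wuauserv"
        };

        // Services present only in the failing case: DsmSvc, lfsvc, UserManager, WpnService.
        foreach (string s in failing.Except(healthy, StringComparer.OrdinalIgnoreCase))
            Console.WriteLine(s);
    }
}
```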

jkotas commented 4 years ago

@MattGal Would it be possible to find out how many CI test machines have the Device Setup Manager and Geolocation Service services running, and what is starting these services on those machines?

jkotas commented 4 years ago

Geolocation Service in particular is not something I would expect to be running on CI test machines.

Chrisboh commented 4 years ago

@dotnet/dnceng can you check to see why Device Setup Manager and Geolocation service are running on these machines?

ilyas1974 commented 4 years ago

I will have FR take a look at this today. (Issue dotnet/core-eng/#9444)

MattGal commented 4 years ago

@dotnet/dnceng can you check to see why Device Setup Manager and Geolocation service are running on these machines?

Couple notes here:

System.Xml.Linq.xNodeBuilder.Tests
System.ComponentModel.Composition.Registration.Tests
System.Globalization.Calendars.Tests
System.Net.Http.Functional.Tests
System.Net.Primitives.Pal.Tests
System.Reflection.CoreCLR.Tests

In short, this is a pretty default Windows 10 machine running on a D2_v3 Azure VM (that's only 8 GB of RAM, which was never much for a Win10 server). I doubt there's much we can do in terms of investigation, but we could explicitly disable these services if we truly think they're memory hogs.

jkotas commented 4 years ago

These services are a best guess at what's causing the problem. We are not sure whether they are really the problem.

Does Helix have something like a "health check" that runs on a machine before a work item starts executing? Could the health check look at the available memory before the work item starts, and recycle the machine when it sees that the machine is in a bad state?
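
A hedged sketch of what such a pre-work-item check could look like on Windows, using the same machine-wide memory load that the RemoteExecutor logging above reports; the 90% threshold and the decision to skip rather than recycle are illustrative assumptions:

```csharp
using System;
using System.Runtime.InteropServices;

static class MemoryHealthCheck
{
    [StructLayout(LayoutKind.Sequential)]
    private struct MEMORYSTATUSEX
    {
        public uint dwLength;
        public uint dwMemoryLoad;   // percentage of physical memory in use, as logged above
        public ulong ullTotalPhys;
        public ulong ullAvailPhys;
        public ulong ullTotalPageFile;
        public ulong ullAvailPageFile;
        public ulong ullTotalVirtual;
        public ulong ullAvailVirtual;
        public ulong ullAvailExtendedVirtual;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool GlobalMemoryStatusEx(ref MEMORYSTATUSEX lpBuffer);

    // Returns false when the machine-wide memory load is above the threshold,
    // i.e. when a work item probably should not start on this machine.
    public static bool MachineIsHealthy(uint maxMemoryLoadPercent = 90)
    {
        var status = new MEMORYSTATUSEX { dwLength = (uint)Marshal.SizeOf<MEMORYSTATUSEX>() };
        if (!GlobalMemoryStatusEx(ref status))
            return true; // cannot tell; do not block the work item

        return status.dwMemoryLoad <= maxMemoryLoadPercent;
    }
}
```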