Open jacoballen73 opened 3 years ago
@jacoballen73 thank you for the report! If you still have the dump available, please open a Visual Studio feedback item (this can be done in VS using the "Send Feedback" option) and then attach the appropriate diagnostics artifacts to the ticket. This might give us some pointers to what was causing the issue.
FYI @hoyosjs @josalem
Just submitted the feedback with the attached .dmp file: https://developercommunity.visualstudio.com/t/Dump-file-attachment-for-dotnet-issue-2/1568899 Hopefully you are able to access the file from this.
This is an older non-investigated issue. We will investigate the dump and then take the next appropriate action after that.
@jacoballen73 I'm just letting you know that we are investigating this issue now. Sorry for the delay.
@jacoballen73 Unfortunately my Linux distro doesn't match the dump, so lldb isn't able to walk the native call stacks. Do you happen to know which Docker container would match the OS the dump was collected on?
Here's the info I was able to pull from the dump. Initial takeaways are:
(lldb) print g_fSuspensionPending
(Volatile<int>) $24 = (m_val = 1)
(lldb) clrthreads
ThreadCount: 100
UnstartedThread: 0
BackgroundThread: 99
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Lock Count Apt Exception
1 1 1 00005C5D475235A0 2020020 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
61 2 41 00005C5D4787EAD0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Finalizer)
63 3 43 0000765574000C50 1020220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
65 6 48 00005C5D47B99D40 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
66 19 57 0000765658001530 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (GC)
67 20 58 0000765658002FB0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
68 21 59 0000765658004C70 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
69 22 5a 0000765658006980 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
70 23 5b 00007656580086F0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
71 24 5c 000076565800A460 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
72 25 5d 000076565800C1D0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
73 26 5e 000076565800DF40 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
74 27 5f 000076565800FCB0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
75 28 60 0000765658011A20 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
76 29 61 0000765658013790 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
77 30 62 0000765658015500 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
78 31 63 0000765658017270 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
79 32 64 0000765658018FE0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
80 33 65 000076565801AF60 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
81 34 66 000076565801CCD0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
82 35 67 000076565801EA40 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
83 36 68 00007656580207B0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
84 37 69 0000765658022520 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
85 38 6a 0000765658024290 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
86 39 6b 0000765658026000 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
87 40 6c 0000765658027D70 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
88 41 6d 0000765658029AE0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
89 42 6e 000076565802B850 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
90 43 6f 000076565802D5C0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
91 44 70 000076565802F330 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
92 45 71 00007656580310A0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
93 46 72 0000765658032E10 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
94 47 73 0000765658034B80 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
95 48 74 00007656580368F0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
96 49 75 0000765658038660 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
97 50 76 000076565803A3D0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
98 51 77 000076565803C140 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
99 52 78 000076565803DEB0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
100 53 79 000076565803FC20 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
101 54 7a 0000765658041990 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
102 55 7b 0000765658043700 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
103 56 7c 0000765658045470 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
104 57 7d 00007656580471E0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
105 58 7e 0000765658048F50 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
106 59 7f 000076565804ACC0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
107 60 80 000076565804CA30 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
108 61 81 000076565804E7A0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
109 62 82 0000765658050510 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
110 63 83 0000765658052280 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
111 64 84 0000765658053FF0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
112 65 85 0000765658056170 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
113 66 86 0000765658057EE0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
114 67 87 0000765658059C50 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
115 68 88 000076565805B9C0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
116 69 89 000076565805D730 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
117 70 8a 000076565805F4A0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
118 71 8b 0000765658061210 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
119 72 8c 0000765658062F80 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
120 73 8d 0000765658064CF0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
121 74 8e 0000765658066A60 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
122 262 14c 00007651F80061F0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
123 263 14d 00007651780421F0 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
124 264 14e 0000765274009740 21220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn
125 315 6d43 00007652DC00B680 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
126 101 6d6a 000076507C00A5A0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
127 148 6d6b 000076518801B680 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
128 87 6d72 000076524001F4E0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
129 301 6d75 00007655440014C0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
130 220 6d7b 0000765188017F70 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
131 152 6d85 00007651F800DDE0 3021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
132 5 6d89 00007652D8015A30 1021220 Preemptive 00007659F6C475C8:00007659F6C47E40 00005C5D475CE4F0 1 Ukn (Threadpool Worker)
135 97 6d8e 000076540C0032A0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
134 144 6d8d 0000765318014820 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
133 313 6d8a 00007650F4008990 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
136 90 6d90 00007653F8000EA0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
137 184 6d92 000076555030E7F0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
138 11 6d93 000076507C00E980 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
139 307 6d96 0000765564004950 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
140 84 6d98 000076537800BAB0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
144 310 6da0 00007652400222A0 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
142 155 6d9e 00007653FC002600 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
143 311 6d9f 00007651D8002040 1021220 Preemptive 0000765A16789FD0:0000765A16789FF0 00005C5D475CE4F0 1 Ukn (Threadpool Worker)
141 324 6d9d 000076556415CA10 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
147 108 6da9 0000765384009510 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
146 179 6da8 00007652B4039AC0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
145 139 6da7 00007653EC00D830 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
148 336 6daa 000076528C006780 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
149 115 6dab 00007655440063A0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
150 99 6dac 0000765104001FA0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
151 13 6daf 0000765540007AF0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
152 161 6db0 00007652B4014A70 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
153 9 6db1 000076526C0093B0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
154 214 6db2 00007650F412B800 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
155 339 6db3 0000765104010FD0 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
156 208 6db4 0000765240002E10 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
157 113 6db5 0000765378014E00 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
159 308 6db7 0000765540010380 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
158 7 6db6 00007651440283F0 1021220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
160 312 6db8 00007652DC13AB80 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
161 96 6db9 00007653EC000C80 1031220 Preemptive 0000000000000000:0000000000000000 00005C5D475CE4F0 0 Ukn (Threadpool Worker)
(lldb)
There is additional evidence that we are trying to allocate memory: one of the threads is blocked in String.Concat, which allocates from the GC heap and can therefore trigger a GC:
OS Thread Id: 0x6d9f
Child SP IP Call Site
0000765532FFB3F8 0000766c1be2400c [HelperMethodFrame: 0000765532ffb3f8]
0000765532FFB540 0000766BA6D6E57D System.String.Concat(System.ReadOnlySpan`1<Char>, System.ReadOnlySpan`1<Char>, System.ReadOnlySpan`1<Char>) [/_/src/System.Private.CoreLib/shared/System/String.Manipulation.cs @ 356]
0000765532FFB5A0 0000766BA6D70565 System.Uri.CombineUri(System.Uri, System.String, System.UriFormat) [/_/src/System.Private.Uri/src/System/Uri.cs @ 5161]
0000765532FFB640 0000766BA6D6FB6B System.Uri.GetCombinedString(System.Uri, System.String, Boolean, System.String ByRef) [/_/src/System.Private.Uri/src/System/Uri.cs @ 597]
0000765532FFB6A0 0000766BA6D6F8CF System.Uri.ResolveHelper(System.Uri, System.Uri, System.String ByRef, Boolean ByRef, System.UriFormatException ByRef) [/_/src/System.Private.Uri/src/System/UriExt.cs @ 696]
0000765532FFB6E0 0000766BA6D6F4FB System.Uri.CreateUri(System.Uri, System.String, Boolean) [/_/src/System.Private.Uri/src/System/Uri.cs @ 453]
The other managed threads seemed to be blocked on various socket operations. So my guess is that the GC wasn't able to make forward progress for some reason which is why the process hung.
It seems this is not a diagnostics issue. Some of our tools, such as dotnet-gcdump and dotnet-trace, require that we suspend the runtime, so if we are stuck in a GC or unable to suspend, those tools won't be able to make progress either.
Adding @cshung for thoughts on next steps.
I looked at the various thread states and summarized them here:
1 2020020
0x00000020 Legal to Join
0x00020000 Fully initialized
0x02000000 Interruptible
2,19-74,262-264 21220
0x00000020 Legal to Join
0x00000200 Background
0x00001000 CLR Owns
0x00020000 Fully initialized
3 1020220
0x00000020 Legal to Join
0x00000200 Background
0x00020000 Fully initialized
0x01000000 Thread Pool Worker Thread
5, 7, 11, 18, 144, 208, 220, 308-311, 315 1021220
0x00000020 Legal to Join
0x00000200 Background
0x00001000 CLR Owns
0x00020000 Fully initialized
0x01000000 Thread Pool Worker Thread
9, 13, 84-139, 148, 155-184, 214, 301-307, 312-313, 324-339 1031220
0x00000020 Legal to Join
0x00000200 Background
0x00001000 CLR Owns
0x00010000 Reported Dead
0x00020000 Fully initialized
0x01000000 Thread Pool Worker Thread
152 3021220
0x00000020 Legal to Join
0x00000200 Background
0x00001000 CLR Owns
0x00020000 Fully initialized
0x01000000 Thread Pool Worker Thread
0x02000000 Interruptible
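For anyone who wants to repeat the decoding above on another dump, here is a minimal Python sketch. It assumes the State column printed by clrthreads is a hex bitmask, and it only knows the flag values listed in the summary above (they are copied from this thread, not read from the runtime sources), so any other bits are reported as unknown. The names `THREAD_STATE_FLAGS` and `decode_thread_state` are made up for this example.

```python
# Flag values and names are taken from the summary in this thread.
# This table is not the full set of runtime thread state bits.
THREAD_STATE_FLAGS = {
    0x00000020: "Legal to Join",
    0x00000200: "Background",
    0x00001000: "CLR Owns",
    0x00010000: "Reported Dead",
    0x00020000: "Fully initialized",
    0x01000000: "Thread Pool Worker Thread",
    0x02000000: "Interruptible",
}


def decode_thread_state(state_hex: str) -> list[str]:
    """Return the names of the flags set in a clrthreads State value (hex, no 0x prefix)."""
    value = int(state_hex, 16)
    names = []
    # Walk the known bits from lowest to highest and clear each one we recognize.
    for bit, name in sorted(THREAD_STATE_FLAGS.items()):
        if value & bit:
            names.append(name)
            value &= ~bit
    # Anything left over is a bit this table does not describe.
    if value:
        names.append(f"unknown bits 0x{value:08x}")
    return names


if __name__ == "__main__":
    for state in ("2020020", "21220", "1020220", "1021220", "1031220", "3021220"):
        print(state, ", ".join(decode_thread_state(state)))
```

Running it over the six State values seen in this dump should reproduce the groups listed above, for example 1031220 decodes to the 1021220 flags plus "Reported Dead".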
@jacoballen73 we have fixes to createdump in .NET 7 that should help make the dumps more actionable. .NET Core 3.1 doesn't include the stack unwind data in the dump, so we need the matching Linux distro to investigate. Would you mind trying the scenario with .NET 7?
We recently upgraded from netcore3.1 to .net6. This situation was already flaky (it doesn't happen reliably in every environment it's being run in), and we haven't had this exact situation arise again (we also now have some liveness probes in place, so it can't be stuck for this long anymore). I will try to update here if a similar situation arises, but our own investigations have similarly led us to believe that being stuck inside GC for some reason is part of the core issue.
I think the dump generator changes that add the unwind info also went into the 6.0.9 service release. It hasn't been officially released yet, but it will be soon.
@jacoballen73 I am just checking in - did the unwind fix in 6.0.9 (or later) address the issue?
I haven't specifically seen the same issue with the dotnet tools hanging, though some of the core issues with the application still exist.
However, I am now getting "HRESULT: 0x80004005 dump failed" when I try to use dotnet-dump to get a memory dump from the application. It's being run in a k8s pod/container, and it is run with sudo permissions as far as I can tell (missing permissions seem to be the main cause in other posts that discuss this error), so I still don't know why these dotnet-dump runs fail.
Would you mind adding the "--diag" option to your dotnet-dump collect command? It produces a log to your app's console that you will need to capture.
It doesn't change anything in the output for me: Noticed after originally posting, but this pod is giving the 0x00000000 result instead of the one I posted above (which I had seen before on other pods). I get this HRESULT even without the --diag flag.
This looks like a mariner container. Is the host node image also Mariner? If so, dumping/ptrace support isn't working at the moment - it's tracked by https://github.com/dotnet/diagnostics/issues/3423, but they have no fix yet.
Yes it's a mariner container and the node is being run on a mariner VM. Thanks for the heads-up, I will keep my eye on that issue. Do you know if the HRESULT: 0x80004005 error might also be related to that? I got that error in the past still with the mariner host/mariner container/dotnet 6 combination, though it seems I might not be able to repro anymore if I'm getting this new error now.
So the simple answer is no - it's not the exact same issue. 0x80004005 is a very generic error code. A few things that might be responsible for such error codes:
So, with dotnet-dump and .NET 6, I'm seeing this same "Write dump failed - HRESULT 0x00000000" error message, but the context isn't containers or Mariner. It is just an Ubuntu 18.04 image. I suspect there is some environment setup issue affecting this, but I don't even know how I'm supposed to find any diagnostics to make it clearer. What should I be looking for?
That's usually a problem where createdump doesn't have the permissions to run or to write to the directory - see if createdump is executable by the same user that runs the dotnet process, and also make sure that same user has write access to the dump directory.
Answer: just as hoyosjs said, it seems that this "Write dump failed - HRESULT 0x00000000" can happen when the user running the debuggee process (launched either by dotnet or in some other way) does not have execute permission on the createdump executable (and the other binaries required to create a dump). An initialization failure then happens early on, before the debuggee responds to the request issued by dotnet-dump. Mitigated with some chmod.
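In case it helps others hitting the same message, the two checks suggested above can be scripted roughly like this. It is only a sketch: `check_dump_prerequisites` is a made-up helper, both paths are supplied by you (createdump normally sits next to libcoreclr.so in the shared runtime directory, but the exact location depends on your install), and `os.access` answers for the user running the script, so run it as the same user that runs the dotnet process.

```python
import os
import sys


def check_dump_prerequisites(createdump_path: str, dump_dir: str) -> list[str]:
    """Return the problems found for the current user; an empty list means none were found."""
    problems = []

    # createdump must exist and be executable by the user that runs the app.
    if not os.path.isfile(createdump_path):
        problems.append(f"{createdump_path} does not exist")
    elif not os.access(createdump_path, os.X_OK):
        problems.append(f"{createdump_path} is not executable by this user")

    # Creating a file in the dump directory needs write and search permission on it.
    if not os.path.isdir(dump_dir):
        problems.append(f"{dump_dir} is not a directory")
    elif not os.access(dump_dir, os.W_OK | os.X_OK):
        problems.append(f"{dump_dir} is not writable by this user")

    return problems


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_dump.py <path to createdump> <dump directory>
    found = check_dump_prerequisites(sys.argv[1], sys.argv[2])
    for problem in found:
        print(problem)
    if not found:
        print("no permission problems found for this user")
```

This only covers the two permission checks mentioned in this thread; it does not look at the other binaries createdump depends on or at container/ptrace restrictions.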
I've opened the following runtime issue to track generating a useful error message when the runtime cannot launch createdump: https://github.com/dotnet/runtime/issues/87352
Description
We have a linux docker container running a dotnetcore 3.1 app which has been completely frozen for the past ~6 days. In this time, it has not published any MDM metrics or been responsive at all. We were able to create a memory dump using dotnet-dump, however many other dotnet diagnostic tools did not work. Specifically, dotnet-trace would hang after making the call (even with a specific duration configured), as would dotnet-gcdump. Attempting to use dotnet-counters monitor would also hang waiting to receive information. @sebastienros was with us during a live debugging session where we attempted to do this. Using top shows that no process is consuming all of the CPU (the primary dotnet process only used on average 0-.3% when we were observing it), and the container as a whole is not unresponsive, as we were still able to exec into it and run various bash commands.
Unfortunately, after being stuck for 6 days, the container has finally restarted, so the live repro of this is lost, but I was informed that creating this issue would still have value. Though this freezing bug does not repro reliably, it is still possible for it to appear again and we may be able to provide more info if that happens.