Closed juliusfriedman closed 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
cc @danmosemsft @jkotas
Tagging subscribers to this area: @tommcdon Notify danmosemsft if you want to be subscribed.
Also noting that I run my process under `screen` so I can easily disconnect. When I resume, the screen process seems to be stuck, and I don't really have a way of knowing if it was my process or the tool, etc. (`top` reports the process I attempted to trace at 100%, but only 3.1% of my whole CPU is used [32-core machine]), and the program is massively parallel.
Should I assume that the process itself has become unstable also and terminate and restart?
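As a sanity check on those numbers: one core pegged at 100% on a 32-core machine accounts for roughly 3.1% of total CPU capacity, which is consistent with a single busy (or stuck) thread rather than the whole parallel workload running:

```shell
# One fully busy core as a share of total CPU on a 32-core machine.
awk 'BEGIN { printf "%.1f%%\n", 100 / 32 }'   # prints 3.1%
```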
More of the above
So I take it the process is doing something? And maybe it's not unstable? But I still cannot connect with `dotnet-trace` or `dotnet-dump`.
Going to give it some time and see what happens.
Can you please reopen in https://github.com/dotnet/diagnostics? Or @tommcdon can transfer... I don't have rights, apparently.
@dotnet/dotnet-diag
@mikem8361
The process appears to still be running, however I believe it should have completed by now, as it was making a single-threaded reduction of an 854-item list. (A similar run with 763 items completed in less than an hour on my local PC with 12 cores.) [The Azure machine has 32 cores.]
At this point before I kill the process is there anything I can do to provide information which might be helpful to you guys and or myself?
I have about an hour before I either have to kill that process or give it a low priority and start a new one. Let me know if you need something collected out of it and how; otherwise I will end up killing it in about an hour.
I also wanted to note that I have also run `sudo ulimit -c unlimited` to ensure that core dumps were created, so is it possible this process has already crashed and is trying to make a core dump? Although from the `strace` output and memory usage it seems like it's not... although I cannot be completely sure.
@josalem, just an FYI, it's still running.
Given that you ran `sudo ulimit -c unlimited` before starting the failing process, you may be able to get a usable core dump by:

- setting the `coredump_filter` for the process to at least 0x3f (see http://man7.org/linux/man-pages/man5/core.5.html)
- running `kill -9 67295`

If we had a core dump it might be useful to @josalem.
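The two steps above can be sketched as a shell session (the PID and mask are the ones from this thread; the `coredump_filter` bits are documented in core(5)):

```shell
# Widen the kernel's coredump_filter so the core includes anonymous and
# file-backed mappings: bits 0-5 set, i.e. a mask of 0x3f (decimal 63).
pid=67295                                    # the stuck process from this thread
cat /proc/$pid/coredump_filter               # inspect the current mask
echo 0x3f | sudo tee /proc/$pid/coredump_filter
# Then terminate the process so the kernel writes the core file:
# kill -9 $pid
```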
Catching up on this thread. @juliusfriedman, I'm assuming those `strace` logs are from the target process, yes?
Before we kill the process and collect a dump, do you have `lldb` (preferred) or `gdb` (fallback) installed, or can you install them on the box where the target process is running? If you can, let's try to get the process under a debugger and collect a dump without killing it (`process save-core` in `lldb`, `generate-core-file` in `gdb`).
Once you've got a coredump, try to get a stack trace for all the threads (`bt all` in `lldb`, `thread apply all bt` in `gdb`). That ought to tell us what state the runtime, and specifically the Diagnostics Server, is in.
Lastly, could you take a look at your `/tmp` directory and list out all files that match the pattern `dotnet-diagnostic-*`? There should be one that has the pid of your process. The exception that `dotnet-trace` is showing indicates that it couldn't find that socket.
I don't want to hazard a guess about the `strace` output just yet. I'm hoping a little context might inform us what the calls to `mprotect` are doing.
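The `/tmp` check is a one-liner; a sketch (the diagnostics socket file name embeds the pid of the process that created it):

```shell
# List any diagnostics IPC sockets left by .NET Core processes.
# The file name embeds the pid of the owning process.
ls -la /tmp/dotnet-diagnostic-* 2>/dev/null
```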
@josalem, yes. `lldb` is installed. What should I do next? `lldb -p 67295` then `process save-core`?
Architecture set to: x86_64-pc-linux.
(lldb) process save-core
error: 'process save-core' takes one arguments:
Usage: process save-core FILE
(lldb) process save-core core.dump
error: Failed to save core file for process: no ObjectFile plugins were able to save a core for this process
(lldb)
AzureUser@machine-t:~$ ls /tmp |grep dotnet-diagnostic-
dotnet-diagnostic-21971-15572610-socket
dotnet-diagnostic-36314-16903719-socket
AzureUser@machine-t:~$ ls -lahust /tmp |grep dotnet-diagnostic-
0 srw------- 1 root root 0 Jul 1 15:01 dotnet-diagnostic-36314-16903719-socket
0 srw------- 1 root root 0 Jul 1 11:19 dotnet-diagnostic-21971-15572610-socket
@josalem, @sdmaclea, @tommcdon anything else I can do?
Just to add, this does not replicate on Windows.
@josalem, @sdmaclea, @tommcdon
Should I run the `kill -9 67295` now?
If `lldb` isn't writing the coredump, we can use the in-box `createdump` tool directly. Inside `lldb` you should be able to run `image list --verbose` to see a list of loaded modules. In that output you should see the path to `libcoreclr.so`. You should find `createdump` in that directory. I don't recall what all the arguments on the tool are, but it should be invoked with something like `createdump --withheap <pid>` to get a core file.
Once you've got the coredump, let's take a look at the stacktraces of the live process in `lldb`. I have a feeling we should see what's going on pretty quickly from those. There isn't a socket for your process in `/tmp`, so either something else deleted the socket, or the process has already started the shutdown path and deleted the socket itself. I'm not aware of a code path that deletes the socket except on the shutdown path.
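A sketch of those steps, assuming the shared-framework layout that shows up later in this thread (the version directory will vary by install):

```shell
# Locate createdump next to libcoreclr.so (whose path comes from
# `image list --verbose` in lldb) and invoke it against the target PID.
# The 3.1.5 path below is the one used later in this thread.
CORECLR=/usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.5/libcoreclr.so
sudo "$(dirname "$CORECLR")/createdump" --withheap 67295
# createdump prints where it wrote the dump, e.g. /tmp/coredump.67295
```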
(lldb) image list
[ 0] DA1B8E35-81A4-364E-E4B5-2E6434595D7C-FA1ABC72 /mnt/publish/TripPlanner
[ 1] C0314A9E-B527-8A9E-B196-5B533E0BE338-AF8CA28D 0x00007fff0f10b000 [vdso] (0x00007fff0f10b000)
[ 2] 28C6AADE-70B2-D40D-1F0F-3D0A1A0CAD1A-B816448F /lib/x86_64-linux-gnu/libpthread.so.0
[ 3] 25AD56E9-02E2-3B49-0A9C-CDB08A9744D8-9CB95BCC /lib/x86_64-linux-gnu/libdl.so.2
/usr/lib/debug/lib/x86_64-linux-gnu/libdl-2.27.so
[ 4] F2119A44-A997-5811-4620-C8E9D8E243D7-094F77F6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
[ 5] A33761AB-8FB4-8531-1B3C-85BF4253099D-7CABE653 /lib/x86_64-linux-gnu/libm.so.6
/usr/lib/debug/lib/x86_64-linux-gnu/libm-2.27.so
[ 6] 679F3AE1-1120-EC7C-483B-C9295345D836-F5C104F7 /lib/x86_64-linux-gnu/libgcc_s.so.1
[ 7] B417C0BA-7CC5-CF06-D1D1-BED6652CEDB9-253C60D0 /lib/x86_64-linux-gnu/libc.so.6
/usr/lib/debug/lib/x86_64-linux-gnu/libc-2.27.so
[ 8] 64DF1B96-1228-382F-E186-84249ED800AB-1DCEAAD4 /lib64/ld-linux-x86-64.so.2
[ 9] 106C8A96-DA2F-2138-0BEB-B44E682EDCA1-B89A4192 /mnt/publish/libhostfxr.so
[ 10] 6F15F72F-14BB-9610-45F1-5499AB9F0623-98083E37 /mnt/publish/libhostpolicy.so
[ 11] 6747E35E-0E7B-FC95-FC28-3951E9FC62A5-BC97A494 /mnt/publish/libcoreclr.so
[ 12] 9826FBDF-57ED-7D69-6513-1074CB3C08B1-009C1CD8 /lib/x86_64-linux-gnu/librt.so.1
/usr/lib/debug/lib/x86_64-linux-gnu/librt-2.27.so
[ 13] 6D26E899-6EDA-1DE2-1613-AE97CCF12216-7B2F4B6D /usr/lib/x86_64-linux-gnu/libnuma.so.1
[ 14] 9EBC4551-0179-D961-1FB5-C14CAD4B767F-29739D7A /mnt/publish/libclrjit.so
[ 15] 11726626-50B9-CACA-5FA7-96CA34D3E47C-F4EE451B /mnt/publish/System.Native.so
[ 16] 8B67603F-B1A8-0776-957A-E830E6267BE1-3C2B178F /mnt/publish/System.Globalization.Native.so
[ 17] 3CABDDE9-BB67-44AA-C535-FABB11399D4D-4E72F932 /usr/lib/x86_64-linux-gnu/libicuuc.so.60
[ 18] C4974FC9-FA56-7DE6-E95C-F1364706B3CF-79175FB5 /usr/lib/x86_64-linux-gnu/libicudata.so.60
[ 19] C74FCD0F-E314-58E1-4A2F-FE21183023F5-81ACD388 /usr/lib/x86_64-linux-gnu/libicui18n.so.60
[ 20] 2A919665-DDA5-6796-E76B-5ABA076632D4-F1ED418E /mnt/publish/System.Security.Cryptography.Native.OpenSsl.so
[ 21] 881B3F4F-DA20-6DF3-909B-141DC7410A08-C2DD4B90 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
[ 22] 81210806-4C0E-61DF-5227-1B3147248999-619E7AFF /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
[ 23] 53BD03AD-A320-ECD7-5955-334A752B75D0-EE3B77F5 /lib/x86_64-linux-gnu/libnss_files.so.2
/usr/lib/debug/lib/x86_64-linux-gnu/libnss_files-2.27.so
[ 24] C501B532-CF37-B5F4-160F-6E9C1B8938E6-46757926 /lib/x86_64-linux-gnu/libnss_dns.so.2
/usr/lib/debug/lib/x86_64-linux-gnu/libnss_dns-2.27.so
[ 25] 390E9CC4-C215-314B-6D8A-DE6D6E28F851-8418039C /lib/x86_64-linux-gnu/libresolv.so.2
/usr/lib/debug/lib/x86_64-linux-gnu/libresolv-2.27.so
(lldb) createdump --withheap
error: 'createdump' is not a valid command.
error: Unrecognized command 'createdump'.
(lldb)
AzureUser@machine-t:~$ sudo /usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.5/createdump --withheap 67295
Writing minidump with heap to file /tmp/coredump.67295
Written 24844054528 bytes (6065443 pages) to core file
@juliusfriedman, `createdump` is an external command-line app that lives right next to `libcoreclr`. Also, do you have a mount of the publish directory with the Linux files from a Windows machine?
@hoyosjs, yes, I have the source on that machine in the `/mnt` directory. (It was built on Windows.)
I have the coredump:
AzureUser@machine-t:~$ stat /tmp/coredump.67295
File: /tmp/coredump.67295
Size: 24844251136 Blocks: 48523952 IO Block: 4096 regular file
Device: 801h/2049d Inode: 1814 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-07-06 19:03:26.831192557 +0000
Modify: 2020-07-06 19:03:57.323555203 +0000
Change: 2020-07-06 19:03:57.323555203 +0000
Birth: -
@hoyosjs, @josalem, @sdmaclea, @tommcdon
What's next?
Can you show the output of `bt all` in `lldb` for the process?
@hoyosjs, @josalem, @sdmaclea, @tommcdon
(lldb) bt all
* thread #1, name = 'MyProgram', stop reason = signal SIGSTOP
* frame #0: 0x00007f4a60816b30 libcoreclr.so`JIT_WriteBarrier
frame #1: 0x00007f49eb699409
frame #2: 0x00007f49eb6992b1
frame #3: 0x00007f49eb6a0c07
frame #4: 0x00007f49eb6a09e7
frame #5: 0x00007f49eb6a08d8
frame #6: 0x00007f49eb6988e1
frame #7: 0x00007f49eb6a2105
frame #8: 0x00007f49eb6a1c78
frame #9: 0x00007f49eaad377e
frame #10: 0x00007f49eb69f173
frame #11: 0x00007f49eb6cdf06
frame #12: 0x00007f49eb152abd
frame #13: 0x00007f49eb14b457
frame #14: 0x00007f49e741a1ba
frame #15: 0x00007f4a608165ff libcoreclr.so`CallDescrWorkerInternal + 124
frame #16: 0x00007f4a60746445 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
frame #17: 0x00007f4a60746f20 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
frame #18: 0x00007f4a6082af72 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) + 786
frame #19: 0x00007f4a6082b270 libcoreclr.so`Assembly::ExecuteMainMethod(PtrArray**, int) + 272
frame #20: 0x00007f4a60693011 libcoreclr.so`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 577
frame #21: 0x00007f4a6066bdb4 libcoreclr.so`coreclr_execute_assembly + 212
frame #22: 0x00007f4a60d71413 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
frame #23: 0x00007f4a60d717d4 libhostpolicy.so`___lldb_unnamed_symbol112$$libhostpolicy.so + 68
frame #24: 0x00007f4a60d71eac libhostpolicy.so`corehost_main + 204
frame #25: 0x00007f4a60fd6fec libhostfxr.so`___lldb_unnamed_symbol386$$libhostfxr.so + 1532
frame #26: 0x00007f4a60fd5b83 libhostfxr.so`___lldb_unnamed_symbol384$$libhostfxr.so + 691
frame #27: 0x00007f4a60fd0bd3 libhostfxr.so`hostfxr_main_startupinfo + 147
frame #28: 0x000000000040f4e4 MyProgram`___lldb_unnamed_symbol141$$MyProgram + 788
frame #29: 0x000000000040fc3e MyProgram`___lldb_unnamed_symbol142$$MyProgram + 142
frame #30: 0x00007f4a61229b97 libc.so.6`__libc_start_main(main=(MyProgram`___lldb_unnamed_symbol142$$MyProgram), argc=4, argv=0x00007fff0f027c98, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fff0f027c88) at libc-start.c:310
frame #31: 0x0000000000403620 MyProgram`_start + 41
thread #2, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a6131cbf9 libc.so.6`__GI___poll(fds=0x00007f4a603bedb8, nfds=1, timeout=-1) at poll.c:29
frame #1: 0x00007f4a60a425de libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char*, int) + 94
frame #2: 0x00007f4a60a421b8 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(int, CorUnix::CPalSynchronizationManager::SynchWorkerCmd*, void**, unsigned int*) + 56
frame #3: 0x00007f4a60a41e34 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(void*) + 148
frame #4: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #5: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #6: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #3, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a44da6 libcoreclr.so`Sleep + 102
frame #4: 0x00007f4a6096cbe2 libcoreclr.so`DbgTransportSession::TransportWorker() + 274
frame #5: 0x00007f4a6096ba99 libcoreclr.so`DbgTransportSession::TransportWorkerStatic(void*) + 9
frame #6: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #7: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #8: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #4, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a621499f3 libpthread.so.0`__pthread_cond_wait + 579
frame #1: 0x00007f4a60a40585 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 309
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a6096a236 libcoreclr.so`DebuggerRCThread::MainLoop() + 246
frame #6: 0x00007f4a6096a0db libcoreclr.so`DebuggerRCThread::ThreadProc() + 251
frame #7: 0x00007f4a60969df9 libcoreclr.so`DebuggerRCThread::ThreadProcStatic(void*) + 121
frame #8: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #9: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #10: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #5, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a6077625e libcoreclr.so`FinalizerThread::WaitForFinalizerEvent(CLREvent*) + 142
frame #6: 0x00007f4a60776372 libcoreclr.so`FinalizerThread::FinalizerThreadWorker(void*) + 98
frame #7: 0x00007f4a60718645 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 341
frame #8: 0x00007f4a60718cfd libcoreclr.so`ManagedThreadBase::FinalizerBase(void (*)(void*)) + 45
frame #9: 0x00007f4a607765ac libcoreclr.so`FinalizerThread::FinalizerThreadStart(void*) + 188
frame #10: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #11: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #12: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #6, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a6214e23a libpthread.so.0`__waitpid + 74
frame #1: 0x00007f4a60a48aad libcoreclr.so`PROCCreateCrashDump(char**) + 125
frame #2: 0x00007f4a60a48c84 libcoreclr.so`PAL_GenerateCoreDump + 164
frame #3: 0x00007f4a608661e5 libcoreclr.so`DumpDiagnosticProtocolHelper::GenerateCoreDump(DiagnosticsIpc::IpcMessage&, IpcStream*) + 373
frame #4: 0x00007f4a6075d9a4 libcoreclr.so`DiagnosticsServerThread(void*) + 468
frame #5: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #6: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #7: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #7, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a44f89 libcoreclr.so`SleepEx + 153
frame #4: 0x00007f4a6073a1cd libcoreclr.so`ThreadpoolMgr::TimerThreadFire() + 157
frame #5: 0x00007f4a6073a095 libcoreclr.so`ThreadpoolMgr::TimerThreadStart(void*) + 181
frame #6: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #7: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #8: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #8, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a621499f3 libpthread.so.0`__pthread_cond_wait + 579
frame #1: 0x00007f4a60a40585 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 309
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a60714e24 libcoreclr.so`Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 948
frame #6: 0x00007f4a60710319 libcoreclr.so`Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 89
frame #7: 0x00007f4a607e096f libcoreclr.so`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 95
frame #8: 0x00007f4a60715d40 libcoreclr.so`Thread::Block(int, PendingSync*) + 32
frame #9: 0x00007f4a6070ef99 libcoreclr.so`SyncBlock::Wait(int, int) + 569
frame #10: 0x00007f4a609d41b9 libcoreclr.so`ObjectNative::WaitTimeout(bool, int, Object*) + 233
frame #11: 0x00007f49e6f34b57
frame #12: 0x00007f49e6f348d4
frame #13: 0x00007f49e8578fba
frame #14: 0x00007f49e85788f1
frame #15: 0x00007f49e8577629
frame #16: 0x00007f49e6f254b2
frame #17: 0x00007f49e6f2ed61
frame #18: 0x00007f49e6f255ce
frame #19: 0x00007f4a608165ff libcoreclr.so`CallDescrWorkerInternal + 124
frame #20: 0x00007f4a60746445 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
frame #21: 0x00007f4a60746f20 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
frame #22: 0x00007f4a60751273 libcoreclr.so`ThreadNative::KickOffThread_Worker(void*) + 467
frame #23: 0x00007f4a60718645 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 341
frame #24: 0x00007f4a60718c5d libcoreclr.so`ManagedThreadBase::KickOff(void (*)(void*), void*) + 45
frame #25: 0x00007f4a607514e1 libcoreclr.so`ThreadNative::KickOffThread(void*) + 369
frame #26: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #27: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #28: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #9, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a449e4 libcoreclr.so`PAL_WaitForSingleObjectPrioritized + 84
frame #5: 0x00007f4a607e0e30 libcoreclr.so`CLRLifoSemaphore::WaitForSignal(unsigned int) + 32
frame #6: 0x00007f4a607e1247 libcoreclr.so`CLRLifoSemaphore::Wait(unsigned int, unsigned int, unsigned int) + 615
frame #7: 0x00007f4a6073784e libcoreclr.so`ThreadpoolMgr::WorkerThreadStart(void*) + 910
frame #8: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #9: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #10: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
The app appears to be running, would you disagree?
(lldb) bt all
* thread #1, name = 'MyProgram', stop reason = signal SIGSTOP
* frame #0: 0x00007f49eb699495
frame #1: 0x00007f49eb6992b1
frame #2: 0x00007f49eb6a0c07
frame #3: 0x00007f49eb6a09e7
frame #4: 0x00007f49eb6a08d8
frame #5: 0x00007f49eb6988e1
frame #6: 0x00007f49eb6a2105
frame #7: 0x00007f49eb6a1c78
frame #8: 0x00007f49eaad377e
frame #9: 0x00007f49eb69f190
frame #10: 0x00007f49eb6cdf06
frame #11: 0x00007f49eb152abd
frame #12: 0x00007f49eb14b457
frame #13: 0x00007f49e741a1ba
frame #14: 0x00007f4a608165ff libcoreclr.so`CallDescrWorkerInternal + 124
frame #15: 0x00007f4a60746445 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
frame #16: 0x00007f4a60746f20 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
frame #17: 0x00007f4a6082af72 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) + 786
frame #18: 0x00007f4a6082b270 libcoreclr.so`Assembly::ExecuteMainMethod(PtrArray**, int) + 272
frame #19: 0x00007f4a60693011 libcoreclr.so`CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 577
frame #20: 0x00007f4a6066bdb4 libcoreclr.so`coreclr_execute_assembly + 212
frame #21: 0x00007f4a60d71413 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
frame #22: 0x00007f4a60d717d4 libhostpolicy.so`___lldb_unnamed_symbol112$$libhostpolicy.so + 68
frame #23: 0x00007f4a60d71eac libhostpolicy.so`corehost_main + 204
frame #24: 0x00007f4a60fd6fec libhostfxr.so`___lldb_unnamed_symbol386$$libhostfxr.so + 1532
frame #25: 0x00007f4a60fd5b83 libhostfxr.so`___lldb_unnamed_symbol384$$libhostfxr.so + 691
frame #26: 0x00007f4a60fd0bd3 libhostfxr.so`hostfxr_main_startupinfo + 147
frame #27: 0x000000000040f4e4 MyProgram`___lldb_unnamed_symbol141$$MyProgram + 788
frame #28: 0x000000000040fc3e MyProgram`___lldb_unnamed_symbol142$$MyProgram + 142
frame #29: 0x00007f4a61229b97 libc.so.6`__libc_start_main(main=(MyProgram`___lldb_unnamed_symbol142$$MyProgram), argc=4, argv=0x00007fff0f027c98, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fff0f027c88) at libc-start.c:310
frame #30: 0x0000000000403620 MyProgram`_start + 41
thread #2, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a6131cbf9 libc.so.6`__GI___poll(fds=0x00007f4a603bedb8, nfds=1, timeout=-1) at poll.c:29
frame #1: 0x00007f4a60a425de libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(int, unsigned char*, int) + 94
frame #2: 0x00007f4a60a421b8 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(int, CorUnix::CPalSynchronizationManager::SynchWorkerCmd*, void**, unsigned int*) + 56
frame #3: 0x00007f4a60a41e34 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(void*) + 148
frame #4: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #5: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #6: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #3, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a44da6 libcoreclr.so`Sleep + 102
frame #4: 0x00007f4a6096cbe2 libcoreclr.so`DbgTransportSession::TransportWorker() + 274
frame #5: 0x00007f4a6096ba99 libcoreclr.so`DbgTransportSession::TransportWorkerStatic(void*) + 9
frame #6: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #7: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #8: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #4, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a621499f3 libpthread.so.0`__pthread_cond_wait + 579
frame #1: 0x00007f4a60a40585 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 309
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a6096a236 libcoreclr.so`DebuggerRCThread::MainLoop() + 246
frame #6: 0x00007f4a6096a0db libcoreclr.so`DebuggerRCThread::ThreadProc() + 251
frame #7: 0x00007f4a60969df9 libcoreclr.so`DebuggerRCThread::ThreadProcStatic(void*) + 121
frame #8: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #9: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #10: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #5, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a6077625e libcoreclr.so`FinalizerThread::WaitForFinalizerEvent(CLREvent*) + 142
frame #6: 0x00007f4a60776372 libcoreclr.so`FinalizerThread::FinalizerThreadWorker(void*) + 98
frame #7: 0x00007f4a60718645 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 341
frame #8: 0x00007f4a60718cfd libcoreclr.so`ManagedThreadBase::FinalizerBase(void (*)(void*)) + 45
frame #9: 0x00007f4a607765ac libcoreclr.so`FinalizerThread::FinalizerThreadStart(void*) + 188
frame #10: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #11: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #12: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #6, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a6214e23a libpthread.so.0`__waitpid + 74
frame #1: 0x00007f4a60a48aad libcoreclr.so`PROCCreateCrashDump(char**) + 125
frame #2: 0x00007f4a60a48c84 libcoreclr.so`PAL_GenerateCoreDump + 164
frame #3: 0x00007f4a608661e5 libcoreclr.so`DumpDiagnosticProtocolHelper::GenerateCoreDump(DiagnosticsIpc::IpcMessage&, IpcStream*) + 373
frame #4: 0x00007f4a6075d9a4 libcoreclr.so`DiagnosticsServerThread(void*) + 468
frame #5: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #6: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #7: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #7, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a44f89 libcoreclr.so`SleepEx + 153
frame #4: 0x00007f4a6073a1cd libcoreclr.so`ThreadpoolMgr::TimerThreadFire() + 157
frame #5: 0x00007f4a6073a095 libcoreclr.so`ThreadpoolMgr::TimerThreadStart(void*) + 181
frame #6: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #7: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #8: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #8, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a621499f3 libpthread.so.0`__pthread_cond_wait + 579
frame #1: 0x00007f4a60a40585 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 309
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a44b31 libcoreclr.so`WaitForMultipleObjectsEx + 81
frame #5: 0x00007f4a60714e24 libcoreclr.so`Thread::DoAppropriateWaitWorker(int, void**, int, unsigned int, WaitMode) + 948
frame #6: 0x00007f4a60710319 libcoreclr.so`Thread::DoAppropriateWait(int, void**, int, unsigned int, WaitMode, PendingSync*) + 89
frame #7: 0x00007f4a607e096f libcoreclr.so`CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 95
frame #8: 0x00007f4a60715d40 libcoreclr.so`Thread::Block(int, PendingSync*) + 32
frame #9: 0x00007f4a6070ef99 libcoreclr.so`SyncBlock::Wait(int, int) + 569
frame #10: 0x00007f4a609d41b9 libcoreclr.so`ObjectNative::WaitTimeout(bool, int, Object*) + 233
frame #11: 0x00007f49e6f34b57
frame #12: 0x00007f49e6f348d4
frame #13: 0x00007f49e8578fba
frame #14: 0x00007f49e85788f1
frame #15: 0x00007f49e8577629
frame #16: 0x00007f49e6f254b2
frame #17: 0x00007f49e6f2ed61
frame #18: 0x00007f49e6f255ce
frame #19: 0x00007f4a608165ff libcoreclr.so`CallDescrWorkerInternal + 124
frame #20: 0x00007f4a60746445 libcoreclr.so`CallDescrWorkerWithHandler(CallDescrData*, int) + 117
frame #21: 0x00007f4a60746f20 libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1264
frame #22: 0x00007f4a60751273 libcoreclr.so`ThreadNative::KickOffThread_Worker(void*) + 467
frame #23: 0x00007f4a60718645 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 341
frame #24: 0x00007f4a60718c5d libcoreclr.so`ManagedThreadBase::KickOff(void (*)(void*), void*) + 45
frame #25: 0x00007f4a607514e1 libcoreclr.so`ThreadNative::KickOffThread(void*) + 369
frame #26: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #27: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #28: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
thread #9, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a62149ed9 libpthread.so.0`__pthread_cond_timedwait + 649
frame #1: 0x00007f4a60a4057b libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 299
frame #2: 0x00007f4a60a40194 libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 388
frame #3: 0x00007f4a60a447af libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int, int) + 1743
frame #4: 0x00007f4a60a449e4 libcoreclr.so`PAL_WaitForSingleObjectPrioritized + 84
frame #5: 0x00007f4a607e0e30 libcoreclr.so`CLRLifoSemaphore::WaitForSignal(unsigned int) + 32
frame #6: 0x00007f4a607e1247 libcoreclr.so`CLRLifoSemaphore::Wait(unsigned int, unsigned int, unsigned int) + 615
frame #7: 0x00007f4a6073784e libcoreclr.so`ThreadpoolMgr::WorkerThreadStart(void*) + 910
frame #8: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #9: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #10: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
Actually looking at that it seems it is generating a core dump already... but it's taking a long time?
@hoyosjs, @josalem, @sdmaclea, @tommcdon, @mikem8361
Also: dotnet tool install --global dotnet-gcdump
AzureUser@machine-t:~$ dotnet gcdump collect -p 67295
Writing gcdump to '/home/AzureUser/20200706_073405_67295.gcdump'...
Failed to collect gcdump. Try running with '-v' for more information.
AzureUser@machine-t:~$ dotnet gcdump collect -p 67295 -v
Writing gcdump to '/home/AzureUser/20200706_073555_67295.gcdump'...
0.0s: Creating type table flushing task
1.3s: [Error] Exception during gcdump: Microsoft.Diagnostics.NETCore.Client.ServerNotAvailableException: Process 67295 not running compatible .NET Core runtime.
at Microsoft.Diagnostics.NETCore.Client.IpcClient.GetTransport(Int32 processId) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 63
at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessage(Int32 processId, IpcMessage message, IpcMessage& response) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 104
at Microsoft.Diagnostics.NETCore.Client.EventPipeSession..ctor(Int32 processId, IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/EventPipeSession.cs:line 30
at Microsoft.Diagnostics.NETCore.Client.DiagnosticsClient.StartEventPipeSession(IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/DiagnosticsClient.cs:line 39
at Microsoft.Diagnostics.Tools.GCDump.EventPipeSessionController..ctor(Int32 pid, List`1 providers, Boolean requestRundown) in /_/src/Tools/dotnet-gcdump/DotNetHeapDump/EventPipeDotNetHeapDumper.cs:line 235
at Microsoft.Diagnostics.Tools.GCDump.EventPipeDotNetHeapDumper.DumpFromEventPipe(CancellationToken ct, Int32 processID, MemoryGraph memoryGraph, TextWriter log, Int32 timeout, DotNetHeapInfo dotNetInfo) in /_/src/Tools/dotnet-gcdump/DotNetHeapDump/EventPipeDotNetHeapDumper.cs:line 52
[ 2.6s: Done Dumping .NET heap success=False]
Failed to collect gcdump. Try running with '-v' for more information.
thread #6, name = 'MyProgram', stop reason = signal SIGSTOP
frame #0: 0x00007f4a6214e23a libpthread.so.0`__waitpid + 74
frame #1: 0x00007f4a60a48aad libcoreclr.so`PROCCreateCrashDump(char**) + 125
frame #2: 0x00007f4a60a48c84 libcoreclr.so`PAL_GenerateCoreDump + 164
frame #3: 0x00007f4a608661e5 libcoreclr.so`DumpDiagnosticProtocolHelper::GenerateCoreDump(DiagnosticsIpc::IpcMessage&, IpcStream*) + 373
frame #4: 0x00007f4a6075d9a4 libcoreclr.so`DiagnosticsServerThread(void*) + 468
frame #5: 0x00007f4a60a4a73d libcoreclr.so`CorUnix::CPalThread::ThreadEntry(void*) + 349
frame #6: 0x00007f4a621436db libpthread.so.0`start_thread + 219
frame #7: 0x00007f4a6132988f libc.so.6`__GI___clone at clone.S:95
hmmm, this looks like you used dotnet-dump after starting a trace and it left the server in a bad state (CC @mikem8361). This doesn't actually explain the underlying issue though. Your diagnostics server socket appears to have been deleted, but your process doesn't seem to be in the shutdown state. I'm not sure why your socket was deleted unless something else on your system deleted it. Without that socket, most of the .NET diagnostics tools won't work.
If you run ps -a | grep createdump, do you see any output? With some poking in lldb we could deduce the PID of the specific createdump the runtime is waiting on.
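As a minimal sketch of finding that child PID (using a sleep process as a stand-in for createdump, since the real child only exists on the hung machine), children of a given process can be listed with pgrep -P:

```shell
# Spawn a sleeper as a stand-in for a hung createdump child.
sleep 5 &
child=$!

# List the children of the current shell named "sleep"; against the hung
# runtime you would pass the traced process's PID instead of $$.
pgrep -P $$ sleep

kill "$child"
```

Once you have the child's PID you can attach lldb to it directly, as suggested below in the thread.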
Also, your managed code still appears to be running, so I'm not 100% sure the diagnostics server being stuck is causing the process to still be alive. You won't be able to see managed symbols without having SOS installed (dotnet tool install -g dotnet-sos followed by dotnet sos install); then you can relaunch lldb and use commands like clrstack to see your managed code.
Yeah, that's interesting. If you look at your temp folder, do you see something of the sort of dotnet-diagnostic-<PID> for that PID? And yes, thread 6 looks like it was a request from dotnet-dump to get a dump, and it got stuck waiting on it for some reason. Does your process have any child processes hanging off it still?
That thread (6) is waiting for the createdump process to finish/terminate. The createdump process should be there; waitpid should return no matter why createdump terminates. Attaching lldb to the createdump process and doing the “bt all” command might help with why it is hung.
I think the ServerNotAvailableExceptions are happening because the socket isn't there:
AzureUser@machine-t:~$ ls /tmp |grep dotnet-diagnostic-
dotnet-diagnostic-21971-15572610-socket
dotnet-diagnostic-36314-16903719-socket
AzureUser@machine-t:~$ ls -lahust /tmp |grep dotnet-diagnostic-
0 srw------- 1 root root 0 Jul 1 15:01 dotnet-diagnostic-36314-16903719-socket
0 srw------- 1 root root 0 Jul 1 11:19 dotnet-diagnostic-21971-15572610-socket
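A quick way to check whether the diagnostics socket for a particular process still exists (PID 99999 here is a hypothetical placeholder, and the default IPC location is assumed to be /tmp):

```shell
# Check for the diagnostics IPC socket of a given PID; substitute the
# traced process's real PID for this placeholder.
pid=99999
if ls /tmp | grep -q "dotnet-diagnostic-${pid}-"; then
    echo "socket present for pid ${pid}"
else
    echo "no socket for pid ${pid}"
fi
```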
Something must've cleaned up the socket while the createdump handler was running. @josalem do you know if any of your recent changes to the diagnostics server can cause this kind of behavior?
do you know if any of your recent changes to diagnostics server can cause this kind of behavior?
The only code paths in 3.1 that unlink the server socket are in the shutdown path as far as I'm aware. I don't believe any of my 5.0 changes would unlink the socket early, but those don't apply here regardless.
@josalem
dotnet tool install -g dotnet-sos
Worked but:
AzureUser@machine-t:~$ dotnet sos install
Could not execute because the specified command or file was not found.
Possible reasons for this include:
* You misspelled a built-in dotnet command.
* You intended to execute a .NET Core program, but dotnet-sos does not exist.
* You intended to run a global tool, but a dotnet-prefixed executable with this name could not be found on the PATH.
AzureUser@machine-t:~$ dotnet install sos
Could not execute because the specified command or file was not found.
Possible reasons for this include:
* You misspelled a built-in dotnet command.
* You intended to execute a .NET Core program, but dotnet-install does not exist.
* You intended to run a global tool, but a dotnet-prefixed executable with this name could not be found on the PATH.
But I eventually got it...
AzureUser@machine-t:~$ cd /home/AzureUser/.dotnet/tools/
AzureUser@machine-t:~/.dotnet/tools$ ls
dotnet-dump dotnet-gcdump dotnet-sos dotnet-trace
AzureUser@machine-t:~/.dotnet/tools$ sudo ./dotnet-sos
Required command was not provided.
Usage:
dotnet-sos [options] [command]
Options:
--version Display version information
Commands:
install Installs SOS and configures LLDB to load it on startup.
uninstall Uninstalls SOS and reverts any configuration changes to LLDB.
AzureUser@machine-t:~/.dotnet/tools$ sudo ./dotnet-sos install
Installing SOS to /home/AzureUser/.dotnet/sos from /home/AzureUser/.dotnet/tools/.store/dotnet-sos/3.1.122203/dotnet-sos/3.1.122203/tools/netcoreapp2.1/any/linux-x64
Creating installation directory...
Copying files...
Creating new /home/AzureUser/.lldbinit file - LLDB will load SOS automatically at startup
SOS install succeeded
AzureUser@machine-t:~/.dotnet/tools$
Just as an FYI, the bill for this machine is close to $1k, so we terminated the process; I will inspect the core dump when I have time.
I have restarted the process and will keep you up to date if this occurs again.
I would suggest also adding dotnet-gcdump to the tags.
Thank you guys for your time and I do not look forward to bothering you again.
If there is anything I can provide you out of the dump without exposing my application code I have no problem sharing something, but I have to be able to see the outputs etc before I share, I probably will not be able to share the raw dump.
The interesting part of the clrstack -all output from dotnet-dump analyze:
OS Thread Id: 0x106e3
Child SP IP Call Site
00007F4A5E637D60 00007f4a62149ed9 [DebuggerU2MCatchHandlerFrame: 00007f4a5e637d60]
OS Thread Id: 0x106e5
Child SP IP Call Site
OS Thread Id: 0x106e8
Child SP IP Call Site
00007F4A59F363D0 00007f4a621499f3 [GCFrame: 00007f4a59f363d0]
00007F4A59F364B0 00007f4a621499f3 [HelperMethodFrame_1OBJ: 00007f4a59f364b0] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007F4A59F365E0 00007F49E6F34B57 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/SemaphoreSlim.cs @ 467]
00007F4A59F36620 00007F49E6F348D4 System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/SemaphoreSlim.cs @ 381]
00007F4A59F366B0 00007F49E8578FBA System.Collections.Concurrent.BlockingCollection`1[[Microsoft.Extensions.Logging.Console.LogMessageEntry, Microsoft.Extensions.Logging.Console]].TryTakeWithNoTimeValidation(Microsoft.Extensions.Logging.Console.LogMessageEntry ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource) [/_/src/System.Collections.Concurrent/src/System/Collections/Concurrent/BlockingCollection.cs @ 698]
00007F4A59F36780 00007F49E85788F1 System.Collections.Concurrent.BlockingCollection`1+<GetConsumingEnumerable>d__68[[Microsoft.Extensions.Logging.Console.LogMessageEntry, Microsoft.Extensions.Logging.Console]].MoveNext() [/_/src/System.Collections.Concurrent/src/System/Collections/Concurrent/BlockingCollection.cs @ 1659]
00007F4A59F367F0 00007F49E8577629 Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.ProcessLogQueue() [/_/src/Logging/Logging.Console/src/ConsoleLoggerProcessor.cs @ 74]
00007F4A59F36890 00007F49E6F254B2 System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 44]
00007F4A59F368B0 00007F49E6F2ED61 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 172]
00007F4A59F36900 00007F49E6F255CE System.Threading.ThreadHelper.ThreadStart() [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 93]
00007F4A59F36C60 00007f4a608165ff [GCFrame: 00007f4a59f36c60]
00007F4A59F36D30 00007f4a608165ff [DebuggerU2MCatchHandlerFrame: 00007f4a59f36d30]
cc @tmds, @stephentoub: I recall a work item where you guys worked on the Console class; is that backported to 3.1?
I recall a work item where you guys worked on the Console class, is that back ported to 3.1?
There have been a variety of changes to Console. Is there one in particular you're referring to? I skimmed this issue but couldn't find a reference to something System.Console-related. Are you asking because of the ConsoleLoggerProcessor in the call stack? That's not actually part of System.Console; that's part of Microsoft.Extensions.Logging, and just from looking at the source, this stack would be expected if the background thread it spins up to process log messages is waiting for the next log message to arrive.
I am referring to the one where @tmds fixed some deadlocking. Umm, yeah, but would that not write to the console?
Umm yea but would that not write to the console?
Yes, it does eventually. But that's not where this is blocked. This is just sitting waiting for the next item to be added to the queue. It's waiting for someone to log something.
How can you tell that just by looking at the trace, since it would be out of process by then and into the logger's process anyway... right? (Assuming the logger has its own process and it's not just a queue; and if it is just a queue, is it a blocking collection?)
I'm looking just at the part you said was the "interesting part of the clrstack":
00007F4A59F364B0 00007f4a621499f3 [HelperMethodFrame_1OBJ: 00007f4a59f364b0] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007F4A59F365E0 00007F49E6F34B57 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/SemaphoreSlim.cs @ 467]
00007F4A59F36620 00007F49E6F348D4 System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/SemaphoreSlim.cs @ 381]
00007F4A59F366B0 00007F49E8578FBA System.Collections.Concurrent.BlockingCollection`1[[Microsoft.Extensions.Logging.Console.LogMessageEntry, Microsoft.Extensions.Logging.Console]].TryTakeWithNoTimeValidation(Microsoft.Extensions.Logging.Console.LogMessageEntry ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource) [/_/src/System.Collections.Concurrent/src/System/Collections/Concurrent/BlockingCollection.cs @ 698]
00007F4A59F36780 00007F49E85788F1 System.Collections.Concurrent.BlockingCollection`1+<GetConsumingEnumerable>d__68[[Microsoft.Extensions.Logging.Console.LogMessageEntry, Microsoft.Extensions.Logging.Console]].MoveNext() [/_/src/System.Collections.Concurrent/src/System/Collections/Concurrent/BlockingCollection.cs @ 1659]
00007F4A59F367F0 00007F49E8577629 Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.ProcessLogQueue() [/_/src/Logging/Logging.Console/src/ConsoleLoggerProcessor.cs @ 74]
That's just sitting here: https://github.com/dotnet/runtime/blob/c8f1f2bf6457a9b503d0d66d6fcf8aa886944947/src/libraries/Microsoft.Extensions.Logging.Console/src/ConsoleLoggerProcessor.cs#L75 waiting for someone to call https://github.com/dotnet/runtime/blob/c8f1f2bf6457a9b503d0d66d6fcf8aa886944947/src/libraries/Microsoft.Extensions.Logging.Console/src/ConsoleLoggerProcessor.cs#L32.
So it's that blocking collection that is stuck for some reason? And that reason is it's waiting for a write and has nothing to do with deadlocking?
So it's that blocking collection that is stuck for some reason?
Based just on that stack, that particular thread is simply waiting for something to be logged. Practically any time you take a dump, you'd see that stack, as it just waits on the collection for something to arrive, quickly pulls it out and writes it to the console, and then loops around and blocks waiting for the next thing.
Well guess what, I think it dead locked writing to the console...
And if it's not then can you say how you KNOW that for sure please so I can clearly understand, because right now I can't.
(Yet it's 12:15 am here and I have been up since 6am and have to be up around that time again tomorrow)
That stack doesn't show it in WriteMessage. It shows it in the blocking collection's enumerator.
00007F4A59F36780 00007F49E85788F1 System.Collections.Concurrent.BlockingCollection`1+<GetConsumingEnumerable>d__68[[Microsoft.Extensions.Logging.Console.LogMessageEntry, Microsoft.Extensions.Logging.Console]].MoveNext() [/_/src/System.Collections.Concurrent/src/System/Collections/Concurrent/BlockingCollection.cs @ 1659]
How can it move next and write if the previous one is deadlocked? Which is why I think ProcessLogQueue is the last call. If that is not a deadlock, what is it exactly? I thought I had the stack... turns out that machine is turned off, so I might have lost it if the disk got wiped. Will let you know next time we boot her up.
I don't understand what you're asking.
It's certainly possible that some code elsewhere is deadlocked and as a result it's not logging and as a result this code is getting nothing to write. But looking solely at the stack trace shared for this thread, this thread appears to be behaving normally.
Stuck in ProcessLogQueue? Normal? I disagree. I don't see what I am missing here except sleep. Thank you for your time.
Stuck in ProcessLogQueue? Normal?
Yes. This code has a dedicated thread that sits in a loop accessing a blocking collection. It waits for the next item, takes it, writes it out, and goes back to waiting for the next item. It will spend most of its life blocked waiting for the next piece of data to be logged. When data is logged, the logger adds it to the blocking collection.
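As an illustrative analogue of that loop (Python's queue.Queue standing in for BlockingCollection here; this is not the actual Microsoft.Extensions.Logging code):

```python
import queue
import threading

log_queue = queue.Queue()
written = []

def process_log_queue():
    # Analogue of ConsoleLoggerProcessor.ProcessLogQueue: block until an
    # entry arrives, write it out, and loop. The thread spends most of
    # its life blocked inside get(), which is the stack seen in the dump.
    while True:
        entry = log_queue.get()
        if entry is None:        # sentinel used only to end this sketch
            break
        written.append(entry)    # stands in for writing to the console

t = threading.Thread(target=process_log_queue)
t.start()
log_queue.put("first message")
log_queue.put(None)
t.join()
print(written)  # ['first message']
```

Dumping this at an arbitrary moment would almost always catch the thread blocked in get(), even though nothing is wrong.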
The fact it has a dedicated thread points even more at ... idk.. something else... (deadlock).
(My program doesn't log to the console at that point)
Other things to mention here (I will open a new issue if you need): dotnet-trace always appends to the output file name given with -o, e.g. -o trace.netrace results in a trace.trace.netrace file (Windows and Linux).
Speedscope format does not appear to be working for me; even when I convert from a trace file with known good data, my speedscope JSON is empty (Windows and Linux).
Both of those issues seem orthogonal to this issue. Please create a separate issue.
I'm still not sure what caused your issue, but I'm not convinced that it was the tools that caused the issue. The tools don't have any code paths that would delete the underlying transport like you're seeing. The waitpid/createdump issue seems odd to me and could be problematic, but ultimately wouldn't be affected by anything in this repo. All that code lives in dotnet/runtime. As @mikem8361 pointed out, waitpid should return once createdump exits, regardless of what causes the exit. Again, that issue shouldn't delete the underlying transport either, since createdump doesn't know anything about the diagnostics server or the transport.
If this is readily reproducible, then we could investigate the behavior on our end.
Out of curiosity, how did you run your application? Specifically, did you run your app with dotnet exec <dll> directly, or did you run it with some daemon like systemd or similar?
I still have the dump and can replicate this on my Azure Ubuntu machine every time. I am confused about what the ask is here. I still have the dump, or I can take a new one; let me know ASAP.
And by replicate:
using System.Threading;

public class Test
{
    public static void Main()
    {
        int i = 0;
        while (true)
        {
            Thread.Sleep(100);
            unchecked { i++; }
        }
    }
}
Attach to this, dump it, and try to do it again.
On Windows it works fine, but on Ubuntu it only works the first time.
Both of those issues seem orthogonal to this issue. Please create a separate issue. [...] Out of curiosity, how did you run your application?
Can you suggest the title/repo?
"dotnet tools output file name is not correct when extension is specified"?
And which repo, this one?
And obviously a 2nd for the format issue.
@juliusfriedman what @josalem was asking is how you executed the app that you are dumping. Was it dotnet run, or dotnet exec <path to dll>, or is it the executable version of your app? Do you use single-file publishing? Is it running as the same user that you use with dotnet-dump? I tried your repro, but I can dump several times.
I believe microsoft/perfview is the repo that has the logic for format conversion. We've had bugs fixed, but I don't know if your version of dotnet-dump has them (maybe look at dotnet-dump --version).
For the naming issue, you can open it on this repo and we'll investigate where it happens (could be in the tool/library code or the runtime).
@juliusfriedman what @josalem was asking is how did you execute the app that you are dumping. [...] For the naming issue, you can open on this repo and we'll investigate where it happens (could be in the tool/library code or the runtime).
I am sorry. Here is what I ran:
unzip file.zip
sudo cp -r file /mnt/
cd /mnt/file
sudo chmod +x MyProgram
screen
ulimit -c unlimited
./MyProgram arguments
CTRL A + CTRL + D
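One thing worth noting about that transcript: ulimit is a shell builtin, so the core-dump limit has to be raised in the same shell (inside the screen session) that launches the process; running sudo ulimit -c unlimited as a separate command does not affect it. A quick check from inside the launching shell:

```shell
# Raise the soft core-dump limit in this shell, then verify it. Child
# processes launched from this shell inherit the raised limit.
ulimit -c unlimited
ulimit -c   # prints "unlimited" once the soft limit is raised
```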
If your app in a zip repros it (the one you shared), can you share that? I'll try to take a look soon, with the mount directory and screen. Not sure how that changes things, but I can't really think of a reason you're seeing this behavior.
The other thing I can imagine us testing to rule out some components: right next to your self-contained app you should find createdump. You can call it as ./createdump -d -h -f <FULL PATH TO DUMP> <PID>. If you call it a few times, does it work?
Unfortunately we refactored that program to both show more output and be faster, so if it does repro I will definitely report it; over 10+ runs on Windows on a 12-core system with 64 GB of RAM, no issues thus far.
Moving to the Ubuntu machine for mass data processing very soon.
If this is important to you I will make time to do it now. Please let me know asap.
You should be able to repro this on any Azure Ubuntu machine with the demo program above...
It's not critical now. I'll try using an 18.04 VM then.
Run the same command again later...
This occurs with both dotnet-trace and dotnet-dump, running on: