dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.61k stars 4.56k forks source link

Process.WaitForExit always times out after a different particular process runs #88862

Closed dcpower0 closed 12 months ago

dcpower0 commented 1 year ago

I am having some trouble understanding what is happening with a process launched with the Process class with a C# app running on Linux arm64 and I am hoping I've found right place to seek help.

The app use the Process class to grab various pieces of interesting data from Linux using "bash -c". The app also has a browser based on a library called SpiderEye that uses webkitgtk.

What I see is that the Process to call bash works find until I launch the browser. Then suddenly the process times out and is never successful again. If do not launch the browser, the bash process works as expected.

Oddly the process does run, it just never completes. Here is a output trace - note the lines that start with Success and Timeout:

60.8
Success: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'"
Loaded '/data/bin/run/SpiderEye.Linux.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
Loaded '/data/bin/run/SpiderEye.Core.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
Loaded '/data/bin/run/Rebex.Castle.dll'. Module was built without symbols.
Loaded '/usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.11/netstandard.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
Could not determine the accessibility bus address
The thread 0x1703 has exited with code 0 (0x0).

** (dotnet:5858): WARNING **: 18:44:45.885: GDK is not able to create a GL context, falling back to glReadPixels (slow!): No GL implementation is available
22.2
Timeout: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'"
92.9
Timeout: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'"

After reviewing the various issues related to redirection, I focused trying different ways to redirect - either no redirection (as in the trace), or redirection to a file in the bash -c command, or redirection using the Process. The behavior didn't change.

As a result I get zombie bash processes piling up. Notice the child created by process of bash -c is reaped, but bash is not. image

I've got strace going and poking through a log but so far it hasn't helped my understanding of where the problem lies.

Can anyone comment on this?

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process See info in area-owners.md if you want to be subscribed.

Issue Details
I am having some trouble understanding what is happening with a process launched with the Process class with a C# app running on Linux arm64 and I am hoping I've found right place to seek help. The app use the Process class to grab various pieces of interesting data from Linux using "bash -c". The app also has a browser based on a library called SpiderEye that uses webkitgtk. What I see is that the Process to call bash works find until I launch the browser. Then suddenly the process times out and is never successful again. If do not launch the browser, the bash process works as expected. Oddly the process does run, it just never completes. Here is a output trace - note the lines that start with Success and Timeout: ``` 60.8 Success: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'" Loaded '/data/bin/run/SpiderEye.Linux.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled. Loaded '/data/bin/run/SpiderEye.Core.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled. Loaded '/data/bin/run/Rebex.Castle.dll'. Module was built without symbols. Loaded '/usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.11/netstandard.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled. Could not determine the accessibility bus address The thread 0x1703 has exited with code 0 (0x0). ** (dotnet:5858): WARNING **: 18:44:45.885: GDK is not able to create a GL context, falling back to glReadPixels (slow!): No GL implementation is available 22.2 Timeout: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'" 92.9 Timeout: Bash-c "top -b -n 1 | grep -m 1 %Cpu | awk '{print $8}'" ``` After reviewing the various issues related to redirection, I focused trying different ways to redirect - either no redirection (as in the trace), or redirection to a file in the bash -c command, or redirection using the Process. The behavior didn't change. As a result I get zombie bash processes piling up. Notice the child created by process of bash -c is reaped, but bash is not. ![image](https://github.com/dotnet/runtime/assets/138262433/65cb5c1f-e4f1-46b8-ba31-c6a64964c6e6) I've got strace going and poking through a log but so far it hasn't helped my understanding of where the problem lies. Can anyone comment on this?
Author: dcpower0
Assignees: -
Labels: `area-System.Diagnostics.Process`
Milestone: -
adamsitnik commented 1 year ago

@tmds could you help us here?

tmds commented 12 months ago

What version of .NET runtime version are you using?

The app use the Process class to grab various pieces of interesting data from Linux using "bash -c". The app also has a browser based on a library called SpiderEye that uses webkitgtk.

SpiderEye/webkitgtk could be reaping the bash process that is started by .NET. You can find some guidance to debug this in https://github.com/dotnet/runtime/issues/33297.

ghost commented 12 months ago

This issue has been marked needs-author-action and may be missing some important information.

dcpower0 commented 12 months ago

I am using .NET runtime version 6.0.11.

I'll have a look at that issue and see if that helps.

Thanks!

dcpower0 commented 12 months ago

I've used strace to capture some activity. I used the debugger to halt before first time bash is called, then started strace using the configuration indicated in #33297. Then I continued the dotnet application and halting at the bash call three times. This captured the system calls for the first successful call to bash, the starting up of the webkit processes, and two calls to bash that timed out. There is a difference between the success bash call and the timed out ones. I can't say I know what it means though. Does the capture below look like it could be the root cause? A webkitgtk library reaping another processes child process?

There is a curious entry for "[pid 11487] wait4(11515, 0xaaab0b364974, WNOHANG, NULL) = 0" after the timed out call to bash. It's curious because when I look for pid 11487 with ps -A, it doesn't exist. Seems weird, but maybe a red herring.

Capture below (I've trimmed some SIGTRAP lines to shorten it) root 11405 33.6 11.5 3671240 115168 ? Ssl 17:34 0:06 _ /data/vsdbg/vsdbg --interpreter=vscode SHELL=/bin/sh PW root 11418 44.5 8.6 4189904 86444 ? Sl 17:34 0:08 _ /usr/share/dotnet/dotnet _StartupHWTargetGen2Device cxchp2:/usr/share # strace -qq -f -e wait4,waitpid,waitid,execve -p 11418 [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=630141260, si_uid=65535} --- [pid 11465] execve("/bin/bash", ["/bin/bash", "-c", "echo hello"], 0xfffeac004b10 / 15 vars /) = 0 [pid 11456] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11465, si_uid=0, si_status=0, si_utime=0, si_stime=0} --- [pid 11467] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11468] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11433] waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11465, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WNOWAIT, NULL) = 0 [pid 11433] wait4(11465, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 11465 [pid 11433] waitid(P_ALL, 0, 0xffff9aefe510, WNOHANG|WEXITED|WNOWAIT, NULL) = -1 ECHILD (No child processes) [pid 11476] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11476] --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=11418, si_uid=0} --- [pid 11476] --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=11418, si_uid=0} --- [pid 11477] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11512] execve("/usr/libexec/webkit2gtk-4.0/WebKitNetworkProcess", ["/usr/libexec/webkit2gtk-4.0/WebK"..., "8", "117"], 0xaaab0b02dd00 / 16 vars / <unfinished ...> [pid 11513] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11512] <... execve resumed>) = 0 [pid 11418] wait4(11512, 0xaaab0b0388e4, WNOHANG, NULL) = 0 [pid 11514] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=633686624, si_uid=65535} --- [pid 11515] execve("/usr/libexec/webkit2gtk-4.0/WebKitWebProcess", ["/usr/libexec/webkit2gtk-4.0/WebK"..., "12", "120"], 0xaaab0b3f6310 / 16 vars /) = 0 [pid 11418] wait4(11515, 0xaaab0b364974, WNOHANG, NULL) = 0 [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=644039824, si_uid=65535} --- [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=630141260, si_uid=65535} --- [pid 11558] execve("/bin/bash", ["/bin/bash", "-c", "echo hello"], 0xfffeac02e5f0 / 15 vars /) = 0 [pid 11476] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11558, si_uid=0, si_status=0, si_utime=0, si_stime=1} --- [pid 11487] wait4(11515, 0xaaab0b364974, WNOHANG, NULL) = 0 [pid 11487] wait4(11512, 0xaaab0b0388e4, WNOHANG, NULL) = 0 [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=644039824, si_uid=65535} --- [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=630141260, si_uid=65535} --- [pid 11574] execve("/bin/bash", ["/bin/bash", "-c", "echo hello"], 0xfffeac02e5f0 / 15 vars /) = 0 [pid 11456] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=11574, si_uid=0, si_status=0, si_utime=0, si_stime=0} --- [pid 11487] wait4(11515, 0xaaab0b364974, WNOHANG, NULL) = 0 [pid 11487] wait4(11512, 0xaaab0b0388e4, WNOHANG, NULL) = 0 [pid 11456] --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=644039824, si_uid=65535} ---

tmds commented 12 months ago

The first time bash dies we can see .NET look for children that terminated using waitid. The second and third time that no longer happens.

It seems the .NET SIGCHLD handler is overridden by SpiderEye/webkitgtk.

That matches with this reported issue: https://gitlab.gnome.org/GNOME/glib/-/issues/2216

dcpower0 commented 12 months ago

I am thoroughly impressed with you being able to so quickly pin down a suspected root cause, and even link to the problem and fix in the glib repo!

After updating glib to a later version that contains that fix described in the linked issue above, I find that the issue has gone away. My dotnet processes calling bash work as expected and no more zombie processes.

Thank you kindly for your time.

dcpower0 commented 12 months ago

I consider this resolved.