AlexxIT / go2rtc

Ultimate camera streaming application with support RTSP, RTMP, HTTP-FLV, WebRTC, MSE, HLS, MP4, MJPEG, HomeKit, FFmpeg, etc.
https://github.com/AlexxIT/Blog
MIT License
3.79k stars 309 forks source link

Hang with exec when using bash script (maybe any forked process) #1243

Open tsightler opened 3 days ago

tsightler commented 3 days ago

Hi @AlexxIT,

I'm back with an attempt to provide a more complete description of this issue as well as a simple configuration to reproduce. I'm still not 100% sure of the cause or why it is more pronounced when using bash, but I'm starting to believe it could happen with any command, given enough cycles. It's definitely more pronounced with anything that spawns extra processes, because the default behavior is to use SIGKILL, which will leave child processes running since the signal is only sent to the parent.

However, even when the primary exec process properly handles signals and kills children prior to exiting, the problem still occurs. The simplest way to reproduce the issue is with the following go2rtc config:

log:
  level: debug
streams:
  stream1: exec:ffmpeg -re -i /media/BigBuckBunny.mp4 -c copy -rtsp_transport tcp -f rtsp {output}#killsignal=15
  stream2: exec:bigbuckbunny.sh {output}#killsignal=15

Obviously stream1 just runs ffmpeg directly, but stream 2 instead uses the following simple bash script:

#!/bin/bash
# This script simply runs ffmpeg to stream to RTSP
output=${1}  # URL for publishing RTSP stream

cleanup() {
    # Signal all background processes and wait for them to shutdown prior to exiting
    local pids=$(jobs -pr)
    [ -n "$pids" ] && kill $pids
    wait
    exit 0
}

# Trap signals from go2rtc so that background ffmpeg process can be killed
trap cleanup INT TERM

echo "Starting ffmpeg..." 1>&2
ffmpeg -re -i /media/BigBuckBunny.mp4 -c copy -rtsp_transport tcp -f rtsp ${output} &
wait
cleanup

Technically both stream1 and stream2 should behave the same, and mostly they do. The only difference is that stream2 uses a bash script to spawn ffmpeg in the background, then traps the TERM signal and kills ffmpeg itself prior to existing. You'll still see in the debug output that ffmpeg gets the kill signal and exits, as does the script, but, about 1 time out of 5, the cmd.wait() call will no return, and thus new connection will not cause the script to be run. Here's what a successful exit looks like:

14:26:27.654 DBG [streams] stop producer url=exec:bigbuckbunny.sh {output}#killsignal=15
14:26:27.658 DBG [exec] frame=   49 fps= 21 q=-1.0 Lsize=N/A time=00:00:02.34 bitrate=N/A speed=1.02x
14:26:27.658 DBG [exec] video:657kB audio:37kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
14:26:27.658 DBG [exec] Exiting normally, received signal 15.
14:26:27.764 DBG [rtsp] disconnect stream=stream2

And here is a failed:

14:26:36.401 DBG [streams] stop producer url=exec:bigbuckbunny.sh {output}#killsignal=15
14:26:36.406 DBG [exec] frame=   74 fps= 25 q=-1.0 Lsize=N/A time=00:00:03.06 bitrate=N/A speed=1.02x
video:1026kB audio:48kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
14:26:36.406 DBG [exec] Exiting normally, received signal 15.

So notice there's never a "disconnect stream" message in the second one, and this is because the c.cmd.Wait() never returns. I think there's another subtle hint here in that, in the first example, the last log messages from ffmpeg are both output from go2rtc debug, but on second one, the second line of output from ffmpeg is just straight on stdout. I think this indicates that go has already closed std out for this process.

I'm not 100% sure, but I think this is caused by the fact that there is already a wait() on this process. The go implementation of process.Wait() on Linux appears to use waitid syscall, which is not really designed to have mulitple waiters on the same process and this also seems to be considered bad practice in go in general. I think it is somehow causes a very subtle race condition. The strace between a working an non-working case are also subtly different, I can post those if interested.

Now, I'm not sure what the correct fix is here, it's seems a little complex because there's not a second wait for the pipe case, only for the pure exec case. I thought about having completely separate functions for pipe vs exec, then we could just use a single wait, but again, I'm quite weak with golang, and a few attempts at asking AI didn't quite get me there, but since there's a simple reproducer, I thought I would post here as you might have better understanding and it's certainly possible my analysis is just wrong.

skrashevich commented 3 days ago

add EXIT to trap call

tsightler commented 3 days ago

add EXIT to trap call

It doesn't help. Note that the script above has no way to exit without calling cleanup anyway, but adding it does not change the behavior.

skrashevich commented 3 days ago

What is the purpose of such an implementation of bigbuckbunny.sh? What problem are you trying to solve? IMHO, it could be implemented much more simply, like this:

#!/bin/sh
output=${1}  # URL for publishing RTSP stream
echo "Starting ffmpeg..." >&2
ffmpeg -re -i /media/BigBuckBunny.mp4 -c copy -rtsp_transport tcp -f rtsp ${output} 2> /dev/null
tsightler commented 3 days ago

The purpose of bigbuckbunny.sh is just to provide a simple way to demonstrate the problem that can be easily setup to run anywhere with no depdencies. In the actual use case for my project I don't directly run ffmpeg like this, rather it is a small set of scripts which communicate with an upstream process that starts and stops the RTSP producer stream.

Also, with your example, ffmpeg is not being explicitly killed by a signal as go2rtc only signals the directly spawned parent process. I don't really think this is an issue as I'm fine with management my own child processes (some other projects set the pgid attribute and then send the signal to the group pid so that all processes receive it, but I actually think that is more incorrect). However, for your script bash receives the signal and just exits, so ffmpeg doesn't receive a signal at all and is dependent on the stream being stopped to kill off ffmpeg.

However, even your script will still reproduce the issue if you start and stop the stream enough. I used your script to create a new "stream3" which ran the script named "bigbuckbunny-simple.sh", which is an exact copy of your implementation. I then started and stopped VLC player about 20 times and eventually got the same problem:

10:07:42.733 DBG [rtsp] new consumer stream=stream3
10:07:42.733 DBG [exec] run rtsp args=["bigbuckbunny-simple.sh","rtsp://127.0.0.1:8554/86a6e5194962328a993febdf7428daff"]
10:07:42.734 DBG [exec] Starting ffmpeg...
10:07:42.823 DBG [exec] run rtsp launch=90.047283ms
10:07:42.823 DBG [streams] start producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:43.579 DBG [rtsp] handle error=EOF
10:07:43.579 DBG [streams] stop producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:43.579 TRC [exec] kill with signal=15
10:07:43.580 DBG [rtsp] disconnect stream=stream3
10:07:44.006 DBG [rtsp] new consumer stream=stream3
10:07:44.006 DBG [exec] run rtsp args=["bigbuckbunny-simple.sh","rtsp://127.0.0.1:8554/86a6e5194962328a993febdf7428daff"]
10:07:44.007 DBG [exec] Starting ffmpeg...
10:07:44.093 DBG [exec] run rtsp launch=86.643037ms
10:07:44.093 DBG [streams] start producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:44.850 DBG [rtsp] handle error=EOF
10:07:44.850 DBG [streams] stop producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:44.850 TRC [exec] kill with signal=15
10:07:44.850 DBG [rtsp] disconnect stream=stream3
10:07:45.447 DBG [rtsp] new consumer stream=stream3
10:07:45.447 DBG [exec] run rtsp args=["bigbuckbunny-simple.sh","rtsp://127.0.0.1:8554/86a6e5194962328a993febdf7428daff"]
10:07:45.448 DBG [exec] Starting ffmpeg...
10:07:45.533 DBG [exec] run rtsp launch=86.086422ms
10:07:45.533 DBG [streams] start producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:46.290 DBG [rtsp] handle error=EOF
10:07:46.291 DBG [streams] stop producer url=exec:bigbuckbunny-simple.sh {output}#killsignal=15
10:07:46.291 TRC [exec] kill with signal=15
10:07:46.961 DBG [rtsp] new consumer stream=stream3
10:07:55.579 DBG [rtsp] new consumer stream=stream3
10:08:03.165 DBG [rtsp] new consumer stream=stream3
10:08:10.293 DBG [rtsp] new consumer stream=stream3

Notice that on the final stop at 10:07:46.961 the kill signal was sent, and the bash script exited as normal (I checked the process list with ps and double-checked for the pid in /proc, it was gone), but there was never a "[streams] stop producer" line because c.cmd.Wait() never returns. I know it is c.cmd.Wait() because I have a custom version that puts it in a context with a timeout and you can see that, on the times when it hangs, the context timeout kicks in and returns anyway and the problem never occurs, although I know this isn't the correct fix.

Overall, I'm confident the issue could be fixed with a modification to only use cmd.Wait() once, I just don't (yet) have the skills to code it correctly. As far as I can see, calling cmd.Wait() twice on the same process should return an error to the second caller anyway, but it's not because of a race condition I think because wait is being called the second time immediately after the kill signal is being sent. If the timing lucky it works, if it's unlucky, the second wait() never returns.

tsightler commented 2 days ago

Submitted PR https://github.com/AlexxIT/go2rtc/pull/1246 for this. The PR simply passes in the existing channel from the first wait to the closer so that it can wait on that vs calling wait a second time. For the pipe case, there's no initial wait() so the code behaves the same as the original code.

I'm still running a gambit of tests, but so far, after 100+ cycles, I've not been able to produce a hang in any case, while previously it would take on average on 10-12 cycles for the hang to happen and, with some test cases, it could happen in just 2-3 attempts.