Stopping RTSP client before RTSP exec producer starts causes all future streams to fail

tsightler commented 5 months ago

Hi @AlexxIT, thanks for the fix for kill signal with exec, unfortunately, it seems this new way of killing processes have tickled a new bug. First a quick reminder of the process my code uses to start/stop RTSP producer streams to go2rtc. Streams are registered in go2rtc as exec that starts a bash script, so something like this (it's a little more than this, but this is the basic idea):

exec:start-stream.sh {camera_id} {output}

When a client connects to the RTSP URL for the stream, go2rtc executes that script, which communicates with a secondary process maintaining a constant connection to the cameras cloud API (without this, starting a stream takes too long). This process performs the steps needed to start stream from the specific camera, spawns an ffmpeg process that receives that stream from the cloud and forwards it as an RTSP producer to go2rtc.

When the last client disconnects go2rtc, using the new kill process in 1.9,4, kills the script via SIGTERM (the standard signal to kill a child process) and the bash script traps this, sends a signal to the upstream process to stop the stream, which kills the local ffmpeg process.

All of this works perfectly, however, when go2rtc executes start-stream.sh it takes, on average, about 2-3 seconds for all of the cloud setup to start and the stream to arrive, in some extreme cases it can take 5 seconds or more. If a client connects, and then disconnects before the stream actually starts, go2rtc will kill the start-stream.sh process as expected, and thus the stream never starts, but it's like go2rtc doesn't recognize this since it never had an RTSP producer. Future clients that connect only log a message that there is a new consumer, but go2rtc doesn't execute the command again because it feels like it's still expecting the producer from the previous run, even though it killed itself.

It's like a little bit of a race condition, I'd guess it would be very difficult to trigger if running ffmpeg directly, since the RTSP producer connection would happen almost immediately, but in my case, because there's several seconds of delay, it opens this window where a client can disconnect before the producer starts.

Interestingly, this did not happen with version <=1.9.2, but I think that's because the code didn't actually kill the exec process, so, in this case, based on what I can see, if the client disconnects, the ffmpeg ends up still starting the producer, even though there are no clients, and then go2rtc eventually disconnects the RTSP process which shuts everything down.

I could probably work around this, since I'm already trapping the TERM signal, by delaying the shutdown for a few seconds, but I still thought this was probably a bug worth fixing.

AlexxIT commented 5 months ago

Are you sure process actually stopped? Are you using killtimeout?

tsightler commented 5 months ago

Yes, process stops just fine, I get the log message from the script that it is being terminated (this will only happen if it gets proper TERM signal) and see the stop received by the secondary process, which is sent from the script prior to terminating, so that part is working great. Actually, that's why the ffmpeg process never starts, because the upstream process receives the stop command, unlike with previous versions.

TL;DR -- When using the exec source, if the client disconnects before the producer starts, the consumer isn't removed from the list. Because of this, go2rtc think the stream still has an existing consumer and future consumers do not trigger the script to execute again.

I believe you could reproduce this with a shell script that does nothing but sleep in a loop. I can try to do that if it makes it easier to understand. Everything else I wrote is just noise I guess.

tsightler commented 5 months ago

I will try to reproduce tomorrow with a simpler test case if needed, I'll admit it's quite late here and I've had a long day, so perhaps I'm missing something obvious. A quick look at the code didn't make it clear how this could be the case, perhaps I'm correct on the symptom, but incorrect on the root cause.

tsightler commented 4 months ago

I've been trying to track down this issue, but it feels more unpredictable than I at first thought, for example, the problem will occur even when the stream fully starts and I don't terminate it early, it's just less common to fail at that point. However, I have found a pattern.

In the config I set the kill signal to 15 (SIGTERM, the standard signal to kill a child process in Unix). I can tell my script receives this as it logs a message and sends a command to the secondary process prior to exiting, this always works and the script exits. However, I also set the killtimeout to 15 seconds, which, based on my understanding, should send SIGKILL to process if it hasn't exited in 15 seconds. What is strange is that I sometimes still see this in the logs, even though I can see that my process has exited. It's after this message that things don't work anymore.

So I decided to get rid of the killtimeout, and monitored the processes. When the client disconnects, go2rtc sends the kill signal and, based on monitoring the script process, it exits quickly, within 1 second, but it kind of feels like go2rtc doesn't always detect this. When it works I get logs like this:

10:38:37.717 DBG [rtsp] handle error="read tcp 172.30.154.239:8554->172.30.144.1:52517: read: connection reset by peer" +0ms
10:38:37.717 DBG [streams] stop producer url=exec:/home/tsightler/GitHub/ring-mqtt/scripts/start-stream.sh 3ce1a15e2437 live ring/a9ce28cb-fafb-421c-4508-29d54f48cbd3/camera/3ce1a15e2437 {output}#killsignal=15 +0ms
10:38:37.717 TRC [exec] kill with signal=15 +0ms
10:38:37.729 DBG [rtsp] disconnect stream=3ce1a15e2437_live +11ms
10:38:37.729 TRC [streams] stop reconnect url=exec:/home/tsightler/GitHub/ring-mqtt/scripts/start-stream.sh 3ce1a15e2437 live ring/a9ce28cb-fafb-421c-4508-29d54f48cbd3/camera/3ce1a15e2437 {output}#killsignal=15 +0ms

When it does not work the last two lines are missing:

10:35:56.166 DBG [rtsp] handle error="read tcp 172.30.154.239:8554->172.30.144.1:52357: read: connection reset by peer" +0ms
10:35:56.166 DBG [streams] stop producer url=exec:/home/tsightler/GitHub/ring-mqtt/scripts/start-stream.sh 3ce1a15e2437 live ring/a9ce28cb-fafb-421c-4508-29d54f48cbd3/camera/3ce1a15e2437 {output}#killsignal=15 +0ms
10:35:56.166 TRC [exec] kill with signal=15 +0ms

So go2rtc sent the kill signal, the script exited (based on watching the process) but there's never a disconnect line and thus go2rtc will never attempt to run the producer script again.

As it is intermittent, I'm a little at a loss as to what can be happening here, but I'll try to dig a little bit in the code and see what is triggering these next steps, but it is reproducible. If I stop and start the stream a few times, it will eventually end up in this state.

tsightler commented 4 months ago

@AlexxIT I'm starting to think this is some kind of go bug, or at least a behavior I don't understand. I added some debugging to the code and it appears, in seemingly random cases, that it hangs at the "wait" case c.cmd.Wait() from the return.

I put a debug line in the rtsp.go immediately prior and after the call to close. Most of the time I get both logs, but sometimes I only get the first one. When I added some debug to closer.go, it will log up to the last line prior to the return.

Looking at the docs, it seems like there are cases where Wait() will hang for stdin/out/err, so maybe something is happening there? I'm not sure honestly, grasping at straws really, but it feels like that is what is happening.

It's super confusing because it will sometimes fail the first time, sometimes work 10 times in a row prior to failing, very difficult to understand the behavior.

tsightler commented 4 months ago

While continuing to dig on this I decided to try the setup with mediamtx instead of go2rtc as it only required a very minor modification to the script and a different config file format and everything works perfectly there no matter how many times I start and stop the stream, the script always shuts down and child scripts are reaped nearly immediately in all cases. I'm not sure what that tells us as there's not a ton of difference in how mediamtx starts and stops exec processes vs go2rtc, but I'm trying to dig and understand what subtle differences there might be.

tsightler commented 4 months ago

I'm going to close this. I'm still not 100% sure what is going on, but I have started to understand a little more what is happening. One thing that mediamtx does differently is to setup the process group ID on the spawned process, and sends it's kill signal to the process group which means all spawned processes from the script, even backgrounded process, still get the signal. That's a big reason of why it behaves differently. I implemented the same logic in go2rtc and the behavior improved dramatically. I'm not saying that this is right or wrong thought, it's just different from how go2rtc works today.

However, in my case, I wouldn't expect that to be an issue because my script does trap all kill signals and it sends signals to the underlying processes so that they cleanup and exit. It almost seems like, every now and again, wait() gets hung because the process either a) already exited, or perhaps there something in stdout/stderr still waiting, I'm not sure. I don't see any processes or file descriptors open when this happens, but it's clear that wait() just ends up waiting forever for some reason. Maybe it's something to do with bash and how it handle file descriptors.

Regardless, with some minor refactoring it seems like I've managed to mostly address the behavior. I still see a case once in a while where the process appears to exit but go2rtc doesn't seem to recognize it, but it's become very difficult to reproduce so, until I have a better test case I'm just going to close this.

AlexxIT commented 4 months ago

I don't understand what you choose as fix. Modified go2rtc version or mediamtx.

tsightler commented 4 months ago

Sorry, to be more clear, I'm still using stock go2rtc exactly as it is delivered here, no modifications. I used mediamtx only to see if it behaved differently and, when it did, I modified a local go2rtc to use this same method and this also improved the behavior of go2rtc with my scripts. However, in the end I was able to use what I learned to modify my scripts and get the same behavior, even with stock go2rtc,. Basically I created a controller script that always sends signals to the child scripts, which is effectively what mediamtx does by default using the process group mechanism.

I still have a problem that exhibits sometimes, after 10 ore more successful start/stop cycles, go2rtc just seems to stop running the exec commands, even though I see no indication of any failure (the previous stop looks successful), so I need to dig into this, but I'll open a new issue if I can figure out a consistent way to reproduce.

AlexxIT / go2rtc

Stopping RTSP client before RTSP exec producer starts causes all future streams to fail #1204