FLV context deadline exceeded

dagleaves commented 5 months ago

I come bringing an interesting issue. I am seeing issues similar to #215 in case anything has changed since then. Looking back at the logs, for the location that uses Reolink cameras via HTTP FLV, I see an interesting pattern that resulted in all cameras going down at the same time.

This was experienced with Reolink cameras, go2rtc v1.9.3, and using HTTP FLV as the connection protocol.

undefined error=context deadline exceeded (Client.Timeout or context cancellation while reading body) caller=github.com/AlexxIT/go2rtc/internal/streams/producer.go:171

Followed by, "write tcp 127.0.0.1:8554->127.0.0.1:44922: write: broken pipe" The broken pipe seems to be because the stream dropped, the audio ffmpeg producer gets an invalid data received error EOF, exec/rtsp\n[in#0 @ 0x5d10eea068c0] Error opening input: Invalid data found when processing input\nError opening input file. Error opening input files: Invalid data found when processing input

Looking back at the logs for all my setups, I see no issues with RTSP or ONVIF (via RTSP) with non-Reolink cameras. The HTTP FLV seems to have consistent errors of dropping connections, timeouts, and invalid data. It also doesn't just happen with one NVR, when it does happen, it drops connection with almost every NVR.

I also notice that I was experiencing #767 and didn't notice (again only with the Reolink HTTP FLV, my other locations seem fine). After switching to v1.9.3, I still experienced the context timeout issue. Only now, go2rtc doesn't crash because #767 is fixed. That seems good, but from what I can tell, by go2rtc crashing, it was allowing it to reconnect faster to those FLV streams that had timed out. Now, it does not reconnect because go2rtc didn't crash. (Edit: see below, I don't think this is true, I think it's just more noticeable and might take longer to reconnect)

This is unfortunate because of the massive delay and stability issues RTSP was having with the Reolink cameras. Any suggestions regarding the FLV timeout issue? Anything change since then? I've switched back to the previous version in the meantime. I am wondering if this location has a network issue causing connection instability every so often, which is dropping the connections to the NVRs by timeout, unable to reconnect, crashing (and with 1.9.3 not crashing anymore).

dagleaves commented 5 months ago

One location seems fine even after the timeout issue on 1.9.3, so I'm guessing it's that because all of their NVRs lost connection at the same time, they noticed the streams drop. The connection that is fine only has 1 Reolink NVR, so they probably don't notice it dropping and reconnecting as severely.

I wonder if the Reolink NVRs disconnect FLV clients after 60 minutes of connection. Seems like that's the pattern I am seeing, about an hour of being connected and they get a timeout error

dagleaves commented 5 months ago

Some added context

I am using an audio stream with the Reolink cameras in order to transcode it to OPUS to use with WebRTC

- ffmpeg:CAMERA#audio=opus

In 1.9.2, there is an issue that would cause the passive consumers for the audio channel to never leave and continue piling up until the server crashed. I see upwards of 40 of these guys on each stream,

{
"type": "RTSP passive consumer",
"url": "rtsp://127.0.0.1:8554/CAMERA?audio",
"remote_addr": "127.0.0.1:33194",
"user_agent": "go2rtc/ffmpeg",
"sdp": "v=0\r\no=- 1 1 IN IP4 0.0.0.0\r\ns=go2rtc/1.9.1\r\nc=IN IP4 0.0.0.0\r\nt=0 0\r\nm=audio 0 RTP/AVP 96\r\na=rtpmap:96 MPEG4-GENERIC/16000/1\r\na=fmtp:96 streamtype=5;profile-level-id=1;mode=AAC-hbr;sizelength=13;indexlength=3;indexdeltalength=3;config=1408\r\na=control:trackID=0\r\n",
"medias": [
"audio, sendonly, ANY"
],
"senders": [
"96 MPEG4-GENERIC/16000/1, bytes=1496576, receivers=1"
],
"send": 1555036
}

Now, in 1.9.4, this does not seem to be an issue. However, the stream will disconnect and never reconnect once it times out.

dagleaves commented 5 months ago

Documenting for myself in case I figure it out

The context issue is fine. The video seems to reconnect itself. If I do not use an ffmpeg audio stream to convert to opus, everything is fine. If I never leave the stream, it will continue playing video. If I try to join after this happens, it will not display anything. It will not show the new client in the consumer list.

The issue is that when the context error happens, ffmpeg tries to restart the audio stream when it reaches EOF. For whatever reason, the producer to OPUS is fine but the consumer from aac does not restart.

Network vis shows well:

Broken:

Normal:

dagleaves commented 5 months ago

The audio will have a normal EOF error and reconnect. But when it dies, it is because of a timeout error. Adding some debug statements to see if I can pin down why it's not reconnecting

dagleaves commented 5 months ago

Once an EOF happens, the server will get a few OPTIONS requests, then the opus producer will timeout

warn | undefined error=read tcp 127.0.0.1:8554->127.0.0.1:37520: i/o timeout

After the producer timeout, it will get another OPTIONS and only then will it get a DESCRIBE and reconnect the consumer.

After the audio i/o timeout stream error (not the ffmpeg producer timeout), there are no more OPTIONS requests logged by the server. I assume the tcpHandler server is still up, but whoever is sending the OPTIONS is dead? Not sure yet.

dagleaves commented 5 months ago

Don't have much experience with golang, so tracing things is tough. Hard to tell which handler the error is happening on. I'm assuming it's the ffmpeg consumer for the from aac, but I can't tell where that gets initialized or why it would not reconnect itself like when it happens with an EOF.

dagleaves commented 5 months ago

EOF is sometimes caused by the following, but not always :shrug: Still no lead on where the i/o timeout error is occurring. Not in conn.ReadResponse anyways. Has to be in the Handle function somewhere

2:01:36.856 AM | debug | [exec] [in#0/rtsp @ 0x57440c8b78c0] Error during demuxing: Connection timed out
2:01:36.856 AM | debug | [exec] [in#0/rtsp @ 0x57440c8b78c0] Error retrieving a packet from demuxer: Connection timed out
2:01:36.858 AM | trace | [rtsp] server request:TEARDOWN rtsp://127.0.0.1:8554/7DS2-5?audio RTSP/1.0Cseq: 9User-Agent: go2rtc/ffmpegSession: 49924033
2:01:36.858 AM | debug | [rtsp] handle error=EOF
2:01:36.858 AM | debug | [rtsp] disconnect stream=7DS2-5

dagleaves commented 5 months ago

If I replace the audio source with the camera directly instead of using the go2rtc stream, it works fine it's just slower to connect. Need to figure out why the producer times out when the direct stream doesn't.

AlexxIT / go2rtc

FLV context deadline exceeded #1197