Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

[BUG] Connection lost with broadcast mode when one of the links is disconnected #2828

Closed. clime closed this issue 1 month ago.

clime commented 9 months ago

Setup: In my testing setup, I have two machines (A, B) with 2 NICs each, so let's call these NICs A0, A1 (machine A) and B0, B1 (machine B). The A0 and B0 NICs are connected directly by an Ethernet cable, and A1 and B1 are connected through a switch.

Action: Disconnected B0 NIC.

Listener's log on machine B:

$ ./srt-xtransmit receive 'srt://0.0.0.0:5999?groupconnect=1&mode=listener' --enable-metrics --metricsfreq 1s --reconnect
22:18:05.223707 [I] SOCKET::SRT srt://0.0.0.0:5999: bound to '0.0.0.0:5999'.
22:18:07.675878 [I] SOCKET::SRT 0xC648E62 (srt://0.0.0.0:5999) Accepted connection 0x4C648E60. TSBPD Latency RCV 120ms, peer 120ms. KM state UNSECURED (RCV UNSECURED, SND UNSECURED). PB key length: 0. Cryptomode . Stream ID: not set.
22:18:07.676494 [I] RECEIVE Latency, us: avg n/a, min n/a, max n/a. Jitter: 0us. Delay Factor: 1us. Pkts: rcvd 0, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:08.676584 [I] RECEIVE Latency, us: avg 118263, min 118208, max 128193. Jitter: 4us. Delay Factor: 9983us. Pkts: rcvd 8339, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:09.676555 [I] RECEIVE Latency, us: avg 118265, min 118211, max 127795. Jitter: 4us. Delay Factor: 9585us. Pkts: rcvd 17815, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:10.676569 [I] RECEIVE Latency, us: avg 118264, min 118212, max 126578. Jitter: 4us. Delay Factor: 8366us. Pkts: rcvd 27290, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:11.676567 [I] RECEIVE Latency, us: avg 118264, min 118212, max 126630. Jitter: 3us. Delay Factor: 8418us. Pkts: rcvd 36766, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:12.676572 [I] RECEIVE Latency, us: avg 118264, min 118211, max 128278. Jitter: 3us. Delay Factor: 10067us. Pkts: rcvd 46242, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:13.676614 [I] RECEIVE Latency, us: avg 118272, min 118210, max 127009. Jitter: 6us. Delay Factor: 8798us. Pkts: rcvd 55720, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:14.676616 [I] RECEIVE Latency, us: avg 118308, min 118211, max 127246. Jitter: 22us. Delay Factor: 9035us. Pkts: rcvd 65198, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:15.676612 [I] RECEIVE Latency, us: avg 118258, min 118211, max 126605. Jitter: 8us. Delay Factor: 8394us. Pkts: rcvd 74675, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:16.676545 [I] RECEIVE Latency, us: avg 118268, min 118209, max 127705. Jitter: 3us. Delay Factor: 9496us. Pkts: rcvd 84153, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:17.676553 [I] RECEIVE Latency, us: avg 118260, min 118212, max 127890. Jitter: 13us. Delay Factor: 9678us. Pkts: rcvd 93633, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:18.676613 [I] RECEIVE Latency, us: avg 118182, min 118135, max 126591. Jitter: 3us. Delay Factor: 8455us. Pkts: rcvd 103114, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:19.676572 [I] RECEIVE Latency, us: avg 118186, min 118133, max 128174. Jitter: 4us. Delay Factor: 10040us. Pkts: rcvd 112594, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:20.676554 [I] RECEIVE Latency, us: avg 118210, min 118132, max 127903. Jitter: 27us. Delay Factor: 9771us. Pkts: rcvd 122071, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:21.676631 [I] RECEIVE Latency, us: avg 118182, min 118131, max 125375. Jitter: 12us. Delay Factor: 7245us. Pkts: rcvd 131550, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:22.676613 [I] RECEIVE Latency, us: avg 118183, min 118130, max 128126. Jitter: 5us. Delay Factor: 9996us. Pkts: rcvd 141028, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:23.676612 [I] RECEIVE Latency, us: avg 118341, min 118133, max 128225. Jitter: 32us. Delay Factor: 10093us. Pkts: rcvd 150503, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:24.676617 [I] RECEIVE Latency, us: avg 118185, min 118132, max 127350. Jitter: 7us. Delay Factor: 9219us. Pkts: rcvd 159984, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:25.676614 [I] RECEIVE Latency, us: avg 118182, min 118132, max 127761. Jitter: 7us. Delay Factor: 9630us. Pkts: rcvd 169466, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:26.676617 [I] RECEIVE Latency, us: avg 118183, min 118132, max 128291. Jitter: 10us. Delay Factor: 10158us. Pkts: rcvd 178947, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:27.676618 [I] RECEIVE Latency, us: avg 118181, min 118132, max 128297. Jitter: 8us. Delay Factor: 10165us. Pkts: rcvd 188386, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:28.676653 [I] RECEIVE Latency, us: avg 118181, min 118131, max 128224. Jitter: 10us. Delay Factor: 9991us. Pkts: rcvd 191974, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:29.676618 [I] RECEIVE Latency, us: avg 118180, min 118120, max 1366253. Jitter: 7us. Delay Factor: 1248047us. Pkts: rcvd 207367, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:30.676629 [I] RECEIVE Latency, us: avg 118288, min 118128, max 128243. Jitter: 6us. Delay Factor: 10115us. Pkts: rcvd 216836, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:31.676616 [I] RECEIVE Latency, us: avg 118288, min 118239, max 118376. Jitter: 4us. Delay Factor: 128us. Pkts: rcvd 221244, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:32.676565 [I] RECEIVE Latency, us: avg 661609, min 660111, max 1347881. Jitter: 100us. Delay Factor: 687700us. Pkts: rcvd 228234, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:33.226459/srt-xtransmit*E:SRT.gr: grp/recv: $1281658464: ABANDONING: opened=false connected=false
22:18:33.226503 [W] RECEIVE read::recv: Connection does not exist
22:18:33.676887 [I] SOCKET::SRT 0xC648E62 (srt://0.0.0.0:5999) Accepted connection 0x4C648E5D. TSBPD Latency RCV 120ms, peer 120ms. KM state UNSECURED (RCV UNSECURED, SND UNSECURED). PB key length: 0. Cryptomode . Stream ID: not set.
22:18:33.677029 [I] RECEIVE Latency, us: avg n/a, min n/a, max n/a. Jitter: 0us. Delay Factor: 1us. Pkts: rcvd 0, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.
22:18:34.677147 [I] RECEIVE Latency, us: avg n/a, min n/a, max n/a. Jitter: 0us. Delay Factor: 1us. Pkts: rcvd 0, reordered 0 (dist 0), lost 0, MD5 err 0, bad len 0.

Caller's log on machine A:

$ ./srt-xtransmit generate srt://172.22.10.236:5999 srt://192.168.10.236:5999 --sendrate 100Mbps --enable-metrics --reconnect
07:18:07.678689 [I] PACER sendrate 100000000 bps (inter send interval 105 us)
07:18:08.678778 [I] GENERATE Sending at 99805 kbps
07:18:09.678862 [I] GENERATE Sending at 99773 kbps
07:18:10.678926 [I] GENERATE Sending at 99763 kbps
07:18:11.679004 [I] GENERATE Sending at 99763 kbps
07:18:12.679077 [I] GENERATE Sending at 99784 kbps
07:18:13.679139 [I] GENERATE Sending at 99784 kbps
07:18:14.679245 [I] GENERATE Sending at 99784 kbps
07:18:15.679298 [I] GENERATE Sending at 99784 kbps
07:18:16.679334 [I] GENERATE Sending at 99794 kbps
07:18:17.679381 [I] GENERATE Sending at 99805 kbps
07:18:18.679451 [I] GENERATE Sending at 99815 kbps
07:18:19.679464 [I] GENERATE Sending at 99805 kbps
07:18:20.679528 [I] GENERATE Sending at 99784 kbps
07:18:21.679540 [I] GENERATE Sending at 99784 kbps
07:18:22.679642 [I] GENERATE Sending at 99794 kbps
07:18:23.679690 [I] GENERATE Sending at 99763 kbps
07:18:24.679775 [I] GENERATE Sending at 99826 kbps
07:18:25.679782 [I] GENERATE Sending at 99826 kbps
07:18:26.679867 [I] GENERATE Sending at 99826 kbps
07:18:27.679924 [I] GENERATE Sending at 99742 kbps
07:18:28.679981 [I] GENERATE Sending at 99731 kbps
07:18:29.680003 [I] GENERATE Sending at 99731 kbps
07:18:30.680086 [I] GENERATE Sending at 99710 kbps
07:18:31.680174 [I] GENERATE Sending at 99710 kbps
07:18:32.369776 [W] SOCKET::SRT_GROUP Member socket 0x33E6A129 (token 2) connection error: (2001) Connection was broken.
07:18:32.369915 [W] GENERATE socket::write::send: Connection was broken
07:18:32.406299 [I] PACER sendrate 100000000 bps (inter send interval 105 us)
07:18:33.406319 [I] GENERATE Sending at 99731 kbps
07:18:35.372760 [W] SOCKET::SRT_GROUP Member socket 0x33E6A126 (token 4) connection error: (1001) Connection setup failure: connection timed out.
07:18:35.381302 [W] GENERATE socket::write::send: Connection was broken
07:18:38.383786 [W] SOCKET::SRT_GROUP Member socket 0x33E6A123 (token 6) connection error: (1001) Connection setup failure: connection timed out.
07:18:38.383787 [W] SOCKET::SRT_GROUP Member socket 0x33E6A124 (token 5) connection error: (1001) Connection setup failure: connection timed out.
07:18:38.388270 [W] CONN srt_group::connect: Connection was broken
07:18:41.389755 [W] SOCKET::SRT_GROUP Member socket 0x33E6A122 (token 5) connection error: (1001) Connection setup failure: connection timed out.
07:18:41.390745 [W] SOCKET::SRT_GROUP Member socket 0x33E6A121 (token 6) connection error: (1001) Connection setup failure: connection timed out.
07:18:41.394396 [W] CONN srt_group::connect: Connection was broken
07:18:41.394857 [I] PACER sendrate 100000000 bps (inter send interval 105 us)
07:18:41.395105 [W] GENERATE socket::write::send: Connection was broken

NOTES:

maxsharabayko commented 9 months ago

srt-xtransmit-B0-physical-disconnect-connection-loss

@clime's comment:

With srt-xtransmit-B0-physical-disconnect-loss, it seems that disconnecting B0 led to an overall connection loss even though A1 and B1 were still physically connected.

PCAP for B0: no response from A0 starting at 12.440011942 (07:18:26.058145258)

PCAP for B1: no data packets are sent from 12.1265 to 12.7568 (for ~600 ms); wallclock 07:18:28.669548076. UMSG_ACK(0) is delayed, although it is expected to be sent immediately after a data packet is received. A network hardware issue?

I don't see any connection loss on B1. DATA packets keep coming. But there is a gap of more than 1 s between consecutive full ACKs on B1, around the time the last ACK was sent on B0 and the member link was likely broken by the idle timeout.

11.381865522 Full ACK 2006
12.756493741 Lite ACK
12.756616341 Full ACK 2007

A full ACK normally carries the seqno of (roughly) the most recently received DATA packet. The exception is ACK 2007, which acknowledges a data packet received at 11.388787, and that is around 1 s after the A0-B0 link was broken.

ACK 2006 timestamp 20249979
ACK 2007 timestamp 21624728

So ACK 2007 is sent by SRT 1.4 seconds later than it should be.
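
(For reference, taking the ACK timestamps above as SRT header timestamps in microseconds since connection start, the gap between the two full ACKs works out to 21624728 - 20249979 = 1374749 us, i.e. about 1.37 s, whereas full ACKs are nominally sent every 10 ms.)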

ethouris commented 9 months ago

@clime I still think we do need pcaps from both sides. Meaning, for example:

You have machines A and B with links 0 and 1. You make the broadcast group with A0->B0 and A1->B1. In the test you physically break the A0->B0 link. In this case we need pcaps recorded from devices A1 and B1. The other two could be useful, but are not that important (we expect the link to break there); they could, for example, help determine a coincidence with the link break. The A1 and B1 pcap set is essential to see which packets departed from a given device but didn't arrive at the destination, in a case where no distortion on this link was expected.
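
As a sketch (the interface names and output file names are placeholders, not taken from the original reports), such captures could be made on the A1 and B1 NICs with plain tcpdump, filtered to the SRT port used in the test:

$ tcpdump -i <A1-interface> -w a1.pcap udp port 5999    # on machine A
$ tcpdump -i <B1-interface> -w b1.pcap udp port 5999    # on machine B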

yomnes0 commented 9 months ago

Here are 2 pcaps where we can see the connection loss: pcap_connection_lost.zip

yomnes0 commented 9 months ago

listener_2828.zip caller_2828.zip Here are 2 more captures and their associated logs. In these captures, you can see the connection breaking and reconnecting.

yomnes0 commented 9 months ago

pcaps_logs_2828_3.zip Some more logs and pcaps of the same scenario.

The pcaps only show SRT packets, but I still have the full captures if needed.

yomnes0 commented 8 months ago

This issue can be reproduced pretty fast by specifying 2 IPs to the caller, with one of them unreachable:

./srt-xtransmit generate srt://actual_ip:5999 srt://fake_ip:5999 --sendrate 100Mbps --enable-metrics

SRT will connect via actual_ip. The connection through fake_ip will obviously fail. After a while, the connection through actual_ip will break.

This only seems to be reproducible over Ethernet.

clime commented 8 months ago

@ethouris I think @yomnes0 has done the job already (thanks! :)), but I've uploaded some additional captures (from all the interfaces) to the Slack thread. I have also tested a 50 Mbps transfer, where the issue didn't seem to be reproducible.

ethouris commented 8 months ago

Ok, there's nothing in the logs; it looks like the code wasn't properly configured for logging.

In the pcaps I can see one connection break with a stopped transmission, but the transmission stopped because the ACK packet reported no space left in the receiver buffer. That explains the stalled transmission. The reason for the broken connection and the reconnection attempt is unclear; it could just as well be that the connection was closed by the application.

clime commented 7 months ago

@ethouris You are talking about https://srtalliance.slack.com/files/U01C757PSG7/F06BVTHEQ6A/srt-xtransmit-redundancy-all-pcaps-lost-connection-b0-physical-disconnect.tar, right? So I guess the problem is in the size of receive buffers? Is this configurable somewhere?

Btw, if somebody could check: https://srtalliance.slack.com/files/U01C757PSG7/F06C2H82DS6/trc-srt-redundancy-all-pcaps-a1-physical-disconnect.tar https://srtalliance.slack.com/files/U01C757PSG7/F06C2DY37LK/trc-srt-redundancy-all-pcaps-b0-physical-disconnect.tar These are pcaps captured when we ran our application (with SRT redundancy implemented and used in the experiment); we got some corrupted/missing packets when I physically disconnected one of the network paths. It might be related to this issue with the broken connection, or it might be something entirely different (whole thread: https://srtalliance.slack.com/archives/C79B8M2SZ/p1699464592483509).

ethouris commented 7 months ago

I'm talking about the results that Yannick provided.

Note that there are two distinct possible behaviors of the reader: one is that the receiver has completely stopped reading, in which case there's nothing the sender can do but break the connection; the other is that it reads packets too slowly, or has a temporary spike in the data reading, which could probably be mitigated by increasing the receiver buffer size.

I'll check what is in these pcaps you provided.

maxsharabayko commented 7 months ago

@yomnes0, @clime Does the issue persist at 100 Mbps with a larger receiver buffer?

P.S. Please also note the Configuration Guidelines.
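
As an illustration only (assuming srt-xtransmit accepts the rcvbuf and fc socket options as URI query parameters, the same way it accepts groupconnect and mode in the listener command above; the values shown are arbitrary), a larger receiver buffer on the listener could look like:

$ ./srt-xtransmit receive 'srt://0.0.0.0:5999?groupconnect=1&mode=listener&fc=60000&rcvbuf=80000000' --enable-metrics --metricsfreq 1s --reconnect

SRTO_RCVBUF is given in bytes and is internally capped by the flow control window size (SRTO_FC, in packets), which is why fc is raised alongside rcvbuf.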

maxsharabayko commented 6 months ago

PR #2870 fixed a performance issue in reading from the broadcast group. @clime @yomnes0 Please retest this issue.

maxsharabayko commented 1 month ago

Closing due to inactivity (possibly fixed).