Closed clime closed 1 month ago
@clime's comment:
With srt-xtransmit-B0-physical-disconnect-loss it seems that disconnecting B0 led to overall connection loss, even though A1 and B1 were still physically connected.
PCAP for B0: no response from A0 starting at 12.440011942 (07:18:26.058145258)
PCAP for B1: no data packets are sent from 12.1265 to 12.7568 (for ~600 ms) Wallclock 07:18:28.669548076. UMSG_ACK(0) is delayed, although it is expected to be sent immediately after a data packet is received. Network hardware issue?
I don't see any connection loss on B1. DATA packets keep coming. But there is a big gap of more than 1 s between consecutive full ACKs on B1 around the time the last ACK was sent on B0 and the member link was likely broken by idle timeout.
11.381865522 Full ACK 2006
12.756493741 Lite ACK
12.756616341 Full ACK 2007
A full ACK normally carries a seqno close to that of the most recently received DATA packet. The exception is ACK 2007, which acknowledges a data packet received at 11.388787. And that's around 1 s after link A0--B0 was broken.
ACK 2006 timestamp 20249979
ACK 2007 timestamp 21624728
So ACK 2007 is sent by SRT 1.4 seconds later than it should be.
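A quick sanity check of the numbers above (a back-of-the-envelope sketch: SRT packet timestamps are microseconds since connection start, and full ACKs are nominally emitted on a fixed 10 ms period):

```shell
# Timestamps taken from the two full ACKs quoted above (microseconds).
ack_2006=20249979
ack_2007=21624728

gap_us=$((ack_2007 - ack_2006))
echo "gap between full ACKs: ${gap_us} us"     # 1374749 us ~= 1.37 s

# With the nominal 10 ms full-ACK period, the excess delay is:
echo "excess delay: $((gap_us - 10000)) us"    # ~1.36 s late
```

So the gap itself is ~1.37 s, and subtracting the expected 10 ms period gives roughly the 1.4 s anomaly described above.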
@clime I still think we do need pcaps from both sides. Meaning, for example:
You have machines A and B with links 0 and 1. You make the broadcast group with A0->B0 and A1->B1. In the test you physically break the A0->B0 link. In this case we need pcaps recorded on devices A1 and B1. The other two could be useful, but are less important (we expect the link to break there), for example to correlate timing with the link break. The A1 and B1 pcap set is essential to see which packets departed from a given device but didn't arrive at the destination, on a link where no disruption was expected.
Here are 2 pcaps where we can see the connection being lost: pcap_connection_lost.zip
listener_2828.zip caller_2828.zip Here are 2 more captures and their associated logs. In those captures, you can see the connection breaking and reconnecting.
pcaps_logs_2828_3.zip: some more logs and pcaps of the same scenario.
The pcaps only show SRT packets, but I still have the full captures if needed.
This issue can be reproduced pretty fast by specifying 2 IPs to the caller, with one of them unreachable:
./srt-xtransmit generate srt://actual_ip:5999 srt://fake_ip:5999 --sendrate 100Mbps --enable-metrics
SRT will connect with actual_ip. The connection through fake_ip will obviously fail. After a while, the connection through actual_ip will break.
This seems to be reproducible only over Ethernet.
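For completeness, both ends of this repro might look like the following (a sketch only: the port is taken from the command above, and the `receive` subcommand on the listener side is an assumption to verify against your srt-xtransmit build):

```shell
# Listener side (the machine owning actual_ip); `receive` subcommand
# assumed -- check `./srt-xtransmit --help` for your build.
./srt-xtransmit receive "srt://:5999" --enable-metrics &

# Caller side: two member links, one of them deliberately unreachable.
./srt-xtransmit generate "srt://actual_ip:5999" "srt://fake_ip:5999" \
    --sendrate 100Mbps --enable-metrics
```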
@ethouris I think yomnes has done the job already (thanks! : )) but I've uploaded some additional captures (from all the interfaces) to the Slack thread. I have also tested a 50 Mbps transfer, where the issue didn't seem to be reproducible.
Ok, there's nothing in the logs - it looks like logging wasn't properly configured.
In the pcaps I can see one connection break with stopped transmission - but the transmission stopped because an ACK packet reported no space left in the receiver buffer. That explains the stopped transmission. The reason for the broken connection and the reconnection attempt is unclear; the connection could just as well have been closed by the application.
@ethouris You are talking about https://srtalliance.slack.com/files/U01C757PSG7/F06BVTHEQ6A/srt-xtransmit-redundancy-all-pcaps-lost-connection-b0-physical-disconnect.tar, right? So I guess the problem is in the size of receive buffers? Is this configurable somewhere?
Btw, if somebody could check: https://srtalliance.slack.com/files/U01C757PSG7/F06C2H82DS6/trc-srt-redundancy-all-pcaps-a1-physical-disconnect.tar https://srtalliance.slack.com/files/U01C757PSG7/F06C2DY37LK/trc-srt-redundancy-all-pcaps-b0-physical-disconnect.tar These are pcaps captured while running our application (with SRT redundancy implemented and used in the experiment) - we got some corrupted/missing packets when I physically disconnected one of the network paths. It might be related to this broken-connection issue, or it might be something entirely different (whole thread: https://srtalliance.slack.com/archives/C79B8M2SZ/p1699464592483509)
I'm talking about the results that Yannick provided.
Note that there are two distinct possible reader behaviors. One is that the receiver has completely stopped reading, in which case there's nothing the sender can do but break the connection. The other is reading packets too slowly, or a temporary spike in read time, which could probably be mitigated by increasing the receiver buffer size.
I'll check what is in these pcaps you provided.
@yomnes0, @clime Does the issue persist at 100 Mbps with a larger receiver buffer?
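On sizing the buffer: a common rule of thumb (my assumption here, not an official formula from this thread) is that the receiver buffer should cover at least sendrate/8 × (RTT + SRT receiver latency). For the 100 Mbps repro, with illustrative timing values:

```shell
# Back-of-the-envelope receiver buffer sizing. rtt_ms and latency_ms
# are illustrative assumptions, not values measured from the pcaps.
rate_bps=100000000      # 100 Mbps, the send rate from the repro command
rtt_ms=20               # assumed round-trip time
latency_ms=120          # SRT's default receiver latency

window_ms=$((rtt_ms + latency_ms))
min_buf=$((rate_bps / 8 * window_ms / 1000))
echo "minimum receiver buffer: ${min_buf} bytes"   # 1750000 (~1.7 MB)
```

If this exceeds the current `SRTO_RCVBUF`, it can be raised via `srt_setsockopt()`; note that the effective buffer size is also bounded by the flow-control window (`SRTO_FC`), so both options may need adjusting together.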
P.S. Please also note the Configuration Guidelines.
PR #2870 fixed a performance issue in reading from the broadcast group. @clime @yomnes0 Please retest this issue.
Closing due to inactivity (possibly fixed).
Setup: In my testing setup, I have two machines (A, B) with 2 NICs each - so let's call these NICs A0, A1 (machine A) and B0, B1 (machine B). The A0 and B0 NICs are connected directly by an Ethernet cable, and A1 and B1 are connected through a switch.
Action: Disconnected B0 NIC.
Listener's log on machine B:
Caller's log on machine A:
NOTES:
Discussed at https://srtalliance.slack.com/archives/C79B8M2SZ/p1699464592483509