Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

[BUG] 23X random transmit slowdown with packet drop setup #2313

Closed rwgitx closed 2 years ago

rwgitx commented 2 years ago

Describe the bug

SRT takes about 23x longer to send 1000 messages in a test environment with 1% packet drop. The slowdown happens randomly: identical runs are sometimes fast and sometimes slow.

To Reproduce

*) All tests are done on a single box over the loopback interface; 1% packet drop is added with netem

sudo modprobe sch_netem
sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem loss 1%

*) commits used 

srt version 8b32f3734ff6af7cc7b0fef272591cb80a2d1aae
xtransmit version 2499cfe3b05f919d446cfa7ef4a7dda45ada7a93

$ git submodule 
 5ce8958c7e3d2c871d1ba3180a4e4f1543eece4a submodule/CLI11 (v1.7.1-288-g5ce8958)
 3a0746bf5f601dfed05330aefcb6854354fce07d submodule/function2 (4.1.0)
 22a169bc319ac06948e7ee0be6b9b0ac81386604 submodule/spdlog (v1.2.1-1451-g22a169bc)
 8b32f3734ff6af7cc7b0fef272591cb80a2d1aae submodule/srt (v1.4.4)

*) Receiver
[srt-xtransmit]$ srt-xtransmit receive "srt://:4200?transtype=file&messageapi=1&payloadsize=1456&rcvbuf=1000000000&sndbuf=1000000000&fc=800000" --msgsize 1456  --statsfile stats-rcv.csv --statsfreq 1s --enable-metrics --metricsfile metrics-rx.csv --metricsfreq 1s

*) Sender

- Good case
[srt-xtransmit]$ time srt-xtransmit generate "srt://127.0.0.1:4200?transtype=file&messageapi=1&payloadsize=1456&rcvbuf=1000000000&sndbuf=1000000000&fc=800000" --msgsize 1456 --num 1000 --statsfile stats-snd.csv --statsfreq 1s --enable-metrics

real    0m1.053s
user    0m0.027s
sys     0m0.032s

- Bad case
[srt-xtransmit]$ time srt-xtransmit generate "srt://127.0.0.1:4200?transtype=file&messageapi=1&payloadsize=1456&rcvbuf=1000000000&sndbuf=1000000000&fc=800000" --msgsize 1456 --num 1000 --statsfile stats-snd.csv --statsfreq 1s --enable-metrics

real    0m23.110s
user    0m0.555s
sys     0m1.054s
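
For reference, the two timings above can be turned into a slowdown factor and a rough goodput figure. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, and the wall-clock time from `time` also includes connection setup and teardown, so the goodput numbers are approximate.

```python
def slowdown(good_seconds, bad_seconds):
    """How many times longer the bad run took than the good run."""
    return bad_seconds / good_seconds


def goodput_mbps(num_messages, message_bytes, seconds):
    """Payload bits delivered per second, in Mbit/s (wall-clock based)."""
    return num_messages * message_bytes * 8 / seconds / 1e6


if __name__ == "__main__":
    # Values taken from the two runs above: --num 1000, --msgsize 1456.
    print("slowdown    : %.1fx" % slowdown(1.053, 23.110))
    print("good goodput: %.2f Mbit/s" % goodput_mbps(1000, 1456, 1.053))
    print("bad goodput : %.2f Mbit/s" % goodput_mbps(1000, 1456, 23.110))
```

From these two runs the factor comes out at roughly 22x, with the bad case at about half a megabit per second.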

Expected behavior

No massive 23X slowdown

Additional context

Attached wireshark capture and srt-xtransmit stats.

wireshark.stats.zip

rwgitx commented 2 years ago

@maxsharabayko FYI, I did not open the bug in the https://github.com/maxsharabayko/srt-xtransmit repo because I guess the culprit is probably in the SRT core, not in srt-xtransmit.

rwgitx commented 2 years ago

I managed to reproduce it with srt-file-transmit too. It makes no sense for a 10M file to take 35 seconds to send, even with a 1% packet drop.

$ cat netem.sh
sudo modprobe sch_netem
sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem loss 1%

$ dd if=/dev/urandom of=test.data bs=10M count=1

Receiver:
$ ./srt-file-transmit srt://:5002 file:///tmp/test.data

Sender:
Bad case:
$ time ./srt-file-transmit file:///./test.data srt://127.0.0.1:5002
Target connected (caller)
File sent
Buffers flushed

real    0m35.301s
user    0m0.474s
sys     0m2.189s

Good case:
$ time ./srt-file-transmit file:///./test.data srt://127.0.0.1:5002
Target connected (caller)
File sent
Buffers flushed

real    0m0.795s
user    0m0.057s
sys     0m0.138s

maxsharabayko commented 2 years ago

The current file congestion control tries to avoid congestion. As you always apply a fixed packet drop percentage, the module can't stop reducing the sending rate. Although at 1% it should at least stay at the same rate.

File CC description: [link]. File CC source code: [link].

rwgitx commented 2 years ago

> The current file congestion control tries to avoid congestion. As you always apply a fixed packet drop percentage, the module can't stop reducing the sending rate. Although at 1% it should at least stay at the same rate.
>
> File CC description: [link]. File CC source code: [link].

Thanks for the swift response and the pointer, @maxsharabayko ! I will take a look.

ethouris commented 2 years ago

Netem isn't exactly a good representative of what happens in a real network. In a real network the percentage of dropped packets is proportional to how many of them exceed the invisible speed limit. The speed limit itself may vary over time, but drops should only happen when this speed is exceeded, not "always" as netem does. OTOH I personally don't know any network testing program that can do something like this.

The general idea behind the File CC is that it should slow down when it sees a packet drop and carefully speed up if no drops have occurred. If packets are dropped all the time, even if you send as slowly as 100 packets per minute, it will keep slowing down. In a real-world network, when you slow down to below 3/4 of the average speed limit, you should have no drops at all (unless some router malfunctions along the way, but that's rare).
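
The difference between the two loss patterns can be sketched with a toy model. To be clear, this is not SRT's FileCC: the back-off factor (12.5%), the growth step (+1 packet/s per interval) and all function names are assumptions chosen only to illustrate "slow down on loss, speed up carefully otherwise".

```python
import random


def simulate(loss_model, steps=2000, start_rate=1000.0, seed=1):
    """Toy loss-reactive sender; returns the rate (packets/s) after `steps` intervals.

    Each step stands for a one-second interval. The constants are illustrative
    assumptions, not values taken from SRT.
    """
    rng = random.Random(seed)
    rate = start_rate
    for _ in range(steps):
        if loss_model(rate, rng):
            rate = max(1.0, rate * 0.875)  # back off after seeing a loss
        else:
            rate += 1.0  # carefully probe for more bandwidth
    return rate


def constant_loss(rate, rng, p=0.01):
    """netem-style: every packet is lost with probability p, whatever the rate."""
    # Probability that at least one of the `rate` packets in this interval is lost.
    return rng.random() < 1.0 - (1.0 - p) ** rate


def capacity_loss(rate, rng, limit=800.0):
    """Bottleneck-style: loss shows up only when the rate exceeds the limit."""
    return rate > limit


if __name__ == "__main__":
    print("constant 1% loss  :", round(simulate(constant_loss)), "pkt/s")
    print("capacity-only loss:", round(simulate(capacity_loss)), "pkt/s")
```

In this toy model the capacity-limited sender settles into a sawtooth just below the limit, while the sender facing constant random loss is pushed down to a small fraction of its starting rate, because slowing down never makes the loss go away.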

rwgitx commented 2 years ago

I added some burstiness to the loss (the second value is netem's loss correlation):

sudo tc qdisc add dev lo root netem loss 1% 25%

With this drop setup I could not reproduce the slowness.

rwgitx commented 2 years ago

> Netem isn't exactly a good representative of what happens in a real network. In a real network the percentage of dropped packets is proportional to how many of them exceed the invisible speed limit. The speed limit itself may vary over time, but drops should only happen when this speed is exceeded, not "always" as netem does. OTOH I personally don't know any network testing program that can do something like this.

Sure. I totally agree that netem is not realistic here. Perhaps we can add a script that emulates drops only when SRT exceeds the limit, and turns the drop off or reduces the drop percentage on the fly.

> The general idea behind the File CC is that it should slow down when it sees a packet drop and carefully speed up if no drops have occurred. If packets are dropped all the time, even if you send as slowly as 100 packets per minute, it will keep slowing down. In a real-world network, when you slow down to below 3/4 of the average speed limit, you should have no drops at all (unless some router malfunctions along the way, but that's rare).

I am not sure whether the following could happen in real life: the router is exhausted during the hours when many people watch video, and "streams" from multiple users share the same router. Even if one sender slows down, the router is still exhausted and decides to drop packets from all the users.

@ethouris Thanks for your insight!

ethouris commented 2 years ago

Yeah, but what you are talking about fits perfectly within what I described: this "invisible speed limit" may sometimes be lower, sometimes higher, and in some extreme cases it can drop to 1/100 of the initial value, then get back to full speed, etc.

A "real world" network emulator program could also be programmed to behave this way, either automatically or under manual, on-demand control. What is important is that the thing being manipulated is this invisible speed limit. It might also be that, at some very low available throughput, completely random drops occur even when you send below the speed limit, but normally, even if this limit varies and is low at the moment, sending below it should cause no drops.
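
A minimal sketch of such an emulator, modelled as a token bucket. Everything here is hypothetical (the class, its parameters and the helper are invented for illustration and do not wrap netem or any real tool); it only shows the idea of dropping packets solely when the sender exceeds a limit that can be changed on the fly.

```python
class RateLimitedLink:
    """Toy bottleneck that drops packets only while the sender exceeds the limit."""

    def __init__(self, limit_bps, burst_bytes):
        self.limit_bps = limit_bps      # the "invisible speed limit", bits/s
        self.burst_bytes = burst_bytes  # how far above the limit a burst may go
        self.tokens = float(burst_bytes)
        self.last_time = 0.0

    def set_limit(self, limit_bps):
        """Change the limit on the fly, e.g. to emulate a busy hour."""
        self.limit_bps = limit_bps

    def offer(self, now, size_bytes):
        """Return True if the packet passes, False if it is dropped."""
        elapsed = max(0.0, now - self.last_time)
        self.last_time = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.limit_bps / 8.0)
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False


def drop_ratio(link, send_bps, seconds=5.0, size_bytes=1456):
    """Send at a constant bit rate through `link`; return the fraction dropped."""
    interval = size_bytes * 8.0 / send_bps
    count = int(seconds / interval)
    dropped = sum(0 if link.offer(i * interval, size_bytes) else 1
                  for i in range(count))
    return dropped / count


if __name__ == "__main__":
    print("below limit:", drop_ratio(RateLimitedLink(10e6, 15000), 5e6))
    print("above limit:", drop_ratio(RateLimitedLink(10e6, 15000), 20e6))
```

Sending below the limit produces no drops, while sending at twice the limit loses about half of the packets, which is the behaviour a loss-reactive sender can actually respond to.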

rwgitx commented 2 years ago

As a comparison, I tried TCP (via scp) with the same 1% drop. The result is pretty stable.

$ cat netem.sh

sudo modprobe sch_netem
sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem loss 1%

$ dd if=/dev/urandom of=test.data bs=1G count=1

$ time scp test.data 127.0.0.1:/tmp
test.data                                                                                                                                  100%   32MB 483.4MB/s   00:00    

real    0m0.310s
user    0m0.070s
sys     0m0.056s

$ time scp test.data 127.0.0.1:/tmp
test.data                                                                                                                                  100%   32MB 476.4MB/s   00:00    

real    0m0.286s
user    0m0.073s
sys     0m0.061s

rwgitx commented 2 years ago

Btw, what will happen in live mode using LiveCC if we have a constant ~1% drop? Will SRT reduce the sending rate all the time as well?

maxsharabayko commented 2 years ago

> Btw, what will happen in live mode using LiveCC if we have a constant ~1% drop? Will SRT reduce the sending rate all the time as well?

No, it can't block the source stream. See SRT Packet Pacing and Live Congestion Control (LiveCC). But based on SRT stats, the encoder can adjust the bitrate. For example, see the Adaptive Rate Control for Live streaming using SRT protocol.

maxsharabayko commented 2 years ago

@rwgitx Normally you would want to limit the throughput. netem can do the "rate limit", but one way only (if I remember correctly, for incoming packets). LanForge offers nice and easy rate limit functionality.

Might be useful for you: Netem wrapper.

rwgitx commented 2 years ago

> @rwgitx Normally you would want to limit the throughput. netem can do the "rate limit", but one way only (if I remember correctly, for incoming packets). LanForge offers nice and easy rate limit functionality.
>
> Might be useful for you: Netem wrapper.

Nice. Thank you!

ethouris commented 2 years ago

In short, Live CC does no speed measurement, unlike File CC. Live CC can only cap the sending speed so that it stays below a user-defined limit, that's all. Paradoxically, netem is actually quite good for testing live mode, except when we are trying to test how it works when riding at the edge of the speed limit of a particular network, and the case where a loss has to be compensated by retransmissions, which cause overhead.
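
To put rough numbers on the two points above, here is a small sketch. It assumes that pacing simply spaces packets so that the stream stays under the configured limit, and that every transmission (including retransmissions) is lost independently with the same probability. The function names are made up, and the default sizes (1316-byte payload, 44 bytes of headers) are assumptions for illustration.

```python
def pacing_interval_us(max_bw_bytes_per_s, payload_bytes=1316, header_bytes=44):
    """Microseconds between packets so that the stream stays under max_bw."""
    return 1_000_000.0 * (payload_bytes + header_bytes) / max_bw_bytes_per_s


def retransmission_overhead(loss_rate):
    """Expected extra transmissions per packet under independent loss."""
    # Expected sends until success is 1 / (1 - p); subtract the original send.
    return 1.0 / (1.0 - loss_rate) - 1.0


if __name__ == "__main__":
    # 10 Mbit/s limit expressed in bytes per second.
    print("interval: %.0f us" % pacing_interval_us(1_250_000))
    print("overhead: %.2f%%" % (100 * retransmission_overhead(0.01)))
```

Under these assumptions a constant 1% loss costs only about 1% extra bandwidth in retransmissions, which matters mainly when the stream is already close to the limit.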

maxsharabayko commented 2 years ago

Closing as won't fix: artificial persistent 1% packet loss is not an expected test environment for FileCC. Consider using rate limit instead.