Haivision / srt

Secure, Reliable, Transport
https://www.srtalliance.org
Mozilla Public License 2.0

[BUG] Running srt-live-transmit behind NGINX proxy leads to high retransmit & loss #2805

Closed: koenkarsten closed this issue 1 year ago

koenkarsten commented 1 year ago

Describe the bug When running srt-live-transmit behind an NGINX proxy, a high number of retransmits occurs, which ultimately leads to belated and dropped packets and distortion in the payload.

To Reproduce Please find the Dockerfile attached as Dockerfile.txt (remove the .txt extension):

  1. docker build -t srt-nginx .
  2. docker run -d -p 8000:8000/udp -p 19000:19000/tcp -p 9000:9000/udp -p 10000:10000/tcp srt-nginx
  3. Streaming to NGINX srt://localhost:8000?passphrase=supersecretpassphrase leads to very high pktRcvDrop/pktRcvRetrans/pktRcvBelated (~30% retrans); an example sender command is sketched after this list.
  4. Streaming to SRT directly srt://localhost:9000?passphrase=supersecretpassphrase leads to very low pktRcvDrop/pktRcvRetrans/pktRcvBelated (0% retrans).
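For steps 3 and 4, any SRT-capable sender works; as an illustration (not taken from the attached Dockerfile), an ffmpeg build with libsrt could push a test stream like this, where input.ts is a placeholder file:

ffmpeg -re -i input.ts -c copy -f mpegts "srt://localhost:8000?passphrase=supersecretpassphrase"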

Running this Dockerfile produces the following setup locally:

  1. NGINX listening for UDP input on port 8000, which forwards everything to port 9000.
  2. srt-live-transmit listening for SRT input on port 9000, which forwards to udp://224.0.0.0:29999.
  3. NGINX listening for HTTP on port 10000, where the srt-live-transmit log & CSV files can be accessed:
     3a. http://localhost:10000/streamX.csv
     3b. http://localhost:10000/streamX.log
  4. UDPXY listening for HTTP on port 19000, which can be used for playback of the uploaded video:
     4a. http://localhost:19000/udp/224.0.0.0:29999
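As a sketch of item 2 (the exact invocation lives in the attached Dockerfile and may differ; the stats flags are assumed from the srt-live-transmit documentation):

srt-live-transmit "srt://:9000?passphrase=supersecretpassphrase" "udp://224.0.0.0:29999" -s:1000 -pf:csv -statsout:/var/log/srt/streamX.csv

This writes the CSV statistics (including the pktRcvDrop / pktRcvRetrans / pktRcvBelated columns queried below) to the directory NGINX serves on port 10000.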

Expected behavior When uploading SRT for a while, the statistics can be accessed to show the issue:

curl http://localhost:10000/streamX.csv | awk -F "\"*,\"*" '{print $20,$21,$22}'
pktRcvDrop pktRcvRetrans pktRcvBelated
166 2584 29
76 2178 13
95 2462 32
90 2559 13
101 2492 13
85 2530 20
33 2354 4

For a healthy stream, these numbers should be near zero.



Additional context I've tried many different NGINX setups, all failing to fix the issue. I'd love to learn where this comes from, as it seems to be a conceptual / config issue: it already occurs with streams as low as 700 kbit/s all the way up to 7 Mbit/s. Across this entire range the retransmission rate is always around 30%, whether uploading over the internet to a VM/EC2 instance in AWS or testing locally on a laptop with Docker.

Dockerfile.txt (remove .txt extension)

koenkarsten commented 1 year ago

NGINX config used in Dockerfile:

load_module '/usr/lib/nginx/modules/ngx_stream_module.so';
worker_processes auto;  # one worker process per CPU core
events {
}
http {
    # serve the srt-live-transmit log & CSV files over HTTP
    server {
        listen 10000;
        root /var/log/srt;
        location / {
        }
    }
}
stream {
    # plain UDP proxy in front of srt-live-transmit
    server {
        listen 8000 udp;
        proxy_pass localhost:9000;
    }
}

Besides this I've tried many proxy buffer / rate / keepalive settings, none of which help. I also tuned workers & connections, with the same results whether running on a big VM or locally.

ethouris commented 1 year ago

All I can see is that you experience 50% loss on the UDP link, or possibly high reordering. It might be helpful for understanding what is happening if you record a pcap file on the receiver side. I actually doubt that it is reordering, because if that were the case the number of belated packets would be much higher, close to half of the retransmission values.

Note one thing: it is not SRT that is responsible for these numbers being so high, but your UDP link. Maybe SRT can behave better when the parameters are slightly better adjusted, but I can't see any unexpected behavior here. I'm even surprised that the number of dropped packets is so low.

koenkarsten commented 1 year ago

Yes, the UDP link (NGINX) certainly is the issue here, so this inquiry is more a call for information from anyone who has set up source -> NGINX -> srt-live-transmit before. In a way it indeed shows the great performance of SRT, given the relatively low number of dropped packets 👍 . So if anybody can supply information on this topic, or suggest alternative UDP proxy setups, I'm all ears!

ethouris commented 1 year ago

Well, you misled us a bit by reporting a "BUG". Things would have gone differently if you had chosen a "QUESTION".

For starters, I can advise setting a higher latency. Very low latency values are good only for slightly lossy links. With high losses it is usually unlikely to get a good recovery rate, but here SRT behaves surprisingly well, so it might be that the link has high bandwidth capacity fluctuations (that is, the capacity is often above the bitrate, otherwise you wouldn't see so many retransmissions). But to recover packets, especially when they need a second or third retransmission, you need to give SRT more time to deliver them. Note also that with extremely high latencies you might need to set a higher value for the receiver buffer size.
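As a hedged sketch of that tuning (the values are illustrative, not recommendations from this thread), latency and buffer sizes can be passed as SRT URI query parameters, e.g. on the srt-live-transmit listener:

srt-live-transmit "srt://:9000?passphrase=supersecretpassphrase&latency=1000&fc=32768&rcvbuf=33554432" "udp://224.0.0.0:29999"

with a matching latency=1000 on the sender URI. Here latency is in milliseconds, rcvbuf is in bytes, and fc (the flow-control window) is in packets; SRT negotiates the effective latency as the maximum of the values set on the two sides.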

koenkarsten commented 1 year ago

Thanks, the mentioned tweaks have been performed and helped a bit, but not all the way. After a lot of debugging we pinpointed the issue, which lies on the NGINX side as expected: worker_processes auto; spawns one worker process per CPU, but having multiple workers causes a lot of out-of-order messages for SRT to process. Setting worker_processes 1; "fixes" the issue by forcing all messages to be processed by the same worker, with the tradeoff of not utilising the full capacity of the server. So at least I know the root cause now and can work out how to fix this properly from here. Thanks for the suggestions!
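Relative to the configuration posted above, the workaround is a one-line change (a minimal sketch):

worker_processes 1;  # single worker: all datagrams of a session are handled by one process, preserving order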

mentalrey commented 3 months ago

Thanks for the workaround, @koenkarsten (I wish the solution were a different one; hopefully the Nginx team works on it).

We also found a big packet loss problem when sending SRT video streams over the UDP protocol. The problem lies in how UDP flows are identified: the receiving application can associate retransmitted packets with an existing flow only when they arrive from the same source port that was already sending the previous ones.

Passing through Nginx, which by default spreads the traffic across multiple workers, a large number of these packets arrive at their destination with a different source port, and the video app treats them as independent flows, effectively losing the packets and creating video glitches. With worker_processes 1; set, transmission occurs without changing the source port and packets arrive with almost 0% loss.
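An alternative that keeps multiple workers, not verified in this thread, is NGINX's reuseport listen option: each worker gets its own listening socket and the kernel hashes every client flow to one of them, so a given flow is always handled by the same worker and its upstream source port stays stable (a sketch):

stream {
    server {
        listen 8000 udp reuseport;  # per-worker sockets; the kernel pins each flow to one worker
        proxy_pass localhost:9000;
    }
}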