TCP congestion control algorithm issue

egwakim commented 3 years ago

Hi, since TRex uses a 4.4BSD-based TCP/IP stack, its TCP throughput is too sensitive to delay and duplicate packets. Unlike the modern Linux kernel's CUBIC algorithm, the Reno algorithm halves the window size when a duplicate is found and restarts slow start. This means average throughput drops whenever duplicate packets are seen. The situation gets worse on high-delay remote networks or mobile networks.

Do you have any plan to upgrade the 4.4BSD-based TCP/IP stack to a newer version?

Best Regards Gwangmoon

hhaim commented 3 years ago

@egwakim it is not sensitive to delay alone (without drop/dup), only to dup and drop. The CC (congestion control) takes effect in those cases, and usually when we test the scale of routers/firewalls we stop the test before that point (NDR), so it wasn't important for the main objective of TRex (scale). Another point is that to simulate such drop/dup cases you should add a reasonable RTT, and it will not be that simple to queue so many packets at 100 Gbps. There are many types of CC now that are not part of the newer BSD versions, like Google BBRv2 and datacenter CC. If you need to verify the DUT under drop conditions, it is worth starting by adding RTT simulation first.

egwakim commented 3 years ago

In our test, TRex showed very low TCP throughput on a traffic path with high delay, out-of-order packets, duplicates, and loss. It was very sensitive: it drops the window size to half and restarts slow start from the beginning, increases very slowly, and this happens again and again every time a duplicate is found. So the throughput graph showed a sawtooth pattern, and it does not go up any more. We tried many combinations of tunables, but it was not possible to reach the same result as with Linux.
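The sawtooth described above can be reproduced with a toy per-RTT model. This is only a sketch under stated assumptions: the window is counted in segments, a loss or duplicate event is injected every `loss_every` round trips, and the function name and parameters are hypothetical and not taken from TRex or BSD.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy per-RTT model of a Reno-style sender (illustrative only, not TRex code).
// cwnd and ssthresh are counted in segments. A loss/dup event is injected
// every `loss_every` round trips (0 disables loss events).
// If restart_slow_start is true, cwnd falls back to 1 segment after an event
// (the behavior described in the comment above); otherwise cwnd is set to the
// halved window.
std::vector<double> simulate_aimd(double initial_cwnd, double ssthresh,
                                  std::size_t rtts, std::size_t loss_every,
                                  bool restart_slow_start) {
    std::vector<double> trace;
    trace.reserve(rtts);
    double cwnd = initial_cwnd;
    for (std::size_t i = 1; i <= rtts; ++i) {
        trace.push_back(cwnd);  // window used during this round trip
        if (loss_every != 0 && i % loss_every == 0) {
            // Multiplicative decrease: remember half the window as ssthresh.
            ssthresh = std::max(cwnd / 2.0, 2.0);
            cwnd = restart_slow_start ? 1.0 : ssthresh;
        } else if (cwnd < ssthresh) {
            // Slow start: double per RTT, capped at ssthresh.
            cwnd = std::min(cwnd * 2.0, ssthresh);
        } else {
            // Congestion avoidance: one segment per RTT.
            cwnd += 1.0;
        }
    }
    return trace;
}
```

Plotting the returned trace for a small `loss_every` shows the window never climbing far above its post-loss value, which is consistent with the flat sawtooth reported here.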

hhaim commented 3 years ago

@egwakim if you have only delay, the throughput can reach more than 10 Gbps for one flow. I would try to solve the DUP/DROP in your DUT. Why are you testing under drop conditions? Are you testing TCP itself?

egwakim commented 3 years ago

Hi @hhaim, most mobile networks are quite tough environments because they include a radio segment, and we are trying to run capacity tests on a mobile network. In STL mode we got the maximum throughput we wanted for our STP. But in ASTF mode with the ftp test, throughput was far lower than in STL mode. According to the packet analysis, with default TCP parameters the TCP ramp-up time to reach maximum throughput was too long, around 60 seconds. There were long RTTs, retransmissions, and duplicated packets. Throughput dropped quickly after a packet loss or retransmission, and ramping up again was too slow.

I checked the TCP/IP stack in TRex. It seems the reference TCP/IP stack has been updated quite a lot since BSD 4.4. For example, with respect to RFC 2581, the current BSD 4.4 code has an issue in congestion window handling.

https://tools.ietf.org/html/rfc2581

During congestion avoidance, cwnd is incremented by 1 full-sized segment per round-trip time (RTT). Congestion avoidance continues until congestion is detected. One formula commonly used to update cwnd during congestion avoidance is given in equation 2:

  cwnd += SMSS*SMSS/cwnd                     (2)

This adjustment is executed on every incoming non-duplicate ACK. Equation (2) provides an acceptable approximation to the underlying principle of increasing cwnd by 1 full-sized segment per RTT. (Note that for a connection in which the receiver acknowledges every data segment, (2) proves slightly more aggressive than 1 segment per RTT, and for a receiver acknowledging every-other packet, (2) is less aggressive.)

Implementation Note: Since integer arithmetic is usually used in TCP implementations, the formula given in equation 2 can fail to increase cwnd when the congestion window is very large (larger than SMSS*SMSS). If the above formula yields 0, the result SHOULD be rounded up to 1 byte.
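To make the implementation note concrete, here is a minimal sketch of equation (2) in integer arithmetic, with and without the round-up rule. The function names are hypothetical and the code is standalone, not taken from TRex or BSD. Both functions assume `cwnd` is non-zero.

```cpp
#include <algorithm>
#include <cstdint>

// Congestion-avoidance increment from RFC 2581 equation (2), computed with
// integer division as a TCP stack typically would. smss and cwnd are in
// bytes; cwnd must be non-zero. Returns 0 once cwnd exceeds smss * smss.
uint32_t ca_increment_raw(uint32_t smss, uint32_t cwnd) {
    // Widen to 64 bits so smss * smss cannot overflow.
    return static_cast<uint32_t>((static_cast<uint64_t>(smss) * smss) / cwnd);
}

// Same increment with the RFC's rule applied: a result of 0 is rounded up
// to 1 byte so the window keeps growing.
uint32_t ca_increment_clamped(uint32_t smss, uint32_t cwnd) {
    return std::max<uint32_t>(ca_increment_raw(smss, cwnd), 1u);
}
```

With an SMSS of 1460 bytes, SMSS*SMSS is 2,131,600, so the unclamped increment becomes 0 for any window above roughly 2.1 MB. Windows that large need window scaling and are plausible on a fast path with long RTT.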

According to RFC 2581, the increment should not be zero ("If the above formula yields 0, the result SHOULD be rounded up to 1 byte."). But this was only fixed from BSD 6.4.

The current TRex TCP/IP stack is based on BSD 4.4:
https://github.com/freebsd/freebsd-src/blob/releng/4.4/sys/netinet/tcp_input.c#L1962

RFC 2581 has been applied from BSD 6.4:
https://github.com/freebsd/freebsd-src/blob/releng/6.4/sys/netinet/tcp_input.c#L2139

The optional RFC 3465 has been applied from BSD 8.2:
https://github.com/freebsd/freebsd-src/blob/releng/8.2/sys/netinet/tcp_input.c#L2332

Maybe the following change needs to be applied to comply with RFC 2581.

diff --git a/src/44bsd/tcp_input.cpp b/src/44bsd/tcp_input.cpp
index b5eb23dd..43b8a309 100644
--- a/src/44bsd/tcp_input.cpp
+++ b/src/44bsd/tcp_input.cpp
@@ -1255,7 +1255,7 @@ trimthenstep6:
         uint32_t incr = tp->t_maxseg;

         if (cw > tp->snd_ssthresh)
-            incr = incr * incr / cw;
+            incr = ( ((incr * incr / cw) > 1) ? (incr * incr / cw) : 1 ); /* RFC2581 */
         tp->snd_cwnd = bsd_umin(cw + incr, TCP_MAXWIN<<tp->snd_scale);
         }
         if (acked > so->so_snd.sb_cc) {

egwakim commented 3 years ago

Hi @hhaim, it seems to be caused by the sensitivity of the TCP congestion control algorithm to delay and packet loss. The current TCP Tahoe algorithm in BSD 4.4 is quite sensitive to packet loss and delay on long-distance networks. We found this on a high-delay network. See https://en.wikipedia.org/wiki/TCP_congestion_control

Most modern OSes (Linux, macOS, Windows 10) use the TCP CUBIC algorithm (https://tools.ietf.org/html/rfc8312, https://en.wikipedia.org/wiki/CUBIC_TCP). It shows quite fair performance in most environments. TCP CUBIC and other modern algorithms (HTCP etc.) were adopted from BSD 8.3.

I assume that the reasonable way to improve TCP throughput on long-distance networks is to upgrade the TCP/IP stack in TRex. That would make its behavior quite similar to that of other modern OSes.

hhaim commented 3 years ago

@egwakim it is possible to just add the CC instead of replacing the whole TCP stack. The current code in bsd4.4 is New Reno. I would add Google BBR too (v1 and v2).

The main use case is to test high-scale scenarios in the lab. Given the lab RTT and that our stopping point is NDR, the CC is less relevant. However, I agree that for long distances and scenarios with drops there is a need for a more up-to-date CC.

egwakim commented 3 years ago

BBR could be a good alternative. But the majority of OSes use CUBIC now, so I'm more interested in CUBIC from the point of view of simulating real traffic behavior. Do you plan to add BBR CC to TRex? Or do you plan to change TRex's current TCP/IP stack to a pluggable CC structure? Recent BSD versions can plug in various CCs easily. See https://github.com/freebsd/freebsd-src/tree/main/sys/netinet/cc
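As a rough illustration of what a pluggable CC structure could look like, here is a minimal sketch. Every name in it (`CcState`, `CongestionControl`, `NewRenoLike`) is hypothetical and exists neither in TRex nor in FreeBSD. The idea is only that the TCP input path calls through an interface on ACK and on loss, so a CUBIC or BBR implementation could be swapped in without touching the rest of the stack.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-connection congestion state, all values in bytes.
struct CcState {
    uint32_t cwnd;      // congestion window
    uint32_t ssthresh;  // slow-start threshold
    uint32_t mss;       // sender maximum segment size
};

// Hypothetical plug-in interface: the stack only calls these hooks.
struct CongestionControl {
    virtual ~CongestionControl() = default;
    virtual const char* name() const = 0;
    virtual void on_ack(CcState& s, uint32_t acked_bytes) const = 0;
    virtual void on_loss(CcState& s) const = 0;
};

// Simplified Reno-style algorithm used as the example plug-in.
struct NewRenoLike : CongestionControl {
    const char* name() const override { return "newreno-like"; }

    void on_ack(CcState& s, uint32_t acked_bytes) const override {
        if (s.cwnd < s.ssthresh) {
            // Slow start: grow by at most one MSS per ACK.
            s.cwnd += std::min(acked_bytes, s.mss);
        } else {
            // Congestion avoidance, RFC 2581 equation (2), rounded up to 1.
            uint64_t incr = (static_cast<uint64_t>(s.mss) * s.mss) / s.cwnd;
            s.cwnd += static_cast<uint32_t>(std::max<uint64_t>(incr, 1));
        }
    }

    void on_loss(CcState& s) const override {
        // Halve the window, but keep at least two segments.
        s.ssthresh = std::max(s.cwnd / 2, 2 * s.mss);
        s.cwnd = s.ssthresh;
    }
};
```

A connection would then hold a `const CongestionControl*` chosen at setup time, and adding another algorithm would mean adding another subclass rather than editing the input path.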

hhaim commented 3 years ago

@egwakim I think the BSD pluggable structure is the way to go (with CUBIC) instead of replacing all of the TCP/UDP code. We don't have a request for this yet, so it is not on our roadmap. Google BBR is still not part of BSD.

cisco-system-traffic-generator / trex-core

TCP congestion control algorithm issue #615