aws / s2n-quic

An implementation of the IETF QUIC protocol
https://crates.io/crates/s2n-quic
Apache License 2.0

Slow on localhost with default settings #1603

Open dignifiedquire opened 1 year ago

dignifiedquire commented 1 year ago

I have been testing s2n-quic on localhost and am seeing a speed of around 1.5 Gbit/s. Running iperf3 on the same machine with UDP, I see more than 6 Gbit/s. I was wondering if there is something I could adjust in the configuration to improve this, or if this is a known issue. Some additional info:
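(For reference, an iperf3 UDP loopback run of this kind generally looks like the following; the exact flags used weren't captured in the report, so treat them as illustrative.)

$ iperf3 -s
$ iperf3 -c 127.0.0.1 -u -b 0 -l 1450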

camshaft commented 1 year ago

Hmm, that seems quite low. You can try using our netbench tool. I'm able to achieve almost 8 Gbit/s on my machine:

$ cd netbench
$ cargo build --release
$ ./target/release/netbench-scenarios --request_response.response_size=10GiB
$ PORT=3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-quic-server
$ SERVER_0=localhost:3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-quic-client
0:00:01.000382 throughput: rx=983.06MBps tx=999Bps
0:00:02.001212 throughput: rx=930.97MBps tx=0Bps
0:00:03.002278 throughput: rx=931.40MBps tx=0Bps
0:00:04.003062 throughput: rx=928.28MBps tx=0Bps
0:00:05.004160 throughput: rx=930.81MBps tx=0Bps
0:00:06.005241 throughput: rx=928.78MBps tx=0Bps
0:00:07.006382 throughput: rx=930.50MBps tx=0Bps
0:00:08.007209 throughput: rx=929.02MBps tx=0Bps
0:00:09.008089 throughput: rx=931.42MBps tx=0Bps
0:00:10.009223 throughput: rx=930.06MBps tx=0Bps
0:00:11.010127 throughput: rx=933.53MBps tx=0Bps

One thing to note is that macOS doesn't support the same optimized UDP APIs as Linux (namely sendmmsg and GSO), so throughput there will generally be lower as well.

However, it's important to keep in mind that cleartext and encrypted traffic aren't really comparable; encryption takes a significant toll on raw throughput. For example, if you compare plain TCP to TCP/TLS, you see a similar effect:

$ PORT=3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-tls-server
$ SERVER_0=localhost:3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-tls-client
0:00:01.000698 throughput: rx=1.33GBps tx=999Bps
0:00:02.001713 throughput: rx=1.31GBps tx=0Bps
0:00:03.002700 throughput: rx=1.31GBps tx=0Bps
0:00:04.003702 throughput: rx=1.31GBps tx=0Bps
0:00:05.004695 throughput: rx=1.31GBps tx=0Bps
0:00:06.005701 throughput: rx=1.31GBps tx=0Bps
0:00:07.006694 throughput: rx=1.31GBps tx=0Bps
0:00:08.007694 throughput: rx=1.31GBps tx=0Bps
$ PORT=3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-tcp-server
$ SERVER_0=localhost:3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-tcp-client
0:00:01.000366 throughput: rx=4.79GBps tx=999Bps
0:00:02.001359 throughput: rx=5.13GBps tx=0Bps
0:00:03.002367 throughput: rx=4.99GBps tx=0Bps
0:00:04.003366 throughput: rx=5.07GBps tx=0Bps
0:00:05.004367 throughput: rx=4.93GBps tx=0Bps
0:00:06.005389 throughput: rx=5.07GBps tx=0Bps

We do have some optimizations planned for this year to close the gap between s2n-quic and TCP/TLS, so that should improve.

dignifiedquire commented 1 year ago

Thanks for the quick response.

This is what I get running on my Linux machine:

netbench-driver-s2n-quic-server

0:00:09.009077 throughput: rx=588.97MBps tx=0Bps
0:00:10.010063 throughput: rx=587.80MBps tx=0Bps
0:00:11.011046 throughput: rx=588.30MBps tx=0Bps
0:00:12.012046 throughput: rx=586.50MBps tx=0Bps
0:00:13.013223 throughput: rx=587.86MBps tx=0Bps
0:00:14.013751 throughput: rx=590.16MBps tx=0Bps
0:00:15.015065 throughput: rx=589.67MBps tx=0Bps
0:00:16.015823 throughput: rx=589.25MBps tx=0Bps
0:00:17.016755 throughput: rx=589.14MBps tx=0Bps
0:00:18.017976 throughput: rx=587.93MBps tx=0Bps

netbench-driver-s2n-tls-client

0:00:01.000043 throughput: rx=1.09GBps tx=999Bps
0:00:02.001035 throughput: rx=1.14GBps tx=0Bps
0:00:03.002024 throughput: rx=1.14GBps tx=0Bps
0:00:04.003027 throughput: rx=1.14GBps tx=0Bps
0:00:05.004023 throughput: rx=1.14GBps tx=0Bps
0:00:06.005036 throughput: rx=1.14GBps tx=0Bps
0:00:07.006032 throughput: rx=1.14GBps tx=0Bps
0:00:08.007030 throughput: rx=1.14GBps tx=0Bps
0:00:09.008024 throughput: rx=1.14GBps tx=0Bps

netbench-driver-tcp-client

0:00:01.000624 throughput: rx=3.22GBps tx=999Bps
0:00:02.001617 throughput: rx=3.31GBps tx=0Bps
0:00:03.002623 throughput: rx=3.32GBps tx=0Bps

dignifiedquire commented 1 year ago

On my Mac I am getting

Error: "The connection was closed because the handshake took longer than the max handshake duration of 10s"

on the client side when running the QUIC driver.

camshaft commented 1 year ago

Our macOS bindings have some issues with dual-stack IP sockets. For some reason the socket isn't able to receive responses. This is noted in the netbench readme:

https://github.com/aws/s2n-quic/tree/main/netbench/netbench-driver#running-driver-tests

Note: if the netbench driver is being run on a mac, set the local IP on the client driver to 0.0.0.0 as follows: --local-ip 0.0.0.0

We have a pending issue to investigate this and fix it.
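Applied to the client command from above, that workaround would look like:

$ SERVER_0=localhost:3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-quic-client --local-ip 0.0.0.0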

Are your plans to run over localhost in production? Or is this just for getting an idea of performance? Generally, UDP over loopback is more expensive than sending/receiving on an actual NIC, especially one with GSO support.
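(For measuring over an actual NIC, the same drivers can be run on two hosts, pointing the client at the server's LAN address instead of localhost; 192.168.1.10 below is just a placeholder.)

$ PORT=3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-quic-server
$ SERVER_0=192.168.1.10:3000 SCENARIO=./target/netbench/request_response.json ./target/release/netbench-driver-s2n-quic-client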

dignifiedquire commented 1 year ago

Are your plans to run over localhost in production

My use for localhost is twofold: (1) I usually use it as a default benchmark to test the overhead of other things when the network is "removed"; (2) I was experimenting with QUIC as an RPC layer, in which case it would be used both on localhost and on a local network.

dignifiedquire commented 1 year ago

Our macOS bindings have some issues with dual-stack IP sockets. For some reason the socket isn't able to receive responses. This is noted in the netbench readme:

Thanks, I missed that

camshaft commented 1 year ago

I have a very-much-WIP branch (it only works on Linux at the moment) that is able to push 16 Gbit/s over localhost on my machine, which doubles what we do today and actually exceeds the perf of the TCP/TLS drivers.

https://github.com/aws/s2n-quic/tree/camshaft/multi-socket
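To try it out, one would check out that branch and re-run the netbench steps above against it (standard git workflow, branch name taken from the link):

$ git clone https://github.com/aws/s2n-quic.git
$ cd s2n-quic
$ git checkout camshaft/multi-socket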

I'm hoping to get all of this cleaned up and merged in the coming weeks.

dignifiedquire commented 1 year ago

@camshaft Very cool! Any high-level comments on what you did to make this happen?

camshaft commented 1 year ago

There are a few things in there