F-Stack / f-stack

F-Stack is a user-space network development kit with high performance, based on DPDK, the FreeBSD TCP/IP stack and a coroutine API.
http://www.f-stack.org

Nginx benchmark results in the CPS test #519

Open · krizhanovsky opened this issue 4 years ago

krizhanovsky commented 4 years ago

Hi,

we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came across your project.

I noticed in your performance data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all with an increasing number of CPUs - why? Even with some hard lock contention, I would not expect to see an absolutely flat performance curve for the Linux kernel and Nginx. To me it seems there is some misconfiguration of Nginx... Could you please share the Nginx configuration file used for the test? I would also very much appreciate it if you could show perf top output for Nginx/Linux.

Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to drive our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?

Thanks in advance!

krizhanovsky commented 4 years ago

Well, I did notice a difference in the throughput numbers: 0.34 for 1 CPU and up to 0.48 on 12 CPUs, so the difference is about 40%. Assuming the RPS ratio is the same, the curve is still far too flat...

Also, https://github.com/F-Stack/f-stack#nginx-testing-result says that you used Linux 3.10.104, which was released in October 2016 and is just a patch release of the original 3.10 from 2013. Given that there have been a lot of scalability improvements in the Linux TCP/IP stack during these 7 years, I'm wondering whether you have a performance comparison against newer Linux TCP/IP stacks?

vincentmli commented 4 years ago

> Hi,
>
> we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came across your project.
>
> I noticed in your performance data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all with an increasing number of CPUs - why? Even with some hard lock contention, I would not expect to see an absolutely flat performance curve for the Linux kernel and Nginx.

It is most likely an interrupt bottleneck, since the driver in the Linux kernel runs in a combined interrupt and poll mode (NAPI). I have a video that shows this:

https://youtu.be/d0vPUwJT1mw - at 1:34, ksoftirqd is at 100% CPU usage under load, yet Nginx's CPU usage is still fine.

> Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to drive our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?

The DPDK/mTCP project ported a multithreaded version of Apache Bench that can generate load for high-performance HTTP server testing. I also made a PR adding HTTPS/SSL load testing https://github.com/mtcp-stack/mtcp/pull/285; the Apache Bench statistics seem broken, but it does the job of generating a high connection rate against the web server.

krizhanovsky commented 4 years ago

Hi @vincentmli ,

thank you very much for sharing the video - I really enjoyed watching it (it was also quite interesting to learn more about BIG-IP traffic handling).

Now I see that the benchmark I cared about, CPS, is for Nginx running without listen reuseport, which is an Nginx misconfiguration if one benchmarks connections per second. See the Nginx blog post and an LWN article:

> The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases. In early discussions on SO_REUSEPORT, Tom noted that he was dealing with applications that accepted 40,000 connections per second.

The relevant benchmark result is CPS_Reuseport, where Nginx does scale on the Linux TCP/IP stack.
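For reference, the reuseport flag on the listen directive (listen 80 reuseport;, available since nginx 1.9.1) maps to the SO_REUSEPORT socket option: each worker opens its own listening socket on the same address/port, and the kernel distributes incoming connections between them instead of funneling every accept() through one shared listen socket. A minimal C sketch of what each worker does at the socket level (illustrative only, not taken from the nginx sources):

```c
/* Minimal illustration of SO_REUSEPORT (illustrative sketch, not nginx
 * code): every worker runs this and gets its own listening socket bound
 * to the same port; the kernel then load-balances new connections across
 * the sockets, so there is no single shared accept queue to contend on. */
#include <netinet/in.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int open_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        exit(1);

    int one = 1;
    /* The key difference from a classic single shared listener. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
        exit(1);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0)
        exit(1);

    return fd; /* each worker accept()s on its own descriptor */
}

int main(void)
{
    int fd = open_reuseport_listener(8080); /* hypothetical test port */
    /* A real worker would now run its accept()/event loop. */
    close(fd);
    return 0;
}
```

Without reuseport, all workers accept from the same listen socket, and connection setup serializes on that socket's lock and accept queue, which is what can flatten the CPS curve.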

The next question is about the Nginx configuration files for the F-Stack and Linux TCP/IP stack cases. I had a look at https://github.com/F-Stack/f-stack/blob/dev/app/nginx-1.16.1/conf/nginx.conf and there is an issue with the filesystem. You have switched sendfile() and access_log off and use a static string for the 600-byte response instead of a file. Do you know of any real Nginx setup that doesn't use the filesystem at all? In the worst case I'd expect to see the Nginx files on tmpfs - a more or less realistic case. But I reckon the numbers won't be so nice for F-Stack if it uses real filesystem access.

Which configuration files were used for the benchmark? What were the Linux sysctl settings? Which steps were taken to optimize Nginx and the Linux TCP/IP stack to make the comparison fair? Was virtio-net multiqueue used for the Linux TCP/IP stack benchmarks?

vincentmli commented 4 years ago

> The next question is about the Nginx configuration files for the F-Stack and Linux TCP/IP stack cases. I had a look at https://github.com/F-Stack/f-stack/blob/dev/app/nginx-1.16.1/conf/nginx.conf and there is an issue with the filesystem. You have switched sendfile() and access_log off and use a static string for the 600-byte response instead of a file. Do you know of any real Nginx setup that doesn't use the filesystem at all? In the worst case I'd expect to see the Nginx files on tmpfs - a more or less realistic case. But I reckon the numbers won't be so nice for F-Stack if it uses real filesystem access.

F-Stack's improvements are at the NIC driver level (a userspace DPDK poll-mode driver versus the Linux kernel's combined interrupt/poll model, NAPI), so with F-Stack you get the DPDK benefit. The problem with DPDK is the lack of a mature TCP/IP stack; F-Stack glues the FreeBSD TCP/IP stack and DPDK together to solve that (as I understand it, F-Stack has also done some custom work in the FreeBSD TCP/IP stack to fit the DPDK model).
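To make that concrete, here is a rough sketch of the programming model F-Stack exposes (names taken from the examples shipped under example/ in this repository; treat the exact signatures as approximate rather than authoritative): ff_init() brings up DPDK and the ported FreeBSD stack, ff_run() drives the per-core poll loop, and the POSIX-like ff_* calls replace the kernel socket calls.

```c
/* Rough sketch of an F-Stack application skeleton (signatures approximated
 * from the F-Stack examples, not a drop-in program). ff_init() parses
 * config.ini and initializes the DPDK EAL plus the ported FreeBSD stack;
 * ff_run() then busy-polls the NIC queues and calls the supplied loop
 * function on every iteration. */
#include "ff_api.h"
#include <netinet/in.h>
#include <sys/socket.h>

static int sockfd;

/* Invoked by F-Stack on every iteration of the DPDK poll loop. */
static int loop(void *arg)
{
    (void)arg;
    /* Poll ready sockets here with ff_epoll_wait()/ff_kevent() and serve
     * them with ff_accept()/ff_read()/ff_write(). */
    return 0;
}

int main(int argc, char *argv[])
{
    ff_init(argc, argv);        /* config.ini, DPDK EAL, FreeBSD stack init */

    sockfd = ff_socket(AF_INET, SOCK_STREAM, 0);
    /* ff_bind()/ff_listen()/event registration omitted for brevity */

    ff_run(loop, NULL);         /* never returns; busy-polls the NIC queues */
    return 0;
}
```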

The sendfile and access_log settings are Nginx configuration that should be irrelevant to F-Stack: F-Stack improves network I/O, not filesystem I/O such as sendfile/access_log. Though it would be interesting to test with and without sendfile/access_log, since slow filesystem I/O could potentially affect network I/O if the network is waiting for data from the filesystem to transmit.

> Which configuration files were used for the benchmark? What were the Linux sysctl settings? Which steps were taken to optimize Nginx and the Linux TCP/IP stack to make the comparison fair? Was virtio-net multiqueue used for the Linux TCP/IP stack benchmarks?

I can't speak for the F-Stack guys since I am just an observer. The Linux TCP/IP stack is a very complex stack and kind of bloated (in my opinion :)). I believe the F-Stack benchmark was run on physical hardware, not on VM virtio-net under KVM/QEMU; virtio-net does not support RSS, so you can only run F-Stack on a single core with a single queue. You could use an SR-IOV VF that supports RSS offload on a hardware NIC to scale in a multi-core VM with multiple queues.

krizhanovsky commented 4 years ago

Hi @vincentmli ,

> (as I understand it, F-Stack has also done some custom work in the FreeBSD TCP/IP stack to fit the DPDK model)

Well, I made some observations. E.g. there is a problem with the sockets hash table in Linux. I checked F-Stack and it seems the same problem exists there. The hash is struct inpcbinfo, declared in freebsd/netinet/in_pcb.h and scanned, for example, by an in_pcblookup_mbuf() call from the tcp_input() function. In the hash lookup function we see a read lock quite similar to the one in Linux:

        static struct inpcb *
        in_pcblookup_hash(...)
        {
            struct inpcb *inp;

            INP_HASH_RLOCK(pcbinfo);
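            /* Lookup runs under the shared (read) lock on the global PCB hash taken above - the contention point discussed in the text. */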
            inp = in_pcblookup_hash_locked(...);
            ...

> The sendfile and access_log settings are Nginx configuration that should be irrelevant to F-Stack: F-Stack improves network I/O

Agreed. I mentioned the filesystem I/O because Nginx is practically unusable without it, even in pure non-caching proxy mode, so the benchmarks are somewhat theoretical.

> You could use an SR-IOV VF that supports RSS offload on a hardware NIC to scale in a multi-core VM with multiple queues.

Unfortunately, at the moment I have no SR-IOV capable NICs to test with, but I'm wondering whether SR-IOV can be used in a VM the same way as a physical NIC in a hardware server? I.e. it seems that with SR-IOV we can coalesce interrupts on the NIC inside a VM and tune the ksoftirqd threads for polling mode. This way we get very close to a DPDK solution, but one that doesn't burn power while the system is idle.

vincentmli commented 4 years ago

> Unfortunately, at the moment I have no SR-IOV capable NICs to test with, but I'm wondering whether SR-IOV can be used in a VM the same way as a physical NIC in a hardware server? I.e. it seems that with SR-IOV we can coalesce interrupts on the NIC inside a VM and tune the ksoftirqd threads for polling mode. This way we get very close to a DPDK solution, but one that doesn't burn power while the system is idle.

DPDK also supports interrupts; there is an example at https://github.com/DPDK/dpdk/tree/master/examples/l3fwd-power

krizhanovsky commented 4 years ago

That's not real hardware interrupt handling - the example uses the epoll(7) system call (on Linux) for polling.

vincentmli commented 4 years ago

> That's not real hardware interrupt handling - the example uses the epoll(7) system call (on Linux) for polling.

https://github.com/DPDK/dpdk/blob/master/examples/l3fwd-power/main.c#L860 turns hardware interrupts on/off

krizhanovsky commented 4 years ago

Yeah, I meant that you still need to go into the kernel and back (i.e. make two context switches) if you use epoll(). There are no interrupt handlers in user space.

vincentmli commented 4 years ago

> Yeah, I meant that you still need to go into the kernel and back (i.e. make two context switches) if you use epoll(). There are no interrupt handlers in user space.

As far as I understand from reading the code, DPDK handles the interrupt from userspace; epoll is just for event polling, not interrupt handling: https://github.com/DPDK/dpdk/blob/master/lib/librte_eal/linux/eal_interrupts.c#L1167
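For what it's worth, the pattern l3fwd-power implements looks roughly like the sketch below. rte_eth_rx_burst(), rte_eth_dev_rx_intr_enable()/rte_eth_dev_rx_intr_disable() and rte_epoll_wait() are real DPDK calls; the surrounding loop is my condensed paraphrase of the example, not a copy of it. The core busy-polls the RX queue, and only after it has been idle for a while does it arm the RX interrupt and park itself in rte_epoll_wait() until the NIC signals new packets.

```c
/* Condensed paraphrase of the l3fwd-power idea: busy-poll the RX queue;
 * when it stays empty, arm the RX interrupt and sleep in rte_epoll_wait()
 * until the NIC wakes us up, then go back to polling. Assumes the port
 * was configured with intr_conf.rxq = 1 and the queue's interrupt vector
 * was registered with rte_eth_dev_rx_intr_ctl_q() during setup. */
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_interrupts.h>
#include <rte_mbuf.h>

#define BURST 32
#define IDLE_LIMIT 100   /* empty polls before going to sleep (arbitrary) */

static void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST];
    unsigned int idle = 0;

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);
        if (n > 0) {
            idle = 0;
            for (uint16_t i = 0; i < n; i++)
                rte_pktmbuf_free(pkts[i]);   /* stand-in for real processing */
            continue;
        }

        if (++idle < IDLE_LIMIT)
            continue;

        /* Nothing arriving: enable the RX interrupt and block until the
         * NIC raises it. The interrupt is delivered to user space as an
         * eventfd event by the kernel's VFIO/UIO driver, and that is what
         * rte_epoll_wait() sleeps on. A production loop should re-check
         * the queue after arming the interrupt to avoid a missed wakeup;
         * elided here for brevity. */
        rte_eth_dev_rx_intr_enable(port, queue);
        struct rte_epoll_event ev;
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1 /* block */);
        rte_eth_dev_rx_intr_disable(port, queue);
        idle = 0;
    }
}
```

So the hard IRQ itself is still handled in the kernel by the VFIO/UIO module; user space is only woken up through the eventfd afterwards.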

krizhanovsky commented 4 years ago

I happened to come back to this thread with the same question about interrupt handling in DPDK, and I found the answer in a Stack Overflow discussion https://stackoverflow.com/questions/53892565/dpdk-interrupts-rather-than-polling , so there is no real interrupt handling in DPDK.

osevan commented 3 years ago

> Hi,
>
> we're testing our in-kernel HTTPS proxy against Nginx and comparing our results with kernel-bypass proxies, which is how I came across your project.
>
> I noticed in your performance data https://github.com/F-Stack/f-stack/blob/dev/CPS.png that Nginx on top of the Linux TCP/IP stack doesn't scale at all with an increasing number of CPUs - why? Even with some hard lock contention, I would not expect to see an absolutely flat performance curve for the Linux kernel and Nginx. To me it seems there is some misconfiguration of Nginx... Could you please share the Nginx configuration file used for the test? I would also very much appreciate it if you could show perf top output for Nginx/Linux.
>
> Also, we found it quite problematic to generate enough load to test a high-performance HTTP server. In our case we needed more than 40 cores and two 10G NICs running wrk to drive our server to 100% resource usage on 4 cores. What did you use to get the maximum results for 20 cores?
>
> Thanks in advance!

Could you tell me the price for keeping the Tempesta patches compatible with the newest kernel and the latest GCC and LLVM compilers, for the one web server that I have?

I couldn't find any prices.

I tested Tempesta 2 years ago and it was cool, with the module compiling and so on.

Please share the pricing with us.

krizhanovsky commented 3 years ago

Hi @osevan ,

thank you for the request! Could you please drop me a message at ak@tempesta-tech.com or, better, schedule a call at https://calendly.com/tempesta-tech/30min , so we can discuss your scenario and talk about Tempesta FW's capabilities.