F-Stack / f-stack

F-Stack is a user-space network development kit with high performance, based on DPDK, the FreeBSD TCP/IP stack, and a coroutine API.
http://www.f-stack.org

Performance degrades when assigning multiple cores #150

Open SungHoHong2 opened 6 years ago

SungHoHong2 commented 6 years ago

Hello f-stack team, I am currently comparing DPDK performance between f-stack and seastar, and it seems I am missing something regarding the configuration.

As I increase the number of cores by changing my configuration, for example running the f-stack epoll sample on 2 cores, the performance drops.

I run the configuration file below with start.sh:

sudo ./start.sh -b ./fstack-server -c config_server.ini
[dpdk]
## Hexadecimal bitmask of cores to run on.
lcore_mask=3
channel=4
promiscuous=1
numa_on=1
## TCP segment offload, default: disabled.
tso=0
## HW vlan strip, default: enabled.
vlan_strip=1

# enabled port list
#
# EBNF grammar:
#
#    exp      ::= num_list {"," num_list}
#    num_list ::= <num> | <range>
#    range    ::= <num>"-"<num>
#    num      ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
#
# examples
#    0-3       ports 0, 1,2,3 are enabled
#    1-3,4,7   ports 1,2,3,4,7 are enabled
port_list=1

## Port config section
## Correspond to dpdk.port_list's index: port0, port1...
[port1]
addr=10.107.30.40
netmask=255.255.254.0
broadcast=10.107.31.255
gateway=10.107.30.1

## lcore list used to handle this port
## the format is same as port_list
# lcore_list= 0

## Packet capture path, this will hurt performance
#pcap=./a.pcap

## Kni config: if enabled and method=reject,
## all packets that do not belong to the following tcp_port and udp_port
## will transmit to kernel; if method=accept, all packets that belong to
## the following tcp_port and udp_port will transmit to kernel.
#[kni]
#enable=1
#method=reject
## The format is same as port_list
#tcp_port=80,443
#udp_port=53

## FreeBSD network performance tuning configurations.
## Most native FreeBSD configurations are supported.
[freebsd.boot]
hz=100

## Block out a range of descriptors to avoid overlap
## with the kernel's descriptor space.
## You can increase this value according to your app.
fd_reserve=1024

kern.ipc.maxsockets=262144
net.inet.tcp.syncache.hashsize=4096
net.inet.tcp.syncache.bucketlimit=100
net.inet.tcp.tcbhashsize=65536

[freebsd.sysctl]
kern.ipc.somaxconn=32768
kern.ipc.maxsockbuf=16777216
net.link.ether.inet.maxhold=5

net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.sendspace=16384
net.inet.tcp.recvspace=8192
net.inet.tcp.nolocaltimewait=1
net.inet.tcp.cc.algorithm=cubic
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.sack.enable=1
net.inet.tcp.blackhole=1
net.inet.tcp.msl=2000
net.inet.tcp.delayed_ack=0

net.inet.udp.blackhole=1
net.inet.ip.redirect=0

Am I missing something? The performance degrades greatly when I use 2 cores. When I run the wrk benchmark against the server running on only one core, I get

  4 threads and 4 connections                      
  Latency    98.39us      
  Transfer/sec:      24.78MB     

and when I run it with 2 cores, I get

4 threads and 4 connections                      
Latency    98.57us           
Transfer/sec:      6.19MB                          

As you can see, throughput drops sharply (from 24.78 MB/s to 6.19 MB/s).

Is there something I missed within the configuration?

whl739 commented 6 years ago

It doesn't make sense, but I didn't see anything abnormal.

SungHoHong2 commented 6 years ago

So the way I have conducted the tests is correct, right?

  1. Mask the running cores with lcore_mask (see the sketch below).
  2. Run start.sh, which launches the program on that number of cores.
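
For clarity, the hexadecimal lcore_mask is just a bitmask with one bit per enabled core; here is a minimal sketch of the arithmetic (plain C, not f-stack code):

#include <stdio.h>

int main(void) {
    /* cores I want f-stack to run on */
    int cores[] = {0, 1};
    unsigned long lcore_mask = 0;

    for (int i = 0; i < (int)(sizeof(cores) / sizeof(cores[0])); i++)
        lcore_mask |= 1UL << cores[i];   /* set one bit per core */

    printf("lcore_mask=%lx\n", lcore_mask);  /* prints lcore_mask=3 */
    return 0;
}

So lcore_mask=3 selects cores 0 and 1, lcore_mask=f would select cores 0-3, and so on.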

The strange thing is that it does not seem to connect reliably. Ping works, but on several occasions connections could not be established, which suggests the server is not working well: it only works occasionally, and sometimes the client cannot reach the server host at all.

I will show you what I get from running 2 cores

sungho@c3n24:/data1/sungho/DPDK-Experiment/latency-experiment/f-stack-tcp$ sudo ./start.sh -b ./fstack-server -c config_server.ini
./fstack-server --conf config_server.ini --proc-type=primary --proc-id=0             
[dpdk]: lcore_mask=3                                                                 
[dpdk]: channel=4                                                                    
[dpdk]: promiscuous=1                                                                
[dpdk]: numa_on=1                                                                    
[dpdk]: tso=0                                                                        
[dpdk]: vlan_strip=1                                                                 
[dpdk]: port_list=1                                                                  
[port1]: addr=10.107.30.40                                                           
[port1]: netmask=255.255.254.0                                                       
[port1]: broadcast=10.107.31.255                                                     
[port1]: gateway=10.107.30.1                                                         

[freebsd.boot]: hz=100                                    
[freebsd.boot]: fd_reserve=1024                           
[freebsd.boot]: kern.ipc.maxsockets=262144                
[freebsd.boot]: net.inet.tcp.syncache.hashsize=4096       
[freebsd.boot]: net.inet.tcp.syncache.bucketlimit=100     
[freebsd.boot]: net.inet.tcp.tcbhashsize=65536                                                                  
[freebsd.sysctl]: kern.ipc.somaxconn=32768                                        
[freebsd.sysctl]: kern.ipc.maxsockbuf=16777216                                    
[freebsd.sysctl]: net.link.ether.inet.maxhold=5                                   
[freebsd.sysctl]: net.inet.tcp.fast_finwait2_recycle=1                            
[freebsd.sysctl]: net.inet.tcp.sendspace=16384                                    
[freebsd.sysctl]: net.inet.tcp.recvspace=8192                                     
[freebsd.sysctl]: net.inet.tcp.nolocaltimewait=1                                  
[freebsd.sysctl]: net.inet.tcp.cc.algorithm=cubic                                 
[freebsd.sysctl]: net.inet.tcp.sendbuf_max=16777216                               
[freebsd.sysctl]: net.inet.tcp.recvbuf_max=16777216                               
[freebsd.sysctl]: net.inet.tcp.sendbuf_auto=1                                     
[freebsd.sysctl]: net.inet.tcp.recvbuf_auto=1                                     
[freebsd.sysctl]: net.inet.tcp.sendbuf_inc=16384                                  
[freebsd.sysctl]: net.inet.tcp.recvbuf_inc=524288                                 
[freebsd.sysctl]: net.inet.tcp.sack.enable=1                                      
[freebsd.sysctl]: net.inet.tcp.blackhole=1                                        
[freebsd.sysctl]: net.inet.tcp.msl=2000                                           
[freebsd.sysctl]: net.inet.tcp.delayed_ack=0                                      
[freebsd.sysctl]: net.inet.udp.blackhole=1                                        
[freebsd.sysctl]: net.inet.ip.redirect=0                                          
EAL: Detected 32 lcore(s)                                                         
EAL: Probing VFIO support...                                                      
EAL: PCI device 0000:04:00.0 on NUMA socket 0                                     
EAL:   probe driver: 8086:1521 rte_igb_pmd                                        
EAL: PCI device 0000:04:00.1 on NUMA socket 0                                     
EAL:   probe driver: 8086:1521 rte_igb_pmd                                        
EAL: PCI device 0000:81:00.0 on NUMA socket 1                                     
EAL:   probe driver: 15b3:1003 librte_pmd_mlx4                                    
PMD: librte_pmd_mlx4: PCI information matches, using device "mlx4_0" (VF: false)  
PMD: librte_pmd_mlx4: 2 port(s) detected                                          
PMD: librte_pmd_mlx4: port 1 MAC address is e4:1d:2d:d9:cb:80                     
PMD: librte_pmd_mlx4: port 2 MAC address is e4:1d:2d:d9:cb:81                     
lcore: 0, port: 1, queue: 0                                                       
create mbuf pool on socket 0                                                      
create ring:dispatch_ring_p1_q0 success, 2047 ring entries are now free!          
create ring:dispatch_ring_p1_q1 success, 2047 ring entries are now free!          
Port 1 MAC: e4 1d 2d d9 cb 81                                                     
TSO is disabled                                                                   
PMD: librte_pmd_mlx4: 0x8b7880: TX queues number update: 0 -> 2                   
PMD: librte_pmd_mlx4: 0x8b7880: RX queues number update: 0 -> 2                   
set port 1 to promiscuous mode ok                                                 

Checking link statusdone                                                              
Port 1 Link Up - speed 10000 Mbps - full-duplex                                       
link_elf_lookup_symbol: missing symbol hash table                                     
link_elf_lookup_symbol: missing symbol hash table                                     
netisr_init: forcing maxthreads from 1 to 0                                           
Timecounters tick every 10.000 msec                                                   
Timecounter "ff_clock" frequency 100 Hz quality 1                                     
f-stack-1: Ethernet address: e4:1d:2d:d9:cb:81                                        
server pktsize: 3000                                                                  
server is running...                                                                  
./fstack-server --conf config_server.ini --proc-type=secondary --proc-id=1            
sungho@c3n24:/data1/sungho/DPDK-Experiment/latency-experiment/f-stack-tcp$ [dpdk]: lcore_mask=3
[dpdk]: channel=4                                                                     
[dpdk]: promiscuous=1                                                                 
[dpdk]: numa_on=1                                                                     
[dpdk]: tso=0                                                                         
[dpdk]: vlan_strip=1                                                                  
[dpdk]: port_list=1                                                                   
[port1]: addr=10.107.30.40                                                            
[port1]: netmask=255.255.254.0                                                        
[port1]: broadcast=10.107.31.255                                                      
[port1]: gateway=10.107.30.1                                                          
[freebsd.boot]: hz=100                                                                
[freebsd.boot]: fd_reserve=1024                                                       
[freebsd.boot]: kern.ipc.maxsockets=262144                                            
[freebsd.boot]: net.inet.tcp.syncache.hashsize=4096                                   
[freebsd.boot]: net.inet.tcp.syncache.bucketlimit=100                                 
[freebsd.boot]: net.inet.tcp.tcbhashsize=65536                                        
[freebsd.sysctl]: kern.ipc.somaxconn=32768                                            
[freebsd.sysctl]: kern.ipc.maxsockbuf=16777216                                        
[freebsd.sysctl]: net.link.ether.inet.maxhold=5                                       
[freebsd.sysctl]: net.inet.tcp.fast_finwait2_recycle=1                            
[freebsd.sysctl]: net.inet.tcp.sendspace=16384                                    
[freebsd.sysctl]: net.inet.tcp.recvspace=8192                                     
[freebsd.sysctl]: net.inet.tcp.nolocaltimewait=1                                  
[freebsd.sysctl]: net.inet.tcp.cc.algorithm=cubic                                 
[freebsd.sysctl]: net.inet.tcp.sendbuf_max=16777216                               
[freebsd.sysctl]: net.inet.tcp.recvbuf_max=16777216                               
[freebsd.sysctl]: net.inet.tcp.sendbuf_auto=1                                     
[freebsd.sysctl]: net.inet.tcp.recvbuf_auto=1                                     
[freebsd.sysctl]: net.inet.tcp.sendbuf_inc=16384                                  
[freebsd.sysctl]: net.inet.tcp.recvbuf_inc=524288                                 
[freebsd.sysctl]: net.inet.tcp.sack.enable=1                                      
[freebsd.sysctl]: net.inet.tcp.blackhole=1 
[freebsd.sysctl]: net.inet.tcp.msl=2000    
[freebsd.sysctl]: net.inet.tcp.delayed_ack=0                                      
[freebsd.sysctl]: net.inet.udp.blackhole=1 
[freebsd.sysctl]: net.inet.ip.redirect=0   
EAL: Detected 32 lcore(s)                  
EAL: Probing VFIO support...               
EAL: WARNING: Address Space Layout Randomization (ASLR) is enabled in the kernel. 
EAL:    This may cause issues with mapping memory into secondary processes        
EAL: PCI device 0000:04:00.0 on NUMA socket 0                                     
EAL:   probe driver: 8086:1521 rte_igb_pmd 
EAL: PCI device 0000:04:00.1 on NUMA socket 0                                     
EAL:   probe driver: 8086:1521 rte_igb_pmd 
EAL: PCI device 0000:81:00.0 on NUMA socket 1                                     
EAL:   probe driver: 15b3:1003 librte_pmd_mlx4                                    
PMD: librte_pmd_mlx4: PCI information matches, using device "mlx4_0" (VF: false)  
PMD: librte_pmd_mlx4: 2 port(s) detected   
PMD: librte_pmd_mlx4: port 1 MAC address is e4:1d:2d:d9:cb:80                     
PMD: librte_pmd_mlx4: port 2 MAC address is e4:1d:2d:d9:cb:81                     
lcore: 1, port: 1, queue: 1                
create mbuf pool on socket 0               
create ring:dispatch_ring_p1_q0 success, 2047 ring entries are now free!          
create ring:dispatch_ring_p1_q1 success, 2047 ring entries are now free!          
Port 1 MAC: e4 1d 2d d9 cb 81              
TSO is disabled      
link_elf_lookup_symbol: missing symbol hash table                                 
link_elf_lookup_symbol: missing symbol hash table                                 
netisr_init: forcing maxthreads from 1 to 0                                       
Timecounters tick every 10.000 msec        
Timecounter "ff_clock" frequency 100 Hz quality 1                                 
f-stack-1: Ethernet address: e4:1d:2d:d9:cb:81                                    
server pktsize: 3000 
server is running... 

I can't see anything missing at all. Can you?

whl739 commented 6 years ago

I don't see anything wrong with the program, so there must be something wrong with your environment. The performance should nearly double with 2 cores.

SungHoHong2 commented 6 years ago

I have compared the results on the workstation (where multiple cores work fine) and on the cluster (where multiple cores degrade the performance).

Below is the full workstation output; the cluster output is the same as the log I pasted in my previous comment.

One noticeable thing I found is that the workstation prints messages that the cluster server doesn't. Are you familiar with these messages?

TX ip checksum offload supported      
TX TCP&UDP checksum offload supported 
./fstack-server --conf config_server.ini --proc-type=primary --proc-id=0
[dpdk]: lcore_mask=3
[dpdk]: channel=4
[dpdk]: promiscuous=1
[dpdk]: numa_on=1
[dpdk]: tso=0
[dpdk]: vlan_strip=1
[dpdk]: port_list=1
[port1]: addr=10.218.111.253
[port1]: netmask=255.255.248.0
[port1]: broadcast=10.218.111.255
[port1]: gateway=10.218.111.1
[freebsd.boot]: hz=100
[freebsd.boot]: fd_reserve=1024
[freebsd.boot]: kern.ipc.maxsockets=262144
[freebsd.boot]: net.inet.tcp.syncache.hashsize=4096
[freebsd.boot]: net.inet.tcp.syncache.bucketlimit=100
[freebsd.boot]: net.inet.tcp.tcbhashsize=65536
[freebsd.sysctl]: (same values as the cluster log above)
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: PCI device 0000:00:19.0 on NUMA socket 0
EAL:   probe driver: 8086:153a rte_em_pmd
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.2 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
lcore: 0, port: 1, queue: 0
create mbuf pool on socket 0
create ring:dispatch_ring_p1_q0 success, 2047 ring entries are now free!
create ring:dispatch_ring_p1_q1 success, 2047 ring entries are now free!
Port 1 MAC: a0 36 9f 83 ab bd
RX checksum offload supported
TX ip checksum offload supported
TX TCP&UDP checksum offload supported
TSO is disabled
port[1]: rss table size: 128
set port 1 to promiscuous mode ok

Checking link status.......................done
Port 1 Link Up - speed 100 Mbps - full-duplex
link_elf_lookup_symbol: missing symbol hash table
link_elf_lookup_symbol: missing symbol hash table
netisr_init: forcing maxthreads from 1 to 0
Timecounters tick every 10.000 msec
Timecounter "ff_clock" frequency 100 Hz quality 1
f-stack-1: Ethernet address: a0:36:9f:83:ab:bd
server pktsize: 3000
server is running...
./fstack-server --conf config_server.ini --proc-type=secondary --proc-id=1
[dpdk]: lcore_mask=3
(... same configuration echo as the primary process ...)
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: WARNING: Address Space Layout Randomization (ASLR) is enabled in the kernel.
EAL:    This may cause issues with mapping memory into secondary processes
EAL: PCI device 0000:00:19.0 on NUMA socket 0
EAL:   probe driver: 8086:153a rte_em_pmd
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.2 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL: PCI device 0000:04:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
lcore: 1, port: 1, queue: 1
create mbuf pool on socket 0
create ring:dispatch_ring_p1_q0 success, 2047 ring entries are now free!
create ring:dispatch_ring_p1_q1 success, 2021 ring entries are now free!
Port 1 MAC: a0 36 9f 83 ab bd
RX checksum offload supported
TX ip checksum offload supported
TX TCP&UDP checksum offload supported
TSO is disabled
port[1]: rss table size: 128
link_elf_lookup_symbol: missing symbol hash table
link_elf_lookup_symbol: missing symbol hash table
netisr_init: forcing maxthreads from 1 to 0
Timecounters tick every 10.000 msec
Timecounter "ff_clock" frequency 100 Hz quality 1
f-stack-1: Ethernet address: a0:36:9f:83:ab:bd
server pktsize: 3000
server is running...

Do you think this has something to do with the problem on the cluster? My workstation uses an Intel e1000/igb NIC and the cluster uses a Mellanox ConnectX-3, and the cluster does not report the features checked below:

/* Set Rx checksum checking */
if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_IPV4_CKSUM) &&
    (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_UDP_CKSUM) &&
    (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_CKSUM)) {
    printf("RX checksum offload supported\n");
    port_conf.rxmode.hw_ip_checksum = 1;
    pconf->hw_features.rx_csum = 1;
}

if ((dev_info.tx_offload_capa & DEV_TX_OFFLOAD_IPV4_CKSUM)) {
    printf("TX ip checksum offload supported\n");
    pconf->hw_features.tx_csum_ip = 1;
}

if ((dev_info.tx_offload_capa & DEV_TX_OFFLOAD_UDP_CKSUM) &&
    (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_TCP_CKSUM)) {
    printf("TX TCP&UDP checksum offload supported\n");
    pconf->hw_features.tx_csum_l4 = 1;
}

if (dev_info.reta_size) {
    /* reta size must be power of 2 */
    assert((dev_info.reta_size & (dev_info.reta_size - 1)) == 0);

    rss_reta_size[port_id] = dev_info.reta_size;
    printf("port[%d]: rss table size: %d\n", port_id,
           dev_info.reta_size);
}

SungHoHong2 commented 6 years ago

After comparing the results between the workstation (where multiple cores work with f-stack) and the server (where multiple cores do not),

I think the server's failure to initialize the RSS function is the main reason multiple cores are not working properly,

because f-stack's default DPDK configuration enables RSS in ff_dpdk_if.c.

Do you think this is the main reason the multiple processes are not working? In seastar, assigning multiple processes creates multiple rings, but I think f-stack initializes multiple queues on the port when we assign multiple cores, and the current Mellanox ConnectX-3 driver may not support this properly.

static struct rte_eth_conf default_port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
        .max_rx_pkt_len = ETHER_MAX_LEN,
        .split_hdr_size = 0, /**< hdr buf size */
        .header_split   = 0, /**< Header Split disabled */
        .hw_ip_checksum = 0, /**< IP checksum offload disabled */
        .hw_vlan_filter = 0, /**< VLAN filtering disabled */
        .hw_vlan_strip  = 0, /**< VLAN strip disabled. */
        .hw_vlan_extend = 0, /**< Extended VLAN disabled. */
        .jumbo_frame    = 0, /**< Jumbo Frame Support disabled */
        .hw_strip_crc   = 0, /**< CRC stripped by hardware */
        .enable_lro     = 0, /**< LRO disabled */
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = default_rsskey_40bytes,
            .rss_key_len = 40,
            .rss_hf = ETH_RSS_PROTO_MASK,
        },
    },
    .txmode = {
        .mq_mode = ETH_MQ_TX_NONE,
    },
};
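
(For context, this is roughly how a DPDK application ends up with one RX/TX queue per lcore; just a sketch, not f-stack's exact code, and nb_lcores, mbuf_pool and port_id are placeholder names.)

#include <stdlib.h>
#include <rte_ethdev.h>
#include <rte_debug.h>

/* Sketch: request one RX and one TX queue per worker lcore so that each
 * core can poll its own queue; error handling is minimal. */
static void setup_queues_per_lcore(uint16_t port_id, uint16_t nb_lcores,
                                   struct rte_mempool *mbuf_pool)
{
    uint16_t q;

    if (rte_eth_dev_configure(port_id, nb_lcores, nb_lcores,
                              &default_port_conf) < 0)
        rte_exit(EXIT_FAILURE, "cannot configure port %d\n", port_id);

    for (q = 0; q < nb_lcores; q++) {
        /* each lcore later calls rte_eth_rx_burst(port_id, q, ...) on its own queue */
        rte_eth_rx_queue_setup(port_id, q, 512,
                               rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
        rte_eth_tx_queue_setup(port_id, q, 512,
                               rte_eth_dev_socket_id(port_id), NULL);
    }
}

With lcore_mask=3 this means two RX queues on port 1, which matches the "RX queues number update: 0 -> 2" line in the mlx4 log above.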

I assume this is the reason for the performance degradation in f-stack, but I need an expert's advice to confirm my theory.

whl739 commented 6 years ago

I think you are right. See the links below: http://dpdk.org/ml/archives/users/2016-June/000625.html http://dpdk.org/doc/guides/nics/mlx4.html

SungHoHong2 commented 6 years ago

Thanks for the reply. However, I asked the Mellanox team and they say that RSS will have no effect on performance. What I want to know is: does RSS play a crucial role in running f-stack on multiple cores?

I was able to run seastar on multiple cores with reasonable results, using the same NIC (ConnectX-3).

I'm trying to figure out why RSS is not on, but it would be extremely helpful to find out whether RSS is indeed the issue.

Here is a bit more information, in case it helps: the output of running testpmd in DPDK.

Mellanox ConnectX-3 (multiple cores not working properly)

********************* Infos for port 1  *********************
MAC address: E4:1D:2D:D9:CB:81                               
Connect to socket: 0                                         
memory allocation on the socket: 0                           
Link status: up                                              
Link speed: 10000 Mbps                                       
Link duplex: full-duplex                                     
Promiscuous mode: enabled                                    
Allmulticast mode: disabled                                  
Maximum number of MAC addresses: 127                         
Maximum number of MAC addresses of hash filtering: 0         
VLAN offload:                                                
  strip on                                                   
  filter on                                                  
  qinq(extend) off                                           
No flow type is supported.                                   
Max possible RX queues: 65408                                
Max possible number of RXDs per queue: 65535                 
Min possible number of RXDs per queue: 0                     
RXDs number alignment: 1                                     
Max possible TX queues: 65408                                
Max possible number of TXDs per queue: 65535                 
Min possible number of TXDs per queue: 0                     
TXDs number alignment: 1                                     

Running with the Intel e1000/igb NIC (multiple cores working)

********************* Infos for port 1  ********************* 
MAC address: A0:36:9F:83:AB:BD             
Driver name: net_e1000_igb                 
Connect to socket: 0 
memory allocation on the socket: 0         
Link status: up      
Link speed: 100 Mbps 
Link duplex: full-duplex                   
MTU: 1500            
Promiscuous mode: enabled                  
Allmulticast mode: disabled                
Maximum number of MAC addresses: 32        
Maximum number of MAC addresses of hash filtering: 0          
VLAN offload:        
  strip on           
  filter on          
  qinq(extend) off   
Hash key size in bytes: 40                 
Redirection table size: 128                
Supported flow types:
  ipv4               
  ipv4-tcp           
  ipv4-udp           
  ipv6               
  ipv6-tcp           
  ipv6-udp           
  unknown            
  unknown            
  unknown            
Max possible RX queues: 8                  
Max possible number of RXDs per queue: 4096
Min possible number of RXDs per queue: 32  
RXDs number alignment: 8                   
Max possible TX queues: 8                  
Max possible number of TXDs per queue: 4096
Min possible number of TXDs per queue: 32  
TXDs number alignment: 8                   
SungHoHong2 commented 6 years ago

I have checked how seastar works with multiple cores; here is the relevant part:

  if (smp::count > 1) {
        if (_dev_info.hash_key_size == 40) {
            _rss_key = default_rsskey_40bytes;
        } else if (_dev_info.hash_key_size == 52) {
            _rss_key = default_rsskey_52bytes;
        } else if (_dev_info.hash_key_size != 0) {
            // WTF?!!
            rte_exit(EXIT_FAILURE,
                "Port %d: We support only 40 or 52 bytes RSS hash keys, %d bytes key requested",
                _port_idx, _dev_info.hash_key_size);
        } else {
            _rss_key = default_rsskey_40bytes;
            _dev_info.hash_key_size = 40;
        }

        port_conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
        port_conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_PROTO_MASK;
        if (_dev_info.hash_key_size) {
            port_conf.rx_adv_conf.rss_conf.rss_key = const_cast<uint8_t *>(_rss_key.data());
            port_conf.rx_adv_conf.rss_conf.rss_key_len = _dev_info.hash_key_size;
        }
    } else {
        port_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
    }

and the corresponding f-stack part is:

/* Set RSS mode */
port_conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
port_conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_PROTO_MASK;
port_conf.rx_adv_conf.rss_conf.rss_key = default_rsskey_40bytes;
port_conf.rx_adv_conf.rss_conf.rss_key_len = 40;

I made every DPDK configuration of seastar identical to f-stack's but could not find the reason for the degraded performance.

Has the f-stack team tried a performance test on this controller: Mellanox Technologies MT27500 Family?

At this point seastar and f-stack both enable RSS when they use multiple cores, and the DPDK configuration is the same. However, with the Mellanox driver, the f-stack results degrade as soon as I use more than one core.

Does f-stack prefer a specific Mellanox driver?

daovanhuy commented 6 years ago

As far as I know, the answer to "Does RSS have a crucial role for running f-stack on multiple cores?" is yes. f-stack on multiple cores relies on NIC queues: each core polls packets from one NIC queue. If RSS is disabled, or does not distribute packets by TCP flow, performance will degrade because TCP packets go missing (packet 1 of TCP flow A is sent to NIC queue 0 by RSS and handled by core 0, but packet 2 of flow A may be sent to NIC queue 1 and handled by core 1).
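
To illustrate the property RSS has to provide (just a sketch, not f-stack's actual dispatch code): any deterministic hash of the 4-tuple keeps a flow on one queue, and therefore on one core.

#include <stdint.h>

/* Illustrative only: map every packet of a TCP flow (fixed 4-tuple) to the
 * same queue index. NIC RSS does this job in hardware, typically with a
 * Toeplitz hash. nb_queues must be non-zero. */
static uint16_t flow_to_queue(uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port,
                              uint16_t nb_queues)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16) ^ dst_port;

    h = (h >> 16) ^ (h & 0xffff);      /* fold to mix the bits */
    return (uint16_t)(h % nb_queues);  /* same flow -> same queue -> same core */
}

If the NIC or its PMD does not provide this property, two cores end up seeing interleaved packets of the same TCP flow, which is exactly the situation described above.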

SungHoHong2 commented 6 years ago

Yes, but the problem is that seastar also uses RSS, and its performance increases when I add more cores.

I found out there is a compatibility problem between ConnectX-3 and f-stack, because I have also tried an Intel e1000/igb NIC and that works fine with f-stack.

The reason is that DPDK's dev_info.reta_size in ff_dpdk_if.c is returned as zero with ConnectX-3, and this seems to be the problem, because seastar has an if statement that handles this case.


f-stack - cannot handle the issue when reta_size is returned as zero

       if (dev_info.reta_size) {
            /* reta size must be power of 2 */
            assert((dev_info.reta_size & (dev_info.reta_size - 1)) == 0);

            rss_reta_size[port_id] = dev_info.reta_size;
            printf("port[%d]: rss table size: %d\n", port_id,
                   dev_info.reta_size);
        }

seastar - handles the issue when reta_size is returned as zero

    if (_num_queues > 1) {
        if (_dev_info.reta_size) {
            // RETA size should be a power of 2
            assert((_dev_info.reta_size & (_dev_info.reta_size - 1)) == 0);

            // Set the RSS table to the correct size
            _redir_table.resize(_dev_info.reta_size);
            _rss_table_bits = std::lround(std::log2(_dev_info.reta_size));
            printf("Port %d: RSS table size is %d\n",
                   _port_idx, _dev_info.reta_size);
        } else {
            _rss_table_bits = std::lround(std::log2(_dev_info.max_rx_queues));
        }
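
A fallback in the same spirit might look something like the sketch below for ff_dpdk_if.c. This is hypothetical and untested (nb_rx_queues is a placeholder for however f-stack tracks the number of configured RX queues); whether it is enough to make the multi-core dispatch work on mlx4 is exactly what I am not sure about.

/* Hypothetical, untested sketch: if the PMD reports reta_size == 0, as the
 * ConnectX-3 PMD does here, fall back to the number of configured RX queues
 * instead of leaving the RSS table size at zero (modeled on seastar's else
 * branch above). */
if (dev_info.reta_size) {
    /* reta size must be power of 2 */
    assert((dev_info.reta_size & (dev_info.reta_size - 1)) == 0);
    rss_reta_size[port_id] = dev_info.reta_size;
} else {
    /* no redirection table exposed by the driver */
    rss_reta_size[port_id] = nb_rx_queues;
}
printf("port[%d]: rss table size: %d\n", port_id, rss_reta_size[port_id]);
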
whl739 commented 6 years ago

@SungHoHong2 I don't have a Mellanox NIC, and F-Stack wasn't tested on Mellanox; all we tested were Intel NICs. Could you send us a patch based on seastar's code? Thank you.

SungHoHong2 commented 6 years ago

@whl739 Thanks for the reply. I found the source of the problem.

I found out that the current Mellanox driver supports RSS but does not return an RSS table, so dev_info.reta_size is zero. In that case seastar still creates multiple threads and simply skips the RSS table lookup.

As far as I can tell, f-stack is hard-coded to depend on the RSS table for distributing work across cores. The problem I am facing could be solved if there were a way to just increase the number of cores and skip the RSS table lookup.

Would that be possible in F-stack?

P.S. I don't understand what you mean by a patch. Could you define it in more detail? I can show you how seastar avoids using RSS tables and still creates multiple threads.

whl739 commented 6 years ago

A patch can be just a pull request. I took a look at seastar's code, and I might have found the right approach. I'll make some changes; could you help me test them?

whl739 commented 6 years ago

> I took a look at seastar's code, and I might have found the right approach. I'll make some changes; could you help me test them?

Sorry, after I took a closer look at seastar's code, I found that _rss_table_bits has no effect on packet receiving with multiple cores. There must be another reason, and I have no idea currently. I'll continue to read seastar's code; if you have any idea, please let me know.

SungHoHong2 commented 6 years ago

Yes, but that is what I need: skip the RSS table when RSS is not available. This should still increase performance because it uses multiple threads. I am also looking into seastar. Thanks!