aregm / nff-go

NFF-Go -Network Function Framework for GO (former YANFF)
BSD 3-Clause "New" or "Revised" License

Failed to run NFF-go in AWS EC2 with ena driver #718

Open guesslin opened 3 years ago

guesslin commented 3 years ago

Hi, I can't run nff-go on an AWS EC2 instance. DPDK reports a port initialization failure with the ENA driver.

Oct 08 02:56:02 ip-172-31-41-87 router[18195]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a product of 1
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: ERROR: Cannot init port  0 !
Oct 08 02:56:02 ip-172-31-41-87 router[18200]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a 
Full message
```
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: ------------***-------- Initializing DPDK --------***------------
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Detected 2 lcore(s)
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Detected 1 NUMA nodes
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Selected IOVA mode 'PA'
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: No available hugepages reported in hugepages-1048576kB
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Probing VFIO support...
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: Probing VFIO support...
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Invalid NUMA socket, default to 0
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: Invalid NUMA socket, default to 0
Oct 08 02:56:01 ip-172-31-41-87 router[18195]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: Invalid NUMA socket, default to 0
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: Invalid NUMA socket, default to 0
Oct 08 02:56:01 ip-172-31-41-87 router[18200]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: PMD: Placement policy: Regular
Oct 08 02:56:02 ip-172-31-41-87 router[18200]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 08 02:56:02 ip-172-31-41-87 router[18200]: PMD: Placement policy: Regular
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: ------------***------ Initializing scheduler -----***------------
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: DEBUG: Scheduler can use cores: [0 1]
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: ------------***---------- Creating ports ---------***------------
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a product of 1
Oct 08 02:56:02 ip-172-31-41-87 router[18195]: ERROR: Cannot init port 0 !
Oct 08 02:56:02 ip-172-31-41-87 router[18200]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a product of 1
```

I checked, and this error message is generated by https://github.com/DPDK/dpdk/blob/main/lib/librte_ethdev/rte_ethdev.c#L2019-L2034. The value comes from nff-go/internal/low/low.h, which sets nb_tx_desc to 2048.
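
For reference, the check behind this message can be paraphrased as below. This is a minimal Go sketch, not DPDK's C code: the `descLimits` type and the `validateTxDesc` function are hypothetical names, and the limit values (maximum 1024, minimum 128, alignment 1) are taken from the error message above.

```go
package main

import "fmt"

// descLimits is a hypothetical stand-in for the per-device descriptor
// limits that a DPDK driver reports (maximum, minimum, alignment).
type descLimits struct {
	Max   uint16
	Min   uint16
	Align uint16
}

// validateTxDesc reports whether a requested TX descriptor count is
// acceptable: it must lie within [Min, Max] and be a multiple of Align.
func validateTxDesc(n uint16, lim descLimits) error {
	if n > lim.Max || n < lim.Min || (lim.Align != 0 && n%lim.Align != 0) {
		return fmt.Errorf("invalid nb_tx_desc %d: want <= %d, >= %d, multiple of %d",
			n, lim.Max, lim.Min, lim.Align)
	}
	return nil
}

func main() {
	// Limits as printed in the log for the ENA port.
	ena := descLimits{Max: 1024, Min: 128, Align: 1}

	fmt.Println(validateTxDesc(2048, ena)) // rejected: above Max
	fmt.Println(validateTxDesc(1024, ena)) // accepted: prints <nil>
}
```

With these limits, 2048 fails only the upper bound; an alignment of 1 means every value in range passes the "product of" test.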

I tried reducing TX_RING_SIZE to 1024. The port then initializes, but I get a different warning and still can't process packets from the DPDK flow.

Oct 08 04:55:31 ip-172-31-41-87 router[20230]: WARNING: Can't start new clone for segment1 instance 0
Full message
```
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: ------------***-------- Initializing DPDK --------***------------
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Detected 2 lcore(s)
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Detected 1 NUMA nodes
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Selected IOVA mode 'PA'
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: No available hugepages reported in hugepages-1048576kB
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Probing VFIO support...
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: Probing VFIO support...
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Invalid NUMA socket, default to 0
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: Invalid NUMA socket, default to 0
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: Invalid NUMA socket, default to 0
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: Invalid NUMA socket, default to 0
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: PMD: Placement policy: Regular
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 08 04:55:31 ip-172-31-41-87 router[20235]: PMD: Placement policy: Regular
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: ------------***------ Initializing scheduler -----***------------
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Scheduler can use cores: [0 1]
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: ------------***---------- Creating ports ---------***------------
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Port 0 MAC address: 06:10:b8:ab:99:db
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: ------------***------ Starting FlowFunctions -----***------------
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Start SCHEDULER at 0 core
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Start STOP at scheduler 0 core
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Start new instance for receiverPort
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: 1
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Start new clone for receiverPort
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: 1 instance 0 at 1 core
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: DEBUG: Start new instance for segment1
Oct 08 04:55:31 ip-172-31-41-87 router[20230]: WARNING: Can't start new clone for segment1 instance 0
```
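
One possible reading of this log, offered as an assumption rather than a confirmed diagnosis: the instance exposes two cores, the scheduler takes core 0, the receiver takes core 1, and nothing is left for `segment1`. The toy model below illustrates that reading. `assignCores` is a hypothetical helper and does not reproduce nff-go's actual scheduler.

```go
package main

import "fmt"

// assignCores is a toy model of core placement: the first core is
// reserved for the scheduler, and each flow function then needs one
// of the remaining cores. Functions that find no free core are
// returned in unplaced.
func assignCores(cores []int, funcs []string) (placed map[string]int, unplaced []string) {
	placed = make(map[string]int)
	if len(cores) == 0 {
		return placed, funcs
	}
	free := cores[1:] // cores[0] is kept for the scheduler
	for _, f := range funcs {
		if len(free) == 0 {
			unplaced = append(unplaced, f)
			continue
		}
		placed[f] = free[0]
		free = free[1:]
	}
	return placed, unplaced
}

func main() {
	funcs := []string{"receiver", "segment1"}

	// Two cores, as in "Scheduler can use cores: [0 1]".
	_, unplaced := assignCores([]int{0, 1}, funcs)
	fmt.Println("2 cores, unplaced:", unplaced)

	// Four cores leave room for both functions in this model.
	_, unplaced = assignCores([]int{0, 1, 2, 3}, funcs)
	fmt.Println("4 cores, unplaced:", unplaced)
}
```

Under this model, only an instance with more than two cores can place both the receiver and `segment1`.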

Here's the information about the environment.

driver: ena
version: 2.2.10g
firmware-version:
expansion-rom-version:
bus-info: 0000:00:06.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 172.31.41.87  netmask 255.255.240.0  broadcast 172.31.47.255
        inet6 fe80::4f9:c1ff:fee6:a36f  prefixlen 64  scopeid 0x20<link>
        ether 06:f9:c1:e6:a3:6f  txqueuelen 1000  (Ethernet)
        RX packets 146118  bytes 189039782 (189.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 49107  bytes 4985206 (4.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 172.31.47.232  netmask 255.255.240.0  broadcast 172.31.47.255
        inet6 fe80::410:b8ff:feab:99db  prefixlen 64  scopeid 0x20<link>
        ether 06:10:b8:ab:99:db  txqueuelen 1000  (Ethernet)
        RX packets 192  bytes 15232 (15.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 82  bytes 4244 (4.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3404  bytes 254476 (254.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3404  bytes 254476 (254.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)

gshimansky commented 3 years ago

It looks like the ENA driver doesn't support this number of TX rings. You can try passing an appropriate value in the arguments of SystemInit: https://github.com/intel-go/nff-go/blob/7ff09bf9d84c823f55fc99f770be6ea7ceeedb1c/flow/flow.go#L587

guesslin commented 3 years ago

Hi @gshimansky, I tried passing different values for TXQueuesNumberPerPort in flow.Config (0 ~ 4), but the problem persists.

config := &flow.Config{
        HWTXChecksum:          true,
        TXQueuesNumberPerPort: values,        // XXX: tried [0-8], but didn't change the error message
}
flow.SystemInit(config)

journal logs
```
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: ------------***-------- Initializing DPDK --------***------------
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Detected 2 lcore(s)
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Detected 1 NUMA nodes
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Selected IOVA mode 'PA'
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: No available hugepages reported in hugepages-1048576kB
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Probing VFIO support...
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: Probing VFIO support...
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Invalid NUMA socket, default to 0
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: Invalid NUMA socket, default to 0
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: PCI device 0000:00:05.0 on NUMA socket -1
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: Invalid NUMA socket, default to 0
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: PCI device 0000:00:06.0 on NUMA socket -1
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: Invalid NUMA socket, default to 0
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: EAL: probe driver: 1d0f:ec20 net_ena
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: PMD: Placement policy: Regular
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: PMD: LLQ is not supported. Fallback to host mode policy.
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: PMD: Placement policy: Regular
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: ------------***------ Initializing scheduler -----***------------
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: DEBUG: Scheduler can use cores: [0 1]
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: ------------***---------- Creating ports ---------***------------
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a product of 1
Oct 13 06:26:05 ip-172-31-41-87 router[2386]: ERROR: Cannot init port 0 !
Oct 13 06:26:05 ip-172-31-41-87 router[2391]: Invalid value for nb_tx_desc(=2048), should be: <= 1024, >= 128, and a product of 1
```

If I set TXQueuesNumberPerPort to 1024, I get a different warning, this time about the number of TX queues.

Oct 13 07:04:23 ip-172-31-41-87 router[3394]: Warning! Port 0 does not support requested number of TX queues 1024. Setting number of TX queues to 8

The problem seems to be that the ENA driver's TX ring can't support the value (2048) that nff-go passes to DPDK: https://github.com/intel-go/nff-go/blob/v0.9.2/internal/low/low.h#L37-L39

Is there any way I can configure the TX_RING_SIZE correctly?
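
A general way to avoid a hard-coded ring size is to clamp the requested value to the limits the device reports before setting up the queue. DPDK offers `rte_eth_dev_adjust_nb_rx_tx_desc` for this in C. The Go sketch below only illustrates the arithmetic: `ringLimits` and `clampRingSize` are hypothetical names, not part of nff-go or DPDK, and the limits used in `main` are the ones from the error message.

```go
package main

import "fmt"

// ringLimits is a hypothetical stand-in for the descriptor limits a
// device reports. Max is assumed to be a multiple of Align.
type ringLimits struct {
	Max   uint16
	Min   uint16
	Align uint16
}

// clampRingSize rounds the requested size up to a multiple of Align,
// then forces it into the range [Min, Max].
func clampRingSize(requested uint16, lim ringLimits) uint16 {
	n := uint32(requested) // widen to avoid overflow while rounding
	if a := uint32(lim.Align); a > 1 {
		n = (n + a - 1) / a * a
	}
	if n > uint32(lim.Max) {
		n = uint32(lim.Max)
	}
	if n < uint32(lim.Min) {
		n = uint32(lim.Min)
	}
	return uint16(n)
}

func main() {
	ena := ringLimits{Max: 1024, Min: 128, Align: 1}
	fmt.Println(clampRingSize(2048, ena)) // 1024
}
```

Whether nff-go can apply such an adjustment without a source patch is not established in this thread.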

Update

I tried again: I set TX_RING_SIZE with a patch and changed the EC2 instance type from t3.large to t3.xlarge, and now I can run nff-go.

Performance with iperf3
```
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.98 GBytes  2.56 Gbits/sec    5             sender
[  4]   0.00-10.00  sec  2.98 GBytes  2.56 Gbits/sec                  receiver
```