cisco-system-traffic-generator / trex-core


"PANIC in rte_eth_dev_shared_data_prepare" beginning with release v2.84 #922

Open norg opened 1 year ago

norg commented 1 year ago

Hi,

I'm currently trying to upgrade an environment that runs TRex. I used v2.81 for quite some time without issues, but I wanted to test the initial support for E810 NICs, so I upgraded to v3.00. With that version even my old NICs and configs no longer started. I then checked which release was the last working one and which was the first broken one: v2.83 works for me and v2.84 is the first that fails. I didn't find a major change between the two that might be related.

This is the new error I run into:

./t-rex-64 -f cap2/sfr.yaml -c 4 -m 9  -d 28800000 -p --cfg ../v2.81/trex_cfg_port1.yaml
WARNING: i40e interface 0000:42:00.0 is under DPDK driver and might interfere with current TRex interfaces.
The ports are bound/configured.
Starting  TRex v2.84 please wait  ... 
PANIC in rte_eth_dev_shared_data_prepare():
Cannot allocate ethdev shared data
13: [./_t-rex-64(+0x130651) [0x5623d8989651]]
12: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f4bbd6a809b]]
11: [./_t-rex-64(_Z9main_testiPPc+0x1f4) [0x5623d896ea24]]
10: [./_t-rex-64(rte_eal_init+0x131f) [0x5623d8cc52df]]
9: [./_t-rex-64(rte_bus_probe+0x4e) [0x5623d8ca89ce]]
8: [./_t-rex-64(rte_pci_probe+0x60) [0x5623d8c05300]]
7: [./_t-rex-64(+0x3ac1c8) [0x5623d8c051c8]]
6: [./_t-rex-64(+0x4e806f) [0x5623d8d4106f]]
5: [./_t-rex-64(rte_eth_dev_create+0x42) [0x5623d8ce2432]]
4: [./_t-rex-64(rte_eth_dev_allocate+0x2f) [0x5623d8cd46ef]]
3: [./_t-rex-64(+0x47a1d0) [0x5623d8cd31d0]]
2: [./_t-rex-64(__rte_panic+0xba) [0x5623d89701fb]]
1: [./_t-rex-64(rte_dump_stack+0x18) [0x5623d8cc5d28]]
./t-rex-64: line 100:  5853 Aborted                 ./_$(basename $0) $INPUT_ARGS $EXTRA_INPUT_ARGS

This worked like a charm with v2.83.

The trex config is:

### Config file generated by dpdk_setup_ports.py ###

- version: 2
  interfaces: ['42:00.1', 'dummy']
  zmq_pub_port: 4500
  zmq_rpc_port: 4501
  prefix: setup1
  port_bandwidth_gb: 40
  limit_memory: 9000
  platform:
      master_thread_id: 2
      latency_thread_id: 4
      dual_if:
       - socket: 1
         threads: [5,7,9,11]
         #threads: [5,7,9,11,13,15,17]

DPDK setup output:

./dpdk_setup_ports.py -s

Network devices using DPDK-compatible driver
============================================
0000:42:00.0 'Ethernet Controller XL710 for 40GbE QSFP+' drv=igb_uio unused=i40e,vfio-pci,uio_pci_generic
0000:42:00.1 'Ethernet Controller XL710 for 40GbE QSFP+' drv=igb_uio unused=i40e,vfio-pci,uio_pci_generic

So it is a rather basic setup. The OS is Debian Buster with kernel 4.19. (I tried a kernel update to 5.10, but then I ran into another issue: all hugepages were consumed and TRex ran out of memory while starting. I might create a dedicated issue for that as well.)

Based on the v2.84 release notes I can't see what causes this error. I also tried some later versions (2.85, 2.90, 2.99 and 3.00), but they all stop with the same PANIC.

Does anyone have an idea what might cause this?

norg commented 6 months ago

I tried v3.03 again today on that system, and it turns out this option was the issue:

 port_bandwidth_gb: 40

Once I remove it I don't have any errors anymore, and the complaint about hugepages is gone as well. Once I add the option back in, the hugepages error shows up again:

 ERROR there is not enough huge-pages memory in your system
EAL: Error - exiting with code: 1
  Cause: Cannot init nodes mbuf pool nodes-0
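
For reference, this is roughly what the working config looks like: a sketch of the config from the first comment with only the `port_bandwidth_gb` line disabled. All other values are copied unchanged from that config. Whether dropping the option is appropriate for other NICs or memory layouts is an assumption and has not been verified here.

```yaml
### Sketch: config from the first comment with port_bandwidth_gb disabled ###

- version: 2
  interfaces: ['42:00.1', 'dummy']
  zmq_pub_port: 4500
  zmq_rpc_port: 4501
  prefix: setup1
  # port_bandwidth_gb: 40   # disabled: with this line present, startup failed on this system (v3.03)
  limit_memory: 9000
  platform:
      master_thread_id: 2
      latency_thread_id: 4
      dual_if:
       - socket: 1
         threads: [5,7,9,11]
```

Commenting the line out instead of deleting it makes it easy to add the option back and reproduce the error.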

In the docs it's still listed as optional and described as:

    The bandwidth of each interface in Gbs. In this example we have 10Gbs interfaces. For VM, put 1. Used to tune the amount of memory allocated by TRex.

The last sentence made me wonder: is this option not really used anymore, and therefore buggy?