esnet / zeek-dpdk


worker crashed when lb_procs > 1 #2

Open FANGOD opened 2 years ago

FANGOD commented 2 years ago

Hi,

My Zeek workers crash whenever lb_procs is greater than 1.

Versions: zeek-4.1.1, dpdk-18.11

node.cfg

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-eth5]
type=worker
host=localhost
# Change eth0 to match your capture interface
interface=dpdk::eth5
# Change based on the number of cores you want to dedicate to worker processes
lb_procs=4  # 1: works well, >1: crashed
lb_method=custom

With lb_procs = 1, everything works well.

With lb_procs = 2:

EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_117710_196f346d3e76758
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Probe PCI driver: net_i40e (8086:37d2) device: 0000:3d:00.1 (socket 0)
EAL: No legacy callbacks, legacy socket not created
Configuring DPDK port 0, queue 0/3
RING: Cannot reserve memory
EAL: Error - exiting with code: 1
  Cause: Cannot create receive ring: File exists

EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: Probe PCI driver: net_i40e (8086:37d2) device: 0000:3d:00.1 (socket 0)
EAL: No legacy callbacks, legacy socket not created
Monitoring DPDK port 0, queue 1, core 0

With lb_procs = 4:

Zeek 4.1.1
Linux 3.10.0-1062.9.1.el7.x86_64

Zeek plugins:
ESnet::DPDK - DPDK packet source plugin (dynamic, version 0.1.0)
Seiso::Kafka - Writes logs to Kafka (dynamic, version 0.3.0)

Core file: core.162604
...
[New LWP 162721]
[New LWP 162707]
[New LWP 162703]
[New LWP 162708]
[New LWP 162725]
[New LWP 162704]
[New LWP 162730]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/zeek/bin/zeek -i dpdk::eth5 -U .status -p zeekctl -p zeekctl-live -p local'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007faf8206788b in i40e_dev_stats_get () from /usr/local/lib64/dpdk/pmds-21.0/librte_net_i40e.so.21.0

Thread 52 (Thread 0x7faed4ff9700 (LWP 162730)):
#0  0x00007fafad1939dd in accept () from /lib64/libpthread.so.0
#1  0x00007fafa9eb864b in socket_listener () from /usr/local/lib64/librte_telemetry.so.21
#2  0x00007fafad18cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fafac6959fd in clone () from /lib64/libc.so.6

...

Thread 3 (Thread 0x7fafa1ffb700 (LWP 162630)):
#0  0x00007fafad190a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fafacf2eaec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000e6917f in caf::detail::private_thread::await() ()
#3  0x0000000000e69258 in caf::detail::private_thread::run(caf::actor_system*) ()
#4  0x0000000000e69302 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::thread caf::actor_system::launch_thread<caf::detail::private_thread::launch(caf::actor_system*)::{lambda()#1}>(char const*, caf::detail::private_thread::launch(caf::actor_system*)::{lambda()#1})::{lambda(char const*)#1}, caf::intrusive_ptr<caf::ref_counted> > > >::_M_run() ()
#5  0x0000000000f4864f in execute_native_thread_routine ()
#6  0x00007fafad18cea5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fafac6959fd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7faf71259700 (LWP 162694)):
#0  0x00007fafad19375d in read () from /lib64/libpthread.so.0
#1  0x00007fafaa9b2b20 in eal_thread_loop () from /usr/local/lib64/librte_eal.so.21
#2  0x00007fafad18cea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fafac6959fd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fafae9d93c0 (LWP 162604)):
#0  0x00007faf8206788b in i40e_dev_stats_get () from /usr/local/lib64/dpdk/pmds-21.0/librte_net_i40e.so.21.0
#1  0x00007fafaa6e36ba in rte_eth_stats_get () from /usr/local/lib64/librte_ethdev.so.21
#2  0x00007fafae9f328a in zeek::iosource::DPDK::Statistics (this=<optimized out>, stats=0x7ffef3c1a0c0) at /home/lianpengcheng/work/zeek-dpdk/src/DPDK.cc:376
#3  0x000000000089cc61 in zeek::BifFunc::get_net_stats_bif(zeek::detail::Frame*, std::vector<zeek::IntrusivePtr<zeek::Val>, std::allocator<zeek::IntrusivePtr<zeek::Val> > > const*) ()
#4  0x00000000008a7920 in zeek::detail::BuiltinFunc::Invoke(std::vector<zeek::IntrusivePtr<zeek::Val>, std::allocator<zeek::IntrusivePtr<zeek::Val> > >*, zeek::detail::Frame*) const ()
#5  0x000000000086c0fc in zeek::detail::CallExpr::Eval(zeek::detail::Frame*) const ()
#6  0x000000000086bba6 in zeek::detail::eval_list(zeek::detail::Frame*, zeek::detail::ListExpr const*) ()
#7  0x000000000086bedd in zeek::detail::ScheduleExpr::Eval(zeek::detail::Frame*) const ()
#8  0x000000000091061c in zeek::detail::ExprStmt::Exec(zeek::detail::Frame*, zeek::detail::StmtFlowType&) ()
#9  0x0000000000910597 in zeek::detail::IfStmt::DoExec(zeek::detail::Frame*, zeek::Val*, zeek::detail::StmtFlowType&) ()
#10 0x000000000091063c in zeek::detail::ExprStmt::Exec(zeek::detail::Frame*, zeek::detail::StmtFlowType&) ()
#11 0x00000000009111c2 in zeek::detail::StmtList::Exec(zeek::detail::Frame*, zeek::detail::StmtFlowType&) ()
#12 0x00000000008b07a9 in zeek::detail::ScriptFunc::Invoke(std::vector<zeek::IntrusivePtr<zeek::Val>, std::allocator<zeek::IntrusivePtr<zeek::Val> > >*, zeek::detail::Frame*) const ()
#13 0x00000000008551ae in zeek::EventHandler::Call(std::vector<zeek::IntrusivePtr<zeek::Val>, std::allocator<zeek::IntrusivePtr<zeek::Val> > >*, bool) ()
#14 0x0000000000854426 in zeek::Event::Dispatch(bool) ()
#15 0x0000000000854974 in zeek::EventMgr::Drain() ()
#16 0x0000000000827a61 in zeek::detail::setup(int, char**, zeek::Options*) ()
#17 0x0000000000722ca1 in main ()

EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_162604_1970a432dd5f740
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Probe PCI driver: net_i40e (8086:37d2) device: 0000:3d:00.1 (socket 0)
EAL: No legacy callbacks, legacy socket not created
Configuring DPDK port 0, queue 0/5
listening on eth5

EAL: Probe PCI driver: net_i40e (8086:37d2) device: 0000:3d:00.1 (socket 0)
EAL: Fail to recv reply for request /var/run/dpdk/rte/mp_socket_162604_1970a432dd5f740:mp_malloc_sync
EAL: Could not send sync request to secondary process
HASH: memory allocation failed
i40e_init_fdir_filter_list(): Failed to create fdir hash table!
ethdev initialisation failed
EAL: Requested device 0000:3d:00.1 cannot be used
EAL: No legacy callbacks, legacy socket not created
fatal error: Error: no ports found
...

EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_162610_1970a45cce3f312
EAL: failed to send to (/var/run/dpdk/rte/mp_socket) due to Connection refused
EAL: Fail to send request /var/run/dpdk/rte/mp_socket:bus_vdev_mp
vdev_scan(): Failed to request vdev from primary
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Cannot find resource for device
EAL: No legacy callbacks, legacy socket not created
fatal error: Error: no ports found

My guess is that the memzone reservation (memzone_reserve) is going wrong, but I couldn't find anything that confirms it.
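To make that guess concrete: "File exists" is the standard text for EEXIST, which is what I would expect if two workers each try to create a ring under the same name. Below is a minimal sketch of that failure mode as a toy Python model. It is not the DPDK API and not the plugin's code. `RingRegistry`, `create_or_lookup`, `ring_name`, and the ring name `rx_ring` are all hypothetical names chosen for illustration, and whether the plugin actually reuses one name across workers is an assumption on my part.

```python
import errno
import os


class RingRegistry:
    """Toy stand-in for a shared, name-keyed object table (NOT the DPDK API)."""

    def __init__(self):
        self._rings = {}

    def create(self, name, size):
        # Creating a name that is already registered fails with EEXIST,
        # which is reported to the user as "File exists".
        if name in self._rings:
            raise FileExistsError(errno.EEXIST, os.strerror(errno.EEXIST), name)
        self._rings[name] = [None] * size
        return self._rings[name]

    def lookup(self, name):
        # Returns the existing ring, or None if the name is unknown.
        return self._rings.get(name)


def create_or_lookup(registry, name, size):
    """Attach to an existing ring if present, otherwise create it."""
    ring = registry.lookup(name)
    return ring if ring is not None else registry.create(name, size)


def ring_name(port, queue):
    """Hypothetical per-queue naming scheme so workers never collide."""
    return f"rx_ring_p{port}_q{queue}"


registry = RingRegistry()
registry.create("rx_ring", 1024)        # first worker succeeds
try:
    registry.create("rx_ring", 1024)    # second worker reuses the same name
    second_error = None
except FileExistsError as exc:
    second_error = exc.strerror         # the EEXIST message text

print(second_error)
```

If this is indeed what happens, either approach in the sketch would avoid the collision: looking the ring up before creating it, or giving each queue its own name.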

Any useful information would be greatly appreciated.

grigorescu commented 2 years ago

I think I've fixed this; can you try the latest version?

Thanks!