aregm / nff-go

NFF-Go - Network Function Framework for Go (formerly YANFF)
BSD 3-Clause "New" or "Revised" License

Can't run v0.8.1 on both AWS and VMWare - results in RTE_HASH PANIC #649

Closed guesslin closed 4 years ago

guesslin commented 5 years ago

Hi, we just upgraded nff-go to 0.8.1, but it fails on both EC2 and VMWare with this panic message:

EAL: RTE_HASH tailq is already registered
PANIC in tailqinitfn_rte_hash_tailq():
Cannot initialize tailq: RTE_HASH
6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43ed5a]]
5: [/lib64/libc.so.6(__libc_start_main+0x85) [0x7f6995e42425]]
4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x183890d]]
3: [/opt/glasnostic/bin/router() [0x43e19c]]
2: [/opt/glasnostic/bin/router(__rte_panic+0xba) [0x43130e]]
1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x148d2b8]]
Aborted

We also tried downgrading to the v0.8.0 tag, and everything works fine with it.

gshimansky commented 5 years ago

Can you clarify which VM image you are using? I tried v0.8.1 on m5a.4xlarge AWS Ubuntu 18.04 with kernel 4.15.0-1032-aws and ENA NICs and everything works as expected.

gshimansky commented 5 years ago

I also updated kernel and tested on 4.15.0-1048-aws.

Did you update DPDK when you switched between NFF-Go versions?

guesslin commented 5 years ago

@gshimansky we are trying to run it on m5.xlarge with our customized AMI, which has kernel 4.4.0-142-generic and ENA NICs. Maybe it's a kernel version problem?

gshimansky commented 5 years ago

It could be the kernel version, although grepping the DPDK sources for this error doesn't turn up any kernel module sources. I am quite sure that the key difference between 0.8.0 and 0.8.1 is the DPDK version, and that something inside DPDK 19.08 stopped working in your environment. NFF-Go 0.8.0 used DPDK 19.04.

guesslin commented 4 years ago

@gshimansky I just updated the kernel to 4.9.184-0409184-generic, which is the same kernel we build the binary on, but it still fails with the same panic message:

EAL: RTE_HASH tailq is already registered
PANIC in tailqinitfn_rte_hash_tailq():
Cannot initialize tailq: RTE_HASH
6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43ed5a]]
5: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x7f) [0x7fd2cf7007bf]]
4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x18388ed]]
3: [/opt/glasnostic/bin/router() [0x43e19c]]
2: [/opt/glasnostic/bin/router(__rte_panic+0xba) [0x43130e]]
1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x148d298]]

gshimansky commented 4 years ago

Can you also try a different gcc version? Looking at the DPDK sources, I suppose it could be a compiler bug.

guesslin commented 4 years ago

@gshimansky hi, we tried compiling the binary in the following environment:

compiler: gcc 5.4.0
nff-go: 0.9.1
DPDK: 19.08
go: go1.10.8 linux/amd64
kernel: 4.9

but our binary still fails with the RTE_HASH problem:

Oct 04 03:04:23 ip-10-1-218-18 router[5703]: EAL: RTE_HASH tailq is already registered
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: PANIC in tailqinitfn_rte_hash_tailq():
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: Cannot initialize tailq: RTE_HASH
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43dc9a]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 5: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x7a) [0x7f70d0171afa]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x182005d]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 3: [/opt/glasnostic/bin/router() [0x43d002]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 2: [/opt/glasnostic/bin/router(__rte_panic+0xb8) [0x430b7f]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x1488948]]

gshimansky commented 4 years ago

Is there a reason to use such an old gcc version? It was released on June 3, 2016, which is more than 3 years ago. Other than that, I still think that this is some problem with DPDK conflicting with your setup. However, I could only find a few reports of errors like this, with no guide on how to fix them.

marcusschiesser commented 4 years ago

We actually also tried GCC (Debian 8.3.0-6) 8.3.0 before, with the same error.

Then we checked that DPDK 19.08 uses gcc 5.4.0 in their CI, see: https://github.com/DPDK/dpdk/blob/v19.08/.travis.yml https://docs.travis-ci.com/user/reference/xenial/#compilers-and-build-toolchain

So I guess it's not a GCC-related problem.

gshimansky commented 4 years ago

Do you experience this problem only on your customized AMI? I need to find some configuration where I could reproduce this problem.

gshimansky commented 4 years ago

I tried Amazon Linux 2 AMI (HVM), SSD Volume Type - ami-00c03f7f7f2ec15c3 (64-bit x86) with kernel 4.14.146-119.123.amzn2.x86_64 and gcc Red Hat 7.3.1-6, but DPDK initializes correctly.

guesslin commented 4 years ago

Yes, we have this problem on our customized AMI. The kernel is Linux ip-10-1-218-18 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux, based on ami-1ee65166 config-4.4.0-142-generic

aregm commented 4 years ago

If I am not mistaken, some functionality is missing in principle from the 4.4 kernel. Have you tried customizing on top of 4.14.146-119.123.amzn2.x86_64?

gshimansky commented 4 years ago

I believe that the problem happens in userspace DPDK code. I cannot say whether it is related to the kernel version, but in my understanding it could be related only through the kernel headers, not through kernel code directly. I still need some way to reproduce it in order to debug it.

guesslin commented 4 years ago

@gshimansky While updating our application to use nff-go v0.9.2, we found that this problem is caused by the following patch in our code:

index f13d151b7..944269b1e 100644
--- a/gateway/router/driver/nff/runner_linux.go
+++ b/gateway/router/driver/nff/runner_linux.go
@@ -2,6 +2,10 @@

 package nff

+/*
+#include <rte_ethdev.h>
+*/
+import "C"
 import (
        "fmt"
        "net"
@@ -16,7 +20,6 @@ import (

        "github.com/intel-go/nff-go/devices"
        "github.com/intel-go/nff-go/flow"
-       "github.com/intel-go/nff-go/low"
        libpacket "github.com/intel-go/nff-go/packet"
 )

 func getEthPort(hwaddr net.HardwareAddr) portType {
-       for p := 0; p < low.GetPortsNumber(); p++ {
-               portMACAddress := low.GetPortMACAddress(portType(p))
+       for p := 0; p < int(C.rte_eth_dev_count()); p++ {
+               portMACAddress := flow.GetPortMACAddress(portType(p))

Without this patch it's working, so the problem is caused by including rte_ethdev.h.

As you can see, we want to get the number of device ports by calling rte_eth_dev_count(). In version 0.8.0 we didn't have to do this, because the value was exported by the low package. That package has since been moved to internal/low, so we can't access it anymore.

How about creating a flow.GetPortsNumber() function, so we can get the correct number of device ports again?
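
To illustrate the idea, here is a minimal sketch of the MAC lookup loop once the port count comes from the framework instead of from cgo. It does not import nff-go: `count` and `macOf` are hypothetical stand-ins for the proposed flow.GetPortsNumber() and for flow.GetPortMACAddress(), and the `portType` definition, the use of net.HardwareAddr as the MAC type, and the `fakePorts` table are assumptions made only so that the example runs on its own.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// portType mirrors the name used in the patch above; its underlying type
// is an assumption made for this sketch.
type portType uint16

// fakePorts is a hypothetical in-memory port table that stands in for the
// real devices so the sketch runs without DPDK.
type fakePorts []net.HardwareAddr

// mac returns the MAC address stored for port p in the fake table.
func (f fakePorts) mac(p portType) net.HardwareAddr { return f[p] }

// findPortByMAC scans ports 0..count-1 and returns the first port whose MAC
// address equals hwaddr. In real code, count would come from the proposed
// flow.GetPortsNumber() and macOf would wrap flow.GetPortMACAddress();
// both are injected here, so no cgo include of rte_ethdev.h is needed.
func findPortByMAC(hwaddr net.HardwareAddr, count int, macOf func(portType) net.HardwareAddr) (portType, bool) {
	for p := 0; p < count; p++ {
		if bytes.Equal(macOf(portType(p)), hwaddr) {
			return portType(p), true
		}
	}
	return 0, false
}

func main() {
	ports := fakePorts{
		{0x02, 0x00, 0x00, 0x00, 0x00, 0x01},
		{0x02, 0x00, 0x00, 0x00, 0x00, 0x02},
	}
	// Look up the port that owns the second MAC address.
	port, ok := findPortByMAC(ports[1], len(ports), ports.mac)
	fmt.Println(port, ok)
}
```

The point of the sketch is only that the loop bound is obtained through the Go API, so the application itself never has to include DPDK headers.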

guesslin commented 4 years ago

@gshimansky I created https://github.com/intel-go/nff-go/pull/680 for this, please have a look :)

gshimansky commented 4 years ago

It is great that you found the cause of this bug! I merged your PR and hope it will allow you to use the latest version of the framework.