cncf / cnf-testbed

ARCHIVED: 🧪🛏️Cloud-native Network Function (CNF) Testbed --> See LFN Cloud Native Telecom Initiative https://wiki.lfnetworking.org/pages/viewpage.action?pageId=113213592
https://wiki.lfnetworking.org/pages/viewpage.action?pageId=113213592
Apache License 2.0
164 stars 51 forks source link

NSM install fails after fresh cnf testbed setup #349

Closed nicoekkart closed 4 years ago

nicoekkart commented 4 years ago

Hi

I'm setting up a kubernetes cluster using the cnf-testbed on my own virtual test network. My network consists of 3 nodes, a master and a PC, all connected through an overlay vlan. I successfully run make k8s and the k8s cluster seems to function correctly. After following the install instructions for nsm in examples/workload-infra/nsm-k8s/, the vpp-forwarders end up in a crashloopbackoff. I have tried to replicate this setup and constantly get the same outcome: Output of kubectl get pods

nsm-admission-webhook-6cb9df6d7d-hwgxr   1/1     Running            0          136m
nsm-vpp-forwarder-dm2b6                  0/1     CrashLoopBackOff   30         136m
nsm-vpp-forwarder-fc8t8                  0/1     CrashLoopBackOff   30         136m
nsm-vpp-forwarder-qmgnh                  0/1     CrashLoopBackOff   30         136m
nsmgr-9mssd                              2/3     Running            0          136m
nsmgr-f4tc9                              2/3     Running            0          136m
nsmgr-glr6j                              2/3     Running            0          136m

nsmd in nsmgr is not starting because of "waiting for forwarder".

Log of nsm-vpp-forwarder-dm2b6

time="2020-05-10 14:48:48.55223" level=info msg="Starting agent version: v0.0.0-dev" BuildDate= CommitHash= loc="agent/agent.go(134)" logger=agent
time="2020-05-10 14:48:48.55408" level=info msg="Agent started with 2 plugins (took 1ms)" loc="agent/agent.go(179)" logger=agent
ERROR: This binary requires CPU with SSE4.2 extensions.
time="2020-05-10T14:48:48Z" level=info msg="Starting vppagent-forwarder..."
time="2020-05-10T14:48:48Z" level=info msg="Version: 91120f7e"
time="2020-05-10T14:48:48Z" level=info msg="Creating logger from config: &{vppagent-forwarder@node3 false false [] 0xc00008f440 0xc000071bc0 <nil> <nil> <nil>}"
2020/05/10 14:48:48 Initializing logging reporter
time="2020-05-10T14:48:48Z" level=info msg="==--> Start.VPPAgent.Forwarder() span:5a2063ea609c548d:5a2063ea609c548d:0:1"
time="2020-05-10T14:48:48Z" level=info msg="==----> CreateForwarder() span:5a2063ea609c548d:2102f81f101a692c:5a2063ea609c548d:1"
time="2020-05-10T14:48:48Z" level=info msg="==------> createForwartderConfig() span:5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ FORWARDER_NAME=\"vppagent (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ FORWARDER_SOCKET=\"/var/lib/networkservicemesh/nsm-vppagent.forwarder.sock (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ FORWARDER_SOCKET_TYPE=\"unix (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ NSM_BASEDIR=\"/var/lib/networkservicemesh/ (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ FORWARDER_REGISTRAR_SOCKET=\"/var/lib/networkservicemesh/nsm.forwarder-registrar.io.sock (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ FORWARDER_REGISTRAR_SOCKET_TYPE=\"unix (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ METRICS_COLLECTOR_ENABLED=\"false (default value)\" span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg="Parts: [Iface Destination Gateway  Flags RefCnt Use Metric Mask  MTU Window IRTT]"
time="2020-05-10T14:48:48Z" level=info msg="Parts: [enp1s5f0 00000000 FD0F020A 0003 0 0 0 00000000 0 0 0]"
time="2020-05-10T14:48:48Z" level=info msg="Found default gateway 10.2.15.253 outgoing: enp1s5f0"
time="2020-05-10T14:48:48Z" level=info msg="INTERFACE: lo : [127.0.0.1/8 ::1/128]"
time="2020-05-10T14:48:48Z" level=info msg="INTERFACE: enp1s5f0 : [10.2.0.121/20 2001:6a8:1d80:2021:230:48ff:fe59:3a68/64 fe80::230:48ff:fe59:3a68/64]"
time="2020-05-10T14:48:48Z" level=info msg="SrcIP: 10.2.0.121, IfaceName: enp1s5f0, SrcIPNet: 10.2.0.121/20" operation=createForwartderConfig span="5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg=">><<------ config= span=5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1"
2020/05/10 14:48:48 Reporting span 5a2063ea609c548d:3215cff2a6a572c3:2102f81f101a692c:1
time="2020-05-10T14:48:48Z" level=info msg="==------> NewServer() span:5a2063ea609c548d:262e280c31c1bae9:2102f81f101a692c:1"
time="2020-05-10T14:48:48Z" level=info msg="GRPC.NewServer with open tracing enabled" operation=NewServer span="5a2063ea609c548d:262e280c31c1bae9:2102f81f101a692c:1"
2020/05/10 14:48:48 Reporting span 5a2063ea609c548d:262e280c31c1bae9:2102f81f101a692c:1
time="2020-05-10T14:48:48Z" level=info msg="vppAgentEndpoint: localhost:9111"
time="2020-05-10T14:48:48Z" level=info msg="CrossConnect - Serve starting..."
time="2020-05-10T14:48:48Z" level=info msg="MonitorServerImpl.AddRecipient: CrossConnect-&{0x9d65f0 0xc0002e6340 map[] 0xc0002b27e0}"
time="2020-05-10T14:48:48Z" level=info msg="Waiting for liveness probe: tcp:localhost:9111"
                                       __
  _  _____  ___ _______ ____ ____ ___ / /_  vpp-agent
 | |/ / _ \/ _ /___/ _ '/ _ '/ -_/ _ / __/  v3.1.0
 |___/ .__/ .__/   \_'_/\_' /\__/_//_\__/   Fri Mar 13 16:25:22 UTC 2020 (58 days ago)
    /_/  /_/           /___/                root@dcaf8a72a565 (go1.13.7 linux/amd64)

time="2020-05-10 14:48:48.60008" level=info msg="Starting agent version: v3.1.0" BuildDate="Fri Mar 13 16:25:22 UTC 2020" CommitHash=8657564 loc="agent/agent.go(134)" logger=agent
time="2020-05-10 14:48:48.60431" level=info msg="ETCD config not found, skip loading this plugin" loc="etcd/plugin_impl_etcd.go(293)" logger=etcd
time="2020-05-10 14:48:48.60463" level=info msg="Consul config not found, skip loading this plugin" loc="consul/plugin.go(138)" logger=consul
time="2020-05-10 14:48:48.60486" level=info msg="Redis config not found, skip loading this plugin" loc="redis/plugin_impl_redis.go(123)" logger=redis
time="2020-05-10 14:48:50.56993" level=info msg="Signal terminated received, stopping." loc="agent/agent.go(196)" logger=agent
time="2020-05-10 14:48:50.57023" level=info msg="Stopping agent" loc="agent/agent.go(269)" logger=agent
time="2020-05-10 14:48:50.57044" level=info msg="stopping programs" loc="supervisor/plugin.go(107)" logger=supervisor
time="2020-05-10 14:48:50.57078" level=info msg="Signal terminated received during agent start, stopping" loc="agent/agent.go(153)" logger=agent
time="2020-05-10 14:48:55.57075" level=error msg="agent failed to stop before timeout (5s)" loc="agent/agent.go(280)" logger=agent

The logs of the vpp-forwarder are not that descriptive, so I have no idea where to look at. Does anybody have an idea what could be wrong?

nickolaev commented 4 years ago

@rdimitrov can you peek here?

michaelspedersen commented 4 years ago

Hi @nicoekkart

Were you using the provided values file for deploying NSM? I just tested it on a K8s v1.17.5 cluster, and here the vppforwarder started without errors.

One thing that I can see is different is the version from the first line of your log:

time="2020-05-10 14:48:48.55223" level=info msg="Starting agent version: v0.0.0-dev" BuildDate= CommitHash= loc="agent/agent.go(134)" logger=agent

Looking at the log from my quick test, it shows:

time="2020-05-11 09:00:39.27306" level=info msg="Starting agent version: v2.3.0" BuildDate="2019-10-07T01:46-07:00" CommitHash=6d18d39033a7c6887cffb1b6524d2b4f06f0433e loc="agent/agent.go(134)" logger=agent

Another (potential) difference is that I am using a cluster where I deployed a vSwitch (make vswitch) which adds CPU isolation and other optimizations (mainly through GRUB) to the node.

nicoekkart commented 4 years ago

Hi @michaelspedersen

I indeed used a different values file, with the master nsm version. Changing it doesn't fix anything though. This is the vpp-forwarder log now.

time="2020-05-11 09:25:39.79827" level=info msg="Starting agent version: v2.3.0" BuildDate="2019-10-07T01:46-07:00" CommitHash=6d18d39033a7c6887cffb1b6524d2b4f06f0433e loc="agent/agent.go(134)" logger=agent
time="2020-05-11 09:25:39.79956" level=info msg="Agent started with 2 plugins (took 1ms)" loc="agent/agent.go(179)" logger=agent
ERROR: This binary requires CPU with SSE4.2 extensions.
time="2020-05-11T09:25:39Z" level=info msg="Starting vppagent-forwarder..."
time="2020-05-11T09:25:39Z" level=info msg="Version: v0.2.0"
time="2020-05-11T09:25:39Z" level=info msg="Creating logger from config: &{vppagent-forwarder@node3 false false [] 0xc00007b740 0xc000071f80 <nil> <nil> <nil>}"
2020/05/11 09:25:39 Initializing logging reporter
time="2020-05-11T09:25:39Z" level=info msg="==--> Start.VPPAgent.Forwarder() span:12c2a650f6f460a0:12c2a650f6f460a0:0:1"
time="2020-05-11T09:25:39Z" level=info msg="==----> CreateForwarder() span:12c2a650f6f460a0:1914912e22f72b6:12c2a650f6f460a0:1"
time="2020-05-11T09:25:39Z" level=info msg="==------> createForwartderConfig() span:12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ FORWARDER_NAME=\"vppagent (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ FORWARDER_SOCKET=\"/var/lib/networkservicemesh/nsm-vppagent.forwarder.sock (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ FORWARDER_SOCKET_TYPE=\"unix (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ NSM_BASEDIR=\"/var/lib/networkservicemesh/ (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ FORWARDER_REGISTRAR_SOCKET=\"/var/lib/networkservicemesh/nsm.forwarder-registrar.io.sock (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ FORWARDER_REGISTRAR_SOCKET_TYPE=\"unix (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ METRICS_COLLECTOR_ENABLED=\"false (default value)\" span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg="Parts: [Iface Destination Gateway  Flags RefCnt Use Metric Mask  MTU Window IRTT]"
time="2020-05-11T09:25:39Z" level=info msg="Parts: [enp1s5f0 00000000 FD0F020A 0003 0 0 0 00000000 0 0 0]"
time="2020-05-11T09:25:39Z" level=info msg="Found default gateway 10.2.15.253 outgoing: enp1s5f0"
time="2020-05-11T09:25:39Z" level=info msg="SrcIP: 10.2.0.121, IfaceName: enp1s5f0, SrcIPNet: 10.2.0.121/20" operation=createForwartderConfig span="12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg=">><<------ config=&{vppagent /var/lib/networkservicemesh/ /var/lib/networkservicemesh/nsm.forwarder-registrar.io.sock unix /var/lib/networkservicemesh/nsm-vppagent.forwarder.sock unix <nil> <nil> false 0s 10.2.0.121 0xc000186840 <nil> <nil> <nil>} span=12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1"
2020/05/11 09:25:39 Reporting span 12c2a650f6f460a0:6075448805caae6e:1914912e22f72b6:1
time="2020-05-11T09:25:39Z" level=info msg="==------> NewServer() span:12c2a650f6f460a0:1156fbbcc1177926:1914912e22f72b6:1"
time="2020-05-11T09:25:39Z" level=info msg="GRPC.NewServer with open tracing enabled" operation=NewServer span="12c2a650f6f460a0:1156fbbcc1177926:1914912e22f72b6:1"
2020/05/11 09:25:39 Reporting span 12c2a650f6f460a0:1156fbbcc1177926:1914912e22f72b6:1
time="2020-05-11T09:25:39Z" level=info msg="vppAgentEndpoint: localhost:9111"
time="2020-05-11T09:25:39Z" level=info msg="MonitorServerImpl.AddRecipient: CrossConnect-&{0xc0002072c0 0xc0002787a0 map[] 0xc000186ba0}"
time="2020-05-11T09:25:39Z" level=info msg="Waiting for liveness probe: tcp:localhost:9111"
time="2020-05-11T09:25:39Z" level=info msg="CrossConnect - Serve starting..."
                                      __
 _  _____  ___ _______ ____ ____ ___ / /_
| |/ / _ \/ _ /___/ _ '/ _ '/ -_/ _ / __/
|___/ .__/ .__/   \_'_/\_' /\__/_//_\__/  v2.3.0
   /_/  /_/           /___/

time="2020-05-11 09:25:39.83685" level=info msg="Starting agent version: v2.3.0" BuildDate="2019-10-07T01:46-07:00" CommitHash=6d18d39033a7c6887cffb1b6524d2b4f06f0433e loc="agent/agent.go(134)" logger=agent
time="2020-05-11 09:25:39.83977" level=info msg="ETCD config not found, skip loading this plugin" loc="etcd/plugin_impl_etcd.go(293)" logger=etcd
time="2020-05-11 09:25:39.84001" level=info msg="Consul config not found, skip loading this plugin" loc="consul/plugin.go(137)" logger=consul
time="2020-05-11 09:25:39.84017" level=info msg="Redis config not found, skip loading this plugin" loc="redis/plugin_impl_redis.go(123)" logger=redis
time="2020-05-11 09:25:41.81687" level=info msg="Signal terminated received, stopping." loc="agent/agent.go(196)" logger=agent
time="2020-05-11 09:25:41.81716" level=info msg="Stopping agent" loc="agent/agent.go(269)" logger=agent
time="2020-05-11 09:25:41.81732" level=info msg="stopping programs" loc="supervisor/plugin.go(107)" logger=supervisor
time="2020-05-11 09:25:41.81776" level=info msg="Signal terminated received during agent start, stopping" loc="agent/agent.go(153)" logger=agent
time="2020-05-11 09:25:46.81771" level=error msg="agent failed to stop before timeout (5s)" loc="agent/agent.go(280)" logger=agent

Are there any pointers on how I could try the vswitch without using packet?

michaelspedersen commented 4 years ago

Without details being too clear, it looks like it is VPP failing inside the container (all the information about plugins will usually show up).

In regards to running a vSwitch without Packet, you can look at the roles used in k8s_worker_vswitch_quad_intel.yml In particular:

nicoekkart commented 4 years ago

Thanks @michaelspedersen!

After running just the vpp_install and vpp_container_install steps, all seems to be fine. NSM examples also check out, so looks like it's running correctly! Thanks for the help!