Nordix / Meridio

Facilitator of attraction and distribution of external traffic within Kubernetes via secondary networks
https://meridio.nordix.org
Apache License 2.0

Xcluster; forwarder-vpp in nsm v1.8.0 crashes #388

Open · uablrek opened this issue 1 year ago

uablrek commented 1 year ago

Describe the bug

One trench usually works, but when more trenches are started, the forwarder-vpp on the nodes where the Meridio LB is deployed goes into a crash loop.

I saw an OOM kill once while monitoring with "watch", but I am far from sure that is the root cause.
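
A way to confirm or rule out the OOM theory (a sketch; the pod name is taken from the events further down) is to check the last terminated state of the container, since kubelet records OOMKilled there when the kernel OOM killer was involved:

# kubectl get pod forwarder-vpp-kksqq -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'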

Expected behavior

No crash

Context

Logs

The last entries before the restart are shown below:

Mar  3 07:44:54.488 [ERRO] [cmd:/bin/forwarder] signal: killed
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=168 context=22020097 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=169 context=22151169 data_len=24 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=d0678b13 msg_id=818 msg_name=sw_interface_set_l2_bridge seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=170 context=22282241 data_len=74 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=863fa648 msg_id=1302 msg_name=af_packet_delete seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=171 context=22413313 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=172 context=22544385 data_len=74 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=863fa648 msg_id=1302 msg_name=af_packet_delete seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=173 context=22675457 data_len=19 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=4fa28a85 msg_id=816 msg_name=sw_interface_set_l2_xconnect seq_num=1
time="2023-03-03T07:44:54Z" level=warning msg="Unable to send message" channel=174 context=22806529 data_len=138 error="write unix @->/var/run/vpp/api.sock: write: broken pipe" is_multi=false msg_crc=2457116d msg_id=1373 msg_name=add_node_next seq_num=1
Mar  3 07:44:54.513 [ERRO] [cmd:/bin/forwarder] <nil>
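
Note that the "broken pipe" writes to /var/run/vpp/api.sock indicate that the VPP process behind the API socket was already gone when these messages were sent, i.e. VPP died first and the forwarder's API calls failed afterwards. If the kernel OOM killer took VPP down, it should leave a trace on the node (a sketch, to be run on vm-002):

# dmesg | grep -iE 'out of memory|oom'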

I interpret the "signal: killed" as a failed probe, which is also indicated by:

# kubectl describe pod forwarder-vpp-kksqq
...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  8m4s                 default-scheduler  Successfully assigned default/forwarder-vpp-kksqq to vm-002
  Normal   Pulling    8m4s                 kubelet            Pulling image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0"
  Normal   Pulled     8m1s                 kubelet            Successfully pulled image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" in 3.191323307s (3.191332663s including waiting)
  Warning  Unhealthy  97s (x3 over 7m59s)  kubelet            Readiness probe failed: timeout: failed to connect service "unix:///listen.on.sock" within 1s
  Normal   Created    58s (x4 over 8m)     kubelet            Created container forwarder-vpp
  Normal   Started    58s (x4 over 8m)     kubelet            Started container forwarder-vpp
  Normal   Pulled     58s (x3 over 2m5s)   kubelet            Container image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" already present on machine
  Warning  Unhealthy  57s                  kubelet            Readiness probe failed: command timed out
  Warning  BackOff    44s (x9 over 115s)   kubelet            Back-off restarting failed container forwarder-vpp in pod forwarder-vpp-kksqq_default(afab3247-3397-415d-a4a1-f943c279a981)
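
The crash details themselves end up in the log of the previous container instance, which can be saved before it rotates away (a sketch, same pod as above):

# kubectl logs forwarder-vpp-kksqq --previous > forwarder-vpp-prev.log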

Even when everything is OK (one trench), there are error printouts in the forwarder-vpp log. The problems seem to be transient though:

Mar  3 07:39:26.091 [ERRO] [cmd:/bin/forwarder] [component:netNsMonitor] [inodeURL:inode://4/4026532665] inode 4026532665 is not found in /proc
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=error msg="Channel ID not known, ignoring the message." channel=132 msg_id=24
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:41:15Z" level=error msg="Channel ID not known, ignoring the message." channel=176 msg_id=24

I also got an enormous stack trace once.

uablrek commented 1 year ago

Altering the liveness/readiness timeouts did not help, and neither did removing the probes entirely. Forwarder-vpp still crashes with "signal: killed". Since no probe fails, I don't know where the signal comes from.
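
For reference, a minimal sketch of what "removing the probes" can look like, assuming forwarder-vpp runs as a DaemonSet named forwarder-vpp in the nsm-system namespace (adjust to the actual deployment):

# kubectl -n nsm-system patch daemonset forwarder-vpp --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'

With the probes gone, the remaining senders of SIGKILL are essentially the kernel OOM killer and kubelet eviction, both of which should leave a trace in the node log or in "kubectl get events". That would fit the OOM kill observed earlier.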