I interpret the "signal: killed" as a failed probe, which is also indicated by:
# kubectl describe pod forwarder-vpp-kksqq
...
Events:
Type     Reason     Age                  From               Message
----     ------     ----                 ----               -------
Normal   Scheduled  8m4s                 default-scheduler  Successfully assigned default/forwarder-vpp-kksqq to vm-002
Normal   Pulling    8m4s                 kubelet            Pulling image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0"
Normal   Pulled     8m1s                 kubelet            Successfully pulled image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" in 3.191323307s (3.191332663s including waiting)
Warning  Unhealthy  97s (x3 over 7m59s)  kubelet            Readiness probe failed: timeout: failed to connect service "unix:///listen.on.sock" within 1s
Normal   Created    58s (x4 over 8m)     kubelet            Created container forwarder-vpp
Normal   Started    58s (x4 over 8m)     kubelet            Started container forwarder-vpp
Normal   Pulled     58s (x3 over 2m5s)   kubelet            Container image "ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.8.0" already present on machine
Warning  Unhealthy  57s                  kubelet            Readiness probe failed: command timed out
Warning  BackOff    44s (x9 over 115s)   kubelet            Back-off restarting failed container forwarder-vpp in pod forwarder-vpp-kksqq_default(afab3247-3397-415d-a4a1-f943c279a981)
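The two readiness failures above seem to come from different layers: the "timeout: failed to connect service "unix:///listen.on.sock" within 1s" text looks like output from the probe command itself (a gRPC health probe giving up after its own 1s connect timeout), while "command timed out" appears to be kubelet reporting that the exec probe exceeded its timeoutSeconds. As a diagnostic sketch, the configured probes and their timeouts can be read straight from the pod spec (the container name forwarder-vpp is taken from the events above; the jsonpath fields are standard Kubernetes):

# show the configured readiness/liveness probes, including timeoutSeconds and periodSeconds
kubectl get pod forwarder-vpp-kksqq -o jsonpath='{.spec.containers[?(@.name=="forwarder-vpp")].readinessProbe}'
kubectl get pod forwarder-vpp-kksqq -o jsonpath='{.spec.containers[?(@.name=="forwarder-vpp")].livenessProbe}'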
Even when everything is OK (one trench) there are error printouts in the forwarder-vpp log. The problem seems to be transient, though:
Mar 3 07:39:26.091 [ERRO] [cmd:/bin/forwarder] [component:netNsMonitor] [inodeURL:inode://4/4026532665] inode 4026532665 is not found in /proc
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:39:26Z" level=error msg="Channel ID not known, ignoring the message." channel=132 msg_id=24
time="2023-03-03T07:39:26Z" level=info msg="No subscription found for the notification message." msg_id=701 msg_size=19
time="2023-03-03T07:41:15Z" level=error msg="Channel ID not known, ignoring the message." channel=176 msg_id=24
Altering the liveness/readiness timeouts did not help, and neither did removing the probes entirely. The forwarder-vpp still crashes with "signal: killed". Since no probe fails in that case, I don't know where the signal comes from.
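"signal: killed" only says that the process received SIGKILL; both kubelet (when it force-kills a container, e.g. after a failed liveness probe or during eviction) and the kernel OOM killer deliver SIGKILL, so the container's last termination state is the quickest way to tell them apart. A minimal sketch, using the pod/container names from the events above (an OOM kill typically shows reason OOMKilled and exit code 137):

# exit code, reason and signal of the previous termination
kubectl get pod forwarder-vpp-kksqq -o jsonpath='{.status.containerStatuses[?(@.name=="forwarder-vpp")].lastState.terminated}'

# full log of the crashed instance, useful for capturing the stack trace printed before the restart
kubectl logs forwarder-vpp-kksqq -c forwarder-vpp --previous

# on the node (vm-002 above), the kernel log confirms whether the oom-killer fired
dmesg | grep -i oom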
Describe the bug
One trench usually works, but when starting more trenches the forwarder-vpp on the nodes where the Meridio LB is deployed dives into a crash-loop. I saw an OOM-kill once when monitoring with "watch", but I am very unsure that's the root cause.
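To check the OOM theory it may help to compare the forwarder's memory limit with its actual usage while trenches are added. This is only a sketch: the daemonset name forwarder-vpp and the label app=forwarder-vpp are assumptions about this particular deployment, and kubectl top needs metrics-server:

# configured requests/limits of the forwarder container
kubectl get daemonset forwarder-vpp -o jsonpath='{.spec.template.spec.containers[?(@.name=="forwarder-vpp")].resources}'

# per-container memory usage of the forwarder pods while adding trenches
kubectl top pod -l app=forwarder-vpp --containers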
Expected behavior
No crash
Context
Logs
The last entries before re-start below:
I also got an enormous stack trace once.