Nordix / Meridio

Facilitator of attraction and distribution of external traffic within Kubernetes via secondary networks
https://meridio.nordix.org
Apache License 2.0
46 stars 9 forks source link

Old NSM interfaces might remain in proxy after OOM kill #407

Open zolug opened 1 year ago

zolug commented 1 year ago

Describe the bug When proxy process gets killed by OOM, there's no chance for the proxy to close them obviously. However, NSM might not be able to clean them up either when the MaxTokenLifetime expires due to rpc error: code = PermissionDenied desc = no sufficient privileges.

The old interfaces do not seem to cause any problem, but they are confusing. (Their IPs are also leaked.) The connected peer interface in LB also remains. (and can be used to reach the proxy).

To Reproduce Steps to reproduce the behavior:

  1. Kill a proxy process using KILL signal. (For example from its worker.)
  2. Wait the container to recover and check the NSM interfaces towards LBs.
  3. Wait MaxTokenLifetime (10 minutes by default) and check the interfaces agains. If they are present check the logs of the local nsmgr or forwarder for the Close.

Expected behavior The interfaces should be event removed (probably by NSM). (Resolving the leaked IP issue sound much more complex.)

Context

Logs

Apr 20 16:29:35.434ESC[37m [TRAC] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(2.1)    close={"id":"4967924d-fc4b-42df-b071-20824f7fa158","network_service":"load-balancer.trench-a.default","mechanism":{"cls":"LOCAL","type":"KERNEL","parameters":{"inodeURL":"inode://4/4026532326","name":"load-balan-fb82"}},"context":{"ip_context":{"src_ip_addrs":["172.16.3.3/24","fd00:0:0:3::3/64"],"dst_ip_addrs":["172.16.3.2/24","fd00:0:0:3::2/64"],"excluded_prefixes":["11.0.0.0/22","11.0.4.0/23","11.0.6.0/24","12.0.0.1/32","12.0.27.146/32","12.0.116.178/32","12.0.155.217/32","12.0.184.219/32","12.0.201.243/32","12.0.250.201/32"]},"MTU":1446},"labels":{"nodeName":"vm-002"},"path":{"index":2,"path_segments":[{"name":"proxy-trench-a-cfqvk","id":"fb82d6e2-ecaf-496e-8072-0c90da0ddfd1-proxy-trench-a-cfqvk-load-balancer.trench-a.default-load-balancer-trench-a-7c885b6d9b-xgvx8","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTY4MjAwODE3NH0.0QtsWuA3vEKWj0dEwIkkCgPu5Rf9uBD5hzT50GAqnUcWrQ0P_XfEZyCbX-Z7l-Bzzg_S0tiyAp0pcDVrVg-CFw","expires":{"seconds":1682008174,"nanos":455339069}},{"name":"nsmgr-tr5n7","id":"44a7009f-5a59-4329-a253-38ae6ccd93d5","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNjgyMDA4MTc1fQ.exoviVXnqYh4aYdzm1Qx9JpxVS5mYlMYzDQaPs7RiMjm7ilyy9rJJAQLhY258EHpd1gUa6vlqy5LfD9-bJqc2Q","expires":{"seconds":1682008175,"nanos":429149769}},{"name":"forwarder-vpp-thdt2","id":"4967924d-fc4b-42df-b071-20824f7fa158","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiXSwiZXhwIjoxNjgyMDA4MTc2fQ.kcaJ8HkJmLu5A9Q02R7ACiolZ6JUS00Vcuzxt1MgC5_fq4Vc-ob57E-yQTqDky9suZjjlMrevtJWw9fLX4pwfg","expires":{"seconds":1682008176,"nanos":273042269},"metrics":{"client_drops":"0","client_rx_bytes":"0","client_rx_packets":"0","client_tx_bytes":"0","client_tx_packets":"0","server_drops":"0","server_rx_bytes":"0","server_rx_packets":"0","server_tx_bytes":"0","server_tx_packets":"0"}},{"name":"nsmgr-27j6j","id":"ddfd106c-e964-4f9f-bb54-9b6e1912a0f8","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNjgyMDA4MTc3fQ.8BBI4hhSXXw-yoXJwSsNP3O08x9Q4TOLAuS_oTOnu37ZtjnFo3DbC6nmAJCUpSrfEOQesq8Nk2MhFeLqrKZrWA","expires":{"seconds":1682008177,"nanos":830338082}},{"name":"forwarder-vpp-cp8sp","id":"eebdbac2-5393-4cd1-a148-6aa2fb41544d","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL21lcmlkaW8tYXV0aC10cmVuY2gtYSJdLCJleHAiOjE2ODIwMDgxNzh9.0FEurb6Wm_U1OXlwGNQ7CEHWHwx1uB37m3I_oXYotGckctlmfYihBcNDlPvvX3LKWGKbjzihbf8CQfYvHhfOqw","expires":{"seconds":1682008178,"nanos":249942792},"metrics":{"client_drops":"0","client_rx_bytes":"0","client_rx_packets":"0","client_tx_bytes":"0","client_tx_packets":"0","server_drops":"0","server_rx_bytes":"0","server_rx_packets":"0","server_tx_bytes":"0","server_tx_packets":"0"}},{"name":"load-balancer-trench-a-7c885b6d9b-xgvx8","id":"30958abe-2bde-4f45-83a0-e67fa66d801b","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL21lcmlkaW8tYXV0aC10cmVuY2gtYSIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCJdLCJleHAiOjE2ODIwMDgxNzh9.5q5UYPUNFCV5aw-4yPSmG0sjGNW9JhsdXTlq_xO2wjVsV5Bcx1EMSNCevk5PBtLNthIg3O152flg6lLUjOQFZA","expires":{"seconds":1682008178,"nanos":250666242}}]},"network_service_endpoint_name":"load-balancer-trench-a-7c885b6d9b-xgvx8","payload":"ETHERNET"}
Apr 20 16:29:35.434ESC[37m [TRAC] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(3)   ⎆ sdk/pkg/networkservice/common/updatetoken/updateTokenServer.Close()
Apr 20 16:29:35.434ESC[37m [TRAC] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(3.1)     close-diff={"path":{"path_segments":{"2":{"expires":{"nanos":434310849,"seconds":1682008775},"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiXSwiZXhwIjoxNjgyMDA4Nzc1fQ.nbjavxp1tsSZxvPiiKISnXo8WFsVmRWhX9AL7-ZuSLEpPjtfh8Xpddia3SKjjL7hDWS6hAlMOpT1NTXqM8s0Zg"}}}}
Apr 20 16:29:35.434ESC[37m [TRAC] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(4)    ⎆ sdk/pkg/networkservice/common/authorize/authorizeServer.Close()
Apr 20 16:29:35.435ESC[31m [ERRO] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(4.1)      rpc error: code = PermissionDenied desc = no sufficient privileges;  Error returned from sdk/pkg/networkservice/common/authorize/authorizeServer.Close;      github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.logError;               /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/trace/common.go:206;  github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.(*beginTraceServer).Close;              /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/trace/server.go:71;   github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.(*endTraceServer).Close;                /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/trace/server.go:86;   github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/common/updatetoken.(*updateTokenServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/common/updatetoken/server.go:84;   github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.(*beginTraceServer).Close;              /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/trace/server.go:69;   github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.(*endTraceServer).Close;                /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/trace/server.go:86;   github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/core/next/server.go:65;    github.com/networkservicemesh/sdk/pkg/networkservice/common/begin.(*eventFactoryServer).Close.func1;            /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.6.2/pkg/networkservice/common/begin/event_factory.go:225; github.com/edwarnicke/serialize.(*Executor).process;            /root/go/pkg/mod/github.com/edwarnicke/serialize@v1.0.7/serialize.go:68;        runtime.goexit;         /go/src/runtime/asm_amd64.s:1571;       
Apr 20 16:29:35.435ESC[37m [TRAC] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(3.2)     close-response-diff={"id":"4967924d-fc4b-42df-b071-20824f7fa158","path":{"index":2,"path_segments":{"2":{"expires":{"nanos":434310849,"seconds":1682008775},"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiXSwiZXhwIjoxNjgyMDA4Nzc1fQ.nbjavxp1tsSZxvPiiKISnXo8WFsVmRWhX9AL7-ZuSLEpPjtfh8Xpddia3SKjjL7hDWS6hAlMOpT1NTXqM8s0Zg"}}}}
Apr 20 16:29:35.435ESC[31m [ERRO] [id:4967924d-fc4b-42df-b071-20824f7fa158] [type:networkService] ESC[0m(3.3)     Error returned from sdk/pkg/networkservice/common/authorize/authorizeServer.Close: rpc error: code = PermissionDenied desc = no sufficient privileges
Apr 20 16:29:35.854ESC[37m [TRAC] [id:5ae244a3-d784-4ec5-b3b4-0d5951b1b7db] [type:networkService] ESC[0m(2.1)    close={"id":"5ae244a3-d784-4ec5-b3b4-0d5951b1b7db","network_service":"load-balancer.trench-a.default","mechanism":{"cls":"LOCAL","type":"KERNEL","parameters":{"inodeURL":"inode://4/4026532326","name":"load-balan-0515"}},"context":{"ip_context":{"src_ip_addrs":["172.16.3.5/24","fd00:0:0:3::5/64"],"dst_ip_addrs":["172.16.3.4/24","fd00:0:0:3::4/64"],"excluded_prefixes":["11.0.0.0/22","11.0.4.0/23","11.0.6.0/24","12.0.0.1/32","12.0.27.146/32","12.0.116.178/32","12.0.155.217/32","12.0.184.219/32","12.0.201.243/32","12.0.250.201/32"]},"MTU":1500},"labels":{"nodeName":"vm-002"},"path":{"index":2,"path_segments":[{"name":"proxy-trench-a-cfqvk","id":"051579bb-4ced-459c-bd02-e17721370210-proxy-trench-a-cfqvk-load-balancer.trench-a.default-load-balancer-trench-a-7c885b6d9b-k7fnk","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTY4MjAwODE3NH0.uNZ4KPC2NZbqHDIxpXFbLhxxXSWhkZ1VbD3s3k4yVr86mMP0w24Vw7DsWmVyIv3ywDMMqUIHyOvA2tHnIdY7UQ","expires":{"seconds":1682008174,"nanos":472660529}},{"name":"nsmgr-tr5n7","id":"3902abe8-3678-43b9-bcbf-569c716a8685","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNjgyMDA4MTc1fQ.1FuetkXxojB8W_CsYdydm3Bswz_e27-ttZoTvfCL_lRQiHDvCQtvOcBhitrIRq64snn8dfMSWw_p6LoNwGCkAw","expires":{"seconds":1682008175,"nanos":852200649}},{"name":"forwarder-vpp-thdt2","id":"5ae244a3-d784-4ec5-b3b4-0d5951b1b7db","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL21lcmlkaW8tYXV0aC10cmVuY2gtYSJdLCJleHAiOjE2ODIwMDgxNzZ9.LNIU7MDeu7AJtRjU2n5xPRa4Q-zn_uagGjQ5NQTe3oSrbLf-r_Mw6AvnzR_Z_3izq77Yq4FEu3P9MxmMjvnFbg","expires":{"seconds":1682008176,"nanos":643886529},"metrics":{"client_drops":"0","client_rx_bytes":"0","client_rx_packets":"0","client_tx_bytes":"0","client_tx_packets":"0","server_drops":"0","server_rx_bytes":"0","server_rx_packets":"0","server_tx_bytes":"0","server_tx_packets":"0"}},{"name":"load-balancer-trench-a-7c885b6d9b-k7fnk","id":"db2bd643-5505-4c38-9704-7eb194b51f44","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL21lcmlkaW8tYXV0aC10cmVuY2gtYSIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCJdLCJleHAiOjE2ODIwMDgxNzZ9.m0FIfhJPKfomZ6eXTBBWef7gjuc1xddGLfo959MFBg4zvNapP2Ephv-sYKVSrcunCRzE2uU88cyTdU0CkkXMFw","expires":{"seconds":1682008176,"nanos":644706639}}]},"network_service_endpoint_name":"load-balancer-trench-a-7c885b6d9b-k7fnk","payload":"ETHERNET"}
zolug commented 2 months ago

There were improvements in NSM related to cleaning up connections. Haven't noticed "PersmissionDenied" error for a while.

There were also improvements in Meridio: -IP garbage collection for Meridio IPAM: https://github.com/Nordix/Meridio/issues/527 -Proxy can restore previous connections to LBs by learning them from NSMgr using NSM's Monitor Connection feature: https://github.com/Nordix/Meridio/pull/536 https://github.com/Nordix/Meridio/issues/542

Run verification tests with most recent Meridio and NSM >= v.1.13.2