Open jrajahalme opened 1 month ago
Full agent logs: logs-cilium-x8hcj-cilium-agent-20241003-071420.log
Full envoy logs: logs-cilium-envoy-kgzx9-cilium-envoy-20241003-071420.log
The affected worker thread [20] is doing something between 07:14:02 and 07:14:22:
2024-10-03T07:14:02.124384806Z [2024-10-03 07:14:02.124][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"128"] closing socket: 1
2024-10-03T07:14:12.357880125Z [2024-10-03 07:14:12.357][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"128"] raising connection event 1
2024-10-03T07:14:12.357898981Z [2024-10-03 07:14:12.357][20][trace][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:126] [Tags: "ConnectionId":"128"] tcp connection on event 1
2024-10-03T07:14:12.357904441Z [2024-10-03 07:14:12.357][20][debug][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:136] [Tags: "ConnectionId":"128"] adding to cleanup list
2024-10-03T07:14:12.357909881Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=1)
2024-10-03T07:14:12.357915311Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=2)
2024-10-03T07:14:12.357920881Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:122] clearing deferred deletion list (size=2)
2024-10-03T07:14:12.358216078Z [2024-10-03 07:14:12.357][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:146] [Tags: "ConnectionId":"132"] closing data_to_write=0 type=1
2024-10-03T07:14:12.358227920Z [2024-10-03 07:14:12.358][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"132"] closing socket: 1
2024-10-03T07:14:22.599030373Z [2024-10-03 07:14:22.596][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"132"] raising connection event 1
2024-10-03T07:14:22.599052324Z [2024-10-03 07:14:22.596][20][trace][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:126] [Tags: "ConnectionId":"132"] tcp connection on event 1
2024-10-03T07:14:22.599058585Z [2024-10-03 07:14:22.596][20][debug][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:136] [Tags: "ConnectionId":"132"] adding to cleanup list
2024-10-03T07:14:22.599092148Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=1)
2024-10-03T07:14:22.599097278Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=2)
2024-10-03T07:14:22.599101696Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:122] clearing deferred deletion list (size=2)
2024-10-03T07:14:22.599105864Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:22.599109921Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1165] Cilium updating network policy for endpoint fd00:10:244:2::ed86
2024-10-03T07:14:22.599113959Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1165] Cilium updating network policy for endpoint 10.244.2.133
It is curious that the policy update starts progressing in the exact same millisecond as the close event on connection [132], after that connection has drained. There is no earlier history for this connection in the logs, but it looks as if the worker event loop was stalled until that connection had drained out?
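To check whether other worker threads show the same pattern, the per-thread silences can be pulled out of the envoy log with a small script. This is only a rough sketch: `find_thread_gaps`, the 5 second threshold, and the `SAMPLE` list are made up for illustration, and it assumes every line carries the `[timestamp][thread-id]` prefix seen in the excerpt above. A silence in the log only suggests a stall, it does not prove one.

```python
import re
from datetime import datetime, timedelta

# Envoy-side prefix as seen in the excerpt: "[2024-10-03 07:14:02.124][20]"
LINE_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]\[(\d+)\]")


def find_thread_gaps(lines, threshold=timedelta(seconds=5)):
    """Return (thread_id, gap_start, gap_end) for each per-thread silence longer than threshold."""
    last_seen = {}  # thread id -> timestamp of its previous log line
    gaps = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines without the expected prefix
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        tid = int(m.group(2))
        prev = last_seen.get(tid)
        if prev is not None and ts - prev > threshold:
            gaps.append((tid, prev, ts))
        last_seen[tid] = ts
    return gaps


# Four lines copied from the excerpt above (thread 20, connections 128 and 132).
SAMPLE = [
    '2024-10-03T07:14:02.124384806Z [2024-10-03 07:14:02.124][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"128"] closing socket: 1',
    '2024-10-03T07:14:12.357880125Z [2024-10-03 07:14:12.357][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"128"] raising connection event 1',
    '2024-10-03T07:14:12.358227920Z [2024-10-03 07:14:12.358][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"132"] closing socket: 1',
    '2024-10-03T07:14:22.599030373Z [2024-10-03 07:14:22.596][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"132"] raising connection event 1',
]

if __name__ == "__main__":
    for tid, start, end in find_thread_gaps(SAMPLE):
        print(f"thread {tid}: silent for {(end - start).total_seconds():.3f}s from {start.time()}")
```

On these four lines it reports two silences of roughly 10.2 seconds each on thread 20, both sitting between a "closing socket" line and the matching "raising connection event" line. To run it against the full file, pass the open file object as `lines`.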
NPDS version 247 is received at 07:14:02. Some worker threads update immediately afterwards, but others only at 07:14:22, so the Cilium Agent is not able to get ACKs on policy updates in time (the 100ms timeout is much shorter than 20 seconds).
From logs-cilium-envoy-kgzx9-cilium-envoy-20241003-071420.log:
Agent logs (logs-cilium-x8hcj-cilium-agent-20241003-071420.log):