containers / qm

QM is a containerized environment for running Functional Safety qm (Quality Management) software
https://github.com/containers/qm

CI/CD e2e-tier-0 tests failures in bluechi-controller #462

Closed: Yarboa closed this issue 1 week ago

Yarboa commented 1 month ago

The tier e2e tests started to fail on qm-node:

Jun 14 15:48:57 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:49:53 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

while the bluechi-agent inside qm indicates connectivity in its log. Status on the control side:

podman exec control bash -c "systemctl status bluechi-controller"
● bluechi-controller.service - BlueChi Controller systemd service
     Loaded: loaded (/usr/lib/systemd/system/bluechi-controller.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:12:45 UTC; 1h 40min ago

Jun 14 15:52:16 control bluechi-controller[158]: Node 'qm-node1' disconnected
Jun 14 15:52:45 control bluechi-controller[158]: Registered managed node from fd 11 as 'qm-node1'

While the agent on node1 reports:

podman exec node1 bash -c "systemctl status bluechi-agent"
● bluechi-agent.service - BlueChi systemd service controller agent daemon
     Loaded: loaded (/usr/lib/systemd/system/bluechi-agent.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-06-14 14:14:00 UTC; 1h 40min ago

Jun 14 14:14:00 node1 systemd[1]: Started BlueChi systemd service controller agent daemon.
Jun 14 14:14:00 node1 bluechi-agent[1960]: Starting bluechi-agent 0.9.0-0.202405230627.git23191d3
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connecting to controller on tcp:host=10.90.0.2,port=842
Jun 14 14:14:00 node1 bluechi-agent[1960]: Connected to controller as 'node1'
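To measure how regular the disconnect/reconnect cycle is, one option (a sketch; container and unit names taken from the output above) is to pull the matching journal lines, which carry timestamps:

podman exec control bash -c 'journalctl -u bluechi-controller --no-pager | grep -E "disconnected|Registered managed node"'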

More tests here; the network appears to go down every 60-90 seconds. Installed test tooling in node1:

dnf -y install --releasever 9 --installroot /usr/lib/qm/rootfs python iputils

ControllerHost=10.90.0.2
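For context, that ControllerHost value comes from the agent configuration. A minimal sketch of such a drop-in (the file path is an assumption; NodeName and ControllerPort are taken from the agent log above):

# /etc/bluechi/agent.conf.d/agent.conf  (path assumed)
[bluechi-agent]
NodeName=qm-node1
ControllerHost=10.90.0.2
ControllerPort=842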

[root@node1 ~]# podman exec qm bash -c "ping 10.90.0.2"

bash-5.1# ping 10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.
64 bytes from 10.90.0.2: icmp_seq=1 ttl=63 time=0.255 ms
64 bytes from 10.90.0.2: icmp_seq=14 ttl=63 time=0.213 ms
64 bytes from 10.90.0.2: icmp_seq=48 ttl=63 time=2368 ms
64 bytes from 10.90.0.2: icmp_seq=49 ttl=63 time=1344 ms
64 bytes from 10.90.0.2: icmp_seq=50 ttl=63 time=320 ms
64 bytes from 10.90.0.2: icmp_seq=51 ttl=63 time=0.282 ms
64 bytes from 10.90.0.2: icmp_seq=52 ttl=63 time=0.136 ms
64 bytes from 10.90.0.2: icmp_seq=53 ttl=63 time=0.135 ms
64 bytes from 10.90.0.2: icmp_seq=54 ttl=63 time=0.216 ms
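To timestamp when the outages start and how long they last, a sketch using iputils flags (log path is arbitrary):

# -D prints a Unix timestamp per line, -O reports each missed reply as "no answer yet"
podman exec qm bash -c "ping -D -O 10.90.0.2" | tee /tmp/ping-qm.log
grep "no answer" /tmp/ping-qm.log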

The same behavior is seen from the network namespace itself:

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 ping 10.90.0.2
PING 10.90.0.2 (10.90.0.2) 56(84) bytes of data.

while ping from node1 to the controller address is not interrupted.

Yarboa commented 1 month ago

I also see this from ip netns:

[root@node1 ~]# ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 netstat -st
IcmpMsg:
    InType0: 2594
    OutType3: 1
    OutType8: 9218
Tcp:
    1756 active connection openings
    0 passive connection openings
    0 failed connection attempts
    1717 connection resets received
    1 connections established
    136362 segments received
    141673 segments sent out
    14798 segments retransmitted
    0 bad segments received
    16 resets sent
UdpLite:
TcpExt:
    3 TCP sockets finished time wait in fast timer
    9 packets rejected in established connections because of timestamp
    1747 delayed acks sent
    Quick ack mode was activated 763 times
    1759 packet headers predicted
    61560 acknowledgments not containing data payload received
    32411 predicted acknowledgments
    TCPLostRetransmit: 11314
    TCPTimeouts: 13056
    TCPLossProbes: 1742
    TCPBacklogCoalesce: 5
    TCPDSACKOldSent: 763
    TCPRcvCoalesce: 70
    TCPOrigDataSent: 39469
    TCPKeepAlive: 62338
    TCPDelivered: 37735
    TcpTimeoutRehash: 13056
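TCPLostRetransmit: 11314 and TCPTimeouts: 13056 look high for this traffic volume. To check whether the counters climb in step with the 60-90 second outages, they could be sampled periodically (netns name copied from above; the 10s interval is arbitrary):

ip netns exec netns-f7133da7-ba6f-4ba2-f366-ad80f5835436 sh -c \
  'while true; do date; netstat -st | grep -E "retransmitted|TCPTimeouts"; sleep 10; done'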

@dougsland Maybe we need to sync all the containers with NTP.
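Before adding NTP it may be worth comparing the clocks in UTC, which would rule out a timezone-only difference; a quick sketch:

for c in control node1; do echo -n "$c: "; podman exec "$c" date -u; done
date -u   # host clock for comparison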

Yarboa commented 1 month ago

I also see this

[root@default-0 ~]# date
Sun Jun 16 04:37:05 AM EDT 2024
[root@default-0 ~]# podman exec -it node1 bash
[root@node1 ~]# date
Sun Jun 16 08:37:17 UTC 2024
[root@node1 ~]# 
[root@node1 ~]# exit
exit
[root@default-0 ~]# podman exec -it control bash
[root@control ~]# date
Sun Jun 16 08:37:33 UTC 2024

The clocks actually agree (04:37 EDT is 08:37 UTC); only the timezone display differs. Need to check adding --tz=local to control and node1.

Followed this blog https://www.redhat.com/sysadmin/tick-tock-container-time
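If the timezone display is what matters, the containers could be recreated with podman's --tz flag; a minimal sketch (image and remaining options are placeholders):

# --tz=local makes the container use the host's timezone
podman run -d --tz=local --name control <image>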

dougsland commented 3 weeks ago


I remember this one: https://github.com/containers/qm/issues/394
