Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent
http://azure.microsoft.com/
Apache License 2.0
533 stars 370 forks source link

[BUG] AlmaLinux 8 support #3052

Open filipeferreira89 opened 6 months ago

filipeferreira89 commented 6 months ago

Describe the bug: A clear and concise description of what the bug is.

Even though RHEL 8 and AlmaLinux are almost identical twins, there is no official support yet for AlmaLinux 8, whoever it's already supporting AlmaLinux 9.

We identified a bug with the WALinuxAgent 2.9.1.1 version on AlmaLinux 8 VMs. Sometimes, when we trigger a change on the hostname of the virtual machine with the following command hostnamectl set-hostname testvm000008a, WALinuxAgent triggers a restart of the NetworkManager Service and the eth0 can enter in a failed state and the vm loses any routes.

waagent.log

INFO EnvHandler ExtHandler EnvMonitor: Detected hostname change: testvm000008A -> testvm000008a
INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
INFO EnvHandler ExtHandler Primary interface is [eth0]
INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
INFO EnvHandler ExtHandler Primary interface is [eth0]
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8459] dhcp: init: Using DHCP client 'internal'
Feb  7 18:00:08 testvm000008A waagent[10128]: 2024-02-07T18:00:08.849389Z INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
Feb  7 18:00:08 testvm000008A waagent[10128]: 2024-02-07T18:00:08.849802Z INFO EnvHandler ExtHandler Primary interface is [eth0]
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8459] device (lo): carrier: link connected
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8461] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8468] device (eth0): carrier: link connected
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8473] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8477] manager: (eth0): assume: will attempt to assume matching connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03) (indicated)
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8478] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8483] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8490] device (eth0): Activation: starting connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8493] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8495] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8497] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8500] dhcp4 (eth0): activation: beginning transaction (timeout in 300 seconds)
Feb  7 18:00:08 testvm000008A waagent[10128]: 2024-02-07T18:00:08.856208Z INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
Feb  7 18:00:08 testvm000008A waagent[10128]: 2024-02-07T18:00:08.856608Z INFO EnvHandler ExtHandler Primary interface is [eth0]
Feb  7 18:00:08 testvm000008A dbus-daemon[837]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.51' (uid=0 pid=35175 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8589] dhcp4 (eth0): state changed new lease, address=AAA.BBB.CCC.DDD
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.8593] policy: set 'System eth0' (eth0) as default for IPv4 routing and DNS
Feb  7 18:00:08 testvm000008A kernel: IPv4: martian source AAA.BBB.CCC.DDD from 169.254.169.254, on dev eth0
Feb  7 18:00:08 testvm000008A kernel: ll header: 00000000: 00 0d 3a 29 d6 67 12 34 56 78 9a bc 08 00        ..:).g.4Vx....
Feb  7 18:00:08 testvm000008A kernel: IPv4: martian source AAA.BBB.CCC.DDD from 169.254.169.254, on dev eth0
Feb  7 18:00:08 testvm000008A kernel: ll header: 00000000: 00 0d 3a 29 d6 67 12 34 56 78 9a bc 08 00        ..:).g.4Vx....
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9113] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9134] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9136] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'assume')
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9138] manager: NetworkManager state is now CONNECTED_SITE
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9140] device (eth0): Activation: successful, device activated.
Feb  7 18:00:08 testvm000008A NetworkManager[35175]: <info>  [1707328808.9144] manager: startup complete
Feb  7 18:00:08 testvm000008A systemd[1]: Started Network Manager Wait Online.
Feb  7 18:00:08 testvm000008A systemd[1]: nm-cloud-setup.service: Succeeded.
Feb  7 18:00:08 testvm000008A systemd[1]: Stopped Automatically configure NetworkManager in cloud.
Feb  7 18:00:08 testvm000008A systemd[1]: Starting Automatically configure NetworkManager in cloud...
Feb  7 18:00:08 testvm000008A systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Feb  7 18:00:16 testvm000008A systemd[1]: nm-cloud-setup.service: Succeeded.
Feb  7 18:00:16 testvm000008A systemd[1]: Started Automatically configure NetworkManager in cloud.
Feb  7 18:00:19 testvm000008A systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Feb  7 18:00:38 testvm000008A systemd[1]: systemd-hostnamed.service: Succeeded.
Feb  7 18:00:42 testvm000008A waagent[10128]: 2024-02-07T18:00:42.974630Z ERROR ExtHandler ExtHandler Error fetching the goal state: [ProtocolError] [Wireserver Exception] [HttpError] [HTTP Failed] GET http://168.63.129.16/machine/ -- IOError [Errno 101] Network is unreachable -- 6 attempts made
2024-02-04T14:56:07.661404Z INFO EnvHandler ExtHandler EnvMonitor: Detected hostname change: testvm000008A -> testvm000008a
2024-02-07T14:56:07.787131Z INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
2024-02-07T14:56:07.787731Z INFO EnvHandler ExtHandler Primary interface is [eth0]
2024-02-07T14:56:07.796409Z INFO EnvHandler ExtHandler Examine /proc/net/route for primary interface
2024-02-07T14:56:07.796754Z INFO EnvHandler ExtHandler Primary interface is [eth0]

Note: Please add some context which would help us understand the problem better

  1. Section of the log where the error occurs.
  2. Serial console output
  3. Steps to reproduce the behavior.

Distro and WALinuxAgent details (please complete the following information):

maddieford commented 6 months ago

Thanks for opening this issue @filipeferreira89. We recently became aware of this and have a fix for redhat 7.* in our new release that we are just starting. We'll continue addressing in other fedora distros in subsequent releases. We will prioritize endorsed distros, so alma 8 may take us longer.

Could you please share the exact version of alma 8 you're seeing this on?

filipeferreira89 commented 6 months ago

Hi @maddieford, We are seeing it on AlmaLinux 8.7 and 8.9, but I believe that the bug is there for any previous "minors". When the release is done for redhat 7, we can test it out to see if it works also on AlmaLinux 8. Since you talked about rhel 7, I believe we were having the same issue (with less occurrences) on CentOS 7, but we never did a proper investigation like this one.

maddieford commented 6 months ago

@filipeferreira89 The fix in the next release includes CentOS 7. Could you please also share the full waagent.log? Or at least 10 minutes of logs after this failure:

2024-02-07T18:00:42.974630Z ERROR ExtHandler ExtHandler Error fetching the goal state: [ProtocolError] [Wireserver Exception] [HttpError] [HTTP Failed] GET http://168.63.129.16/machine/ -- IOError [Errno 101] Network is unreachable -- 6 attempts made

filipeferreira89 commented 6 months ago

waagent.log This might not be by the same vm as the other logs, but the issue is the same and they both were configured through the same process and commands

maddieford commented 6 months ago

Thanks. Also, what regions are you in for CentOS 7? As we progress our release with the fix for fedora 7.* versions, are you able to confirm you're no longer seeing the issue? I can update here when we've promoted the release to the regions you're in.

filipeferreira89 commented 6 months ago

We use West Europe, North Europe and China East 2. Yes, as soon you have a new version I can test to see if I can replicate the issue.

maddieford commented 6 months ago

@filipeferreira89 We just published 2.10.0.8 with the fix for centos 7 hostname publishing to North Europe. Would you be able to validate you're no longer seeing this issue there for CentOs 7? Thank you