GoogleCloudPlatform / guest-agent


google-guest-agent after 20240701.00 persists a file that locks systemd-networkd to a specific interface device name #401

Open char8 opened 1 month ago

char8 commented 1 month ago

We pulled in a new release of the guest agent (1:20240701.00-g1) incorporating https://github.com/GoogleCloudPlatform/guest-agent/pull/396 and https://github.com/GoogleCloudPlatform/guest-agent/pull/386 during a packer build of a new VM image.

This guest agent now writes a file /etc/netplan/20-google-guest-agent-ethernet.yaml with the contents:

network:
    version: 2
    ethernets:
        ens5:
            match:
                name: ens5
            mtu: 1460
            dhcp4: true
            dhcp4-overrides:
                use-domains: true

vs. the previous default /etc/netplan/90-default.yaml

network:
    version: 2
    ethernets:
        all-en:
            match:
                name: en*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
        all-eth:
            match:
                name: eth*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true

The interface on the build instance is ens4, and the 20-google-guest-agent-ethernet.yaml file hardcodes that interface name into the image.

When the image is run on a new VM that has a different network interface name (e.g. we're seeing ens5 on some VMs), the interface fails to come up because there is no matching declaration in the config file. This effectively breaks networking on the box: ens5 is never brought up, since /run/systemd/network/10-netplan-all-en.network is no longer generated.
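If you can get onto the affected instance (e.g. via the serial console), a quick way to confirm this failure mode is to compare what netplan rendered against the interfaces that actually exist. A hedged diagnostic sketch; interface and file names will differ per VM:

# list links and whether systemd-networkd is managing them
networkctl list

# only the hardcoded interface's rendered unit is present;
# the wildcard all-en / all-eth units from 90-default.yaml are gone
ls -l /run/systemd/network/

# show which netplan files the agent left behind in the image
ls -l /etc/netplan/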

We're running the debian-cloud/debian-12 image with:

netplan.io                            0.106-2+deb12u1
systemd                               252.26-1~deb12u2
google-guest-agent           1:20240701.00-g1

Post reboot, the guest agent crashes because it can't reach the metadata API (since ens5 is not up), so presumably it is never able to regenerate the config for the new interface name.

2024-07-16T03:39:09.740174+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 google_guest_agent[1121]: ERROR instance_setup.go:159 Failed to reach MDS(all retries exhausted): exhausted all (100) retries, last error: request failed with status code: [-1], error: [error connecting to metadata server: Get "http://169.254.169.254/computeMetadata/v1/?alt=json&recursive=true&timeout_sec=60": dial tcp 169.254.169.254:80: connect: network is unreachable]
2024-07-16T03:39:09.740991+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 systemd[1]: google-guest-agent.service: Main process exited, code=exited, status=1/FAILURE
2024-07-16T03:39:09.741072+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 systemd[1]: google-guest-agent.service: Failed with result 'exit-code'.
char8 commented 1 month ago

Updated the ticket after tracing the behaviour change to #386. We've pinned ourselves to 1:20240528.00-g1 for the moment. It would seem the guest agent should regenerate the netplan configs on boot based on the interface name, but we're not seeing it do that.
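For anyone who wants to do the same on a Debian/Ubuntu image, a minimal sketch of pinning the agent with apt (the version string is the one quoted above; adjust to whatever your mirror carries):

# downgrade to the last known-good release and hold it
sudo apt-get install --allow-downgrades google-guest-agent=1:20240528.00-g1
sudo apt-mark hold google-guest-agent

# once a fixed release is published, allow upgrades again
sudo apt-mark unhold google-guest-agent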

char8 commented 1 month ago

We've further identified the reason for the network interface name change: the addition of local SSDs on the VM running the image (the VM running packer to build the image had no NVMe devices). The NVMe device takes PCIe slot 4, pushing the NIC to PCIe slot 5, which changes the name from ens4 to ens5 under systemd's predictable naming scheme.
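To see how systemd derived the name on a given VM, you can ask udev for the computed name candidates. A hedged sketch; substitute whichever interface is actually present:

# show the predictable-name candidates systemd computed for the NIC
udevadm test-builtin net_id /sys/class/net/ens5 2>/dev/null | grep '^ID_NET_NAME'

# ID_NET_NAME_SLOT reflects the PCIe slot number, which shifts
# when a local SSD occupies an earlier slot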

ChaitanyaKulkarni28 commented 1 month ago

We're looking into this issue. Does the instance have any networking, or is it completely broken?

JakeCooper commented 1 month ago

Networking is completely broken. Cannot SSH or anything

restingbull commented 1 month ago

We have the same issue.

gaughen commented 1 month ago

We have a package built from a previous stable state (with a newer version number) and are finishing up upgrade testing from the version with the issue to the previous stable package. We hope to release it shortly, and will update this bug with an ETA as soon as we have it.

mikehardenize commented 1 month ago

We also have broken networking after guest-agent updated on our system.

We're getting logs like:

Jul 17 13:19:34 app1-staging-app google_guest_agent[1683]: Rolling back systemd-networkd
Jul 17 13:19:34 app1-staging-app google_guest_agent[1683]: rolling back changes for systemd-networkd

It seems related to this:

Jul 17 12:55:30 app1-staging-app google_guest_agent[1665]: Setting up NetworkManager
Jul 17 12:55:30 app1-staging-app NetworkManager[779]: <warn>  [1721220930.0963] keyfile: guest-agent: invalid setting name 'guest-agent'
Jul 17 12:55:30 app1-staging-app s3fs[1037]: ### retrying...

We're running a Rocky 8 system and NetworkManager doesn't seem to understand the config that google-guest-agent is placing at /etc/NetworkManager/system-connections/google-guest-agent-eth0.nmconnection. The version of NetworkManager on this system is NetworkManager-1.40.16-15.el8_9.x86_64. We have a Rocky 9 system where this isn't a problem, but that one has a newer version of NetworkManager which does understand the config: NetworkManager-1.46.0-8.el9_4.x86_64
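For anyone comparing the two systems, a couple of commands show what NetworkManager actually loaded from that keyfile. A hedged sketch; the connection file path is the one mentioned above:

# show the keyfile the agent wrote (root only; the file is mode 0600)
sudo cat /etc/NetworkManager/system-connections/google-guest-agent-eth0.nmconnection

# check whether the connection was parsed and is active on the NIC
nmcli -f NAME,UUID,TYPE,DEVICE connection show
nmcli device status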

gaughen commented 1 month ago

The team will be releasing the new (old) version this morning.

ChaitanyaKulkarni28 commented 1 month ago

Thanks for bringing up the issues you're seeing. We have rolled back the changes and released another agent version, 20240716.00.

a-crate commented 1 month ago

Hi @mikehardenize, we believe this might be a different issue. Can you provide reproduction steps and describe what functionality you're expecting to work that isn't working? The log messages you posted are expected: the agent logs Rolling back X for every network management tool it is not going to configure, and it prints Setting up NetworkManager with no error log afterwards, indicating it successfully wrote the NetworkManager config files.

The NetworkManager warnings indicate that it doesn't understand the guest-agent ini key, but this is expected, and as far as we can tell the configuration file is otherwise applied and the NIC is activated and configured correctly. Is there some functionality breakage you're seeing? We can't reproduce the broken networking on Rocky Linux 8.

JakeCooper commented 1 month ago

I gotta say it is scary as heck to see pull requests like this

[Screenshot: CleanShot 2024-07-17 at 13 51 01]

We're lucky we run a fleet of canary instances and caught this early before it made it to our other instances.

I won't be dogmatic about the solution, but, given the litany of issues we've run into with Google Cloud, this is one of the many nails in the coffin of "This cannot be trusted for production infrastructure".

drewhli commented 1 month ago

For anyone still facing these issues, follow these steps to work around the problem and update the guest agent (a consolidated shell sketch of the same procedure follows the list). Similar steps can be found here.

  1. Detach the boot disk from the affected VM.
  2. Attach the detached disk to a different VM that's working.
  3. SSH into the second VM and mount the new disk to a folder. If the new disk is designated as sdb, the command may look like the following:
    sudo mkdir /mnt/data && sudo mount /dev/sdb1 /mnt/data
  4. Delete the file /mnt/data/etc/netplan/20-google-guest-agent-ethernet.yaml.
  5. Re-add the default netplan configuration file: Create a new file /mnt/data/etc/netplan/90-default.yaml, and paste the following contents inside:
    network:
        version: 2
        ethernets:
            all-en:
                match:
                    name: en*
                dhcp4: true
                dhcp4-overrides:
                    use-domains: true
                dhcp6: true
                dhcp6-overrides:
                    use-domains: true
            all-eth:
                match:
                    name: eth*
                dhcp4: true
                dhcp4-overrides:
                    use-domains: true
                dhcp6: true
                dhcp6-overrides:
                    use-domains: true
  6. Unmount the disk
    sudo umount /dev/sdb1
  7. Detach the disk from the second VM and re-attach it to the original VM. SSH and networking should be working again.
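The same workaround as a single shell sketch, run on the second (rescue) VM. This assumes the attached boot disk appears as /dev/sdb1 and that the 90-default.yaml contents are pasted from step 5 above:

# mount the affected boot disk on the rescue VM
sudo mkdir -p /mnt/data
sudo mount /dev/sdb1 /mnt/data

# remove the per-interface file the agent persisted into the image
sudo rm /mnt/data/etc/netplan/20-google-guest-agent-ethernet.yaml

# recreate the wildcard default config (paste the YAML from step 5)
sudo nano /mnt/data/etc/netplan/90-default.yaml

# unmount before detaching and re-attaching to the original VM
sudo umount /dev/sdb1
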
mikehardenize commented 1 month ago

> Hi @mikehardenize , we believe this might be a different issue. Can you provide reproduction steps and what functionality is not working that you're expecting to work?

It's a Rocky Linux 8 system. We are using WireGuard via systemd-networkd. After the upgrade of guest-agent, a default route is added for the WireGuard gateway IP, which breaks our networking. That default route wasn't added prior to the upgrade.
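To see where the unexpected route comes from, a quick check after the upgrade (a hedged sketch; wg0 is a placeholder for the WireGuard interface name):

# list every default route and which interface owns it
ip route show default

# show the link state and which .network file systemd-networkd matched
networkctl status wg0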