gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.
https://gardener.cloud
Apache License 2.0

Gardener Node Agent deletes containerd drop-in directory #10809

Open Roncossek opened 5 days ago

Roncossek commented 5 days ago

How to categorize this issue?

/area os /kind bug

What happened:

Gardener Node Agent deletes the entire containerd drop-in directory when a drop-in is removed from the OSC and the OSC no longer references that systemd unit.

Details:

We deployed a shoot with an OSC containing the following extension units in the status field:

...
status:
  extensionUnits:
  - dropIns:
    - content: |
        [Service]
        ExecStartPre=/opt/gardener/bin/containerd_cgroup_driver.sh
      name: 10-configure-cgroup-driver.conf
    filePaths:
    - /opt/gardener/bin/g_functions.sh
    - /opt/gardener/bin/containerd_cgroup_driver.sh
    name: containerd.service
  - dropIns:
    - content: |
        [Service]
        ExecStartPre=/opt/gardener/bin/kubelet_cgroup_driver.sh
      name: 10-configure-cgroup-driver.conf
    filePaths:
    - /opt/gardener/bin/g_functions.sh
    - /opt/gardener/bin/kubelet_cgroup_driver.sh
    name: kubelet.service

This results in the following files being present:

ls /etc/systemd/system/containerd.service.d

10-configure-cgroup-driver.conf  
11-exec_config.conf 
30-env_config.conf  
override.conf

Only one of these files was delivered by the OSC.

The operating system controller then updated the OSC status and removed the containerd extension unit from extensionUnits:

...
status:
  extensionUnits:
  - dropIns:
    - content: |
        [Service]
        ExecStartPre=/opt/gardener/bin/kubelet_cgroup_driver.sh
      name: 10-configure-cgroup-driver.conf
    filePaths:
    - /opt/gardener/bin/g_functions.sh
    - /opt/gardener/bin/kubelet_cgroup_driver.sh
    name: kubelet.service

As containerd.service was defined only once in the status field, removing that entry caused the entire drop-in directory to be deleted.

Even though we only intended to remove one systemd drop-in file, the code in here identifies the unit as one to be deleted, and the code here deletes the entire systemd drop-in directory.
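To illustrate the behaviour described above, here is a minimal sketch, not the actual gardener-node-agent code. It assumes the agent compares the old and new unit lists by name only and then derives the drop-in directory from the unit name. The types `Unit` and `DropIn` and the helpers `deletedUnits` and `dropInDir` are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"path"
)

// DropIn and Unit are simplified, hypothetical stand-ins for the
// entries under status.extensionUnits in the OSC.
type DropIn struct {
	Name string
}

type Unit struct {
	Name    string
	DropIns []DropIn
}

// deletedUnits returns every unit that is present in oldUnits but whose
// name no longer appears in newUnits. This is an assumed, name-based diff.
func deletedUnits(oldUnits, newUnits []Unit) []Unit {
	keep := make(map[string]struct{}, len(newUnits))
	for _, u := range newUnits {
		keep[u.Name] = struct{}{}
	}
	var deleted []Unit
	for _, u := range oldUnits {
		if _, ok := keep[u.Name]; !ok {
			deleted = append(deleted, u)
		}
	}
	return deleted
}

// dropInDir returns the systemd drop-in directory for a unit name.
func dropInDir(unitName string) string {
	return path.Join("/etc/systemd/system", unitName+".d")
}

func main() {
	dropIn := DropIn{Name: "10-configure-cgroup-driver.conf"}
	oldUnits := []Unit{
		{Name: "containerd.service", DropIns: []DropIn{dropIn}},
		{Name: "kubelet.service", DropIns: []DropIn{dropIn}},
	}
	newUnits := []Unit{
		{Name: "kubelet.service", DropIns: []DropIn{dropIn}},
	}

	for _, u := range deletedUnits(oldUnits, newUnits) {
		// Removing this whole directory would also remove drop-ins
		// that were never part of the OSC (e.g. override.conf).
		fmt.Println("unit flagged for deletion, directory:", dropInDir(u.Name))
	}
}
```

With the two OSC states from this report as input, only containerd.service is flagged, and a recursive removal of its directory would take the three vendor-provided files with it.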

What you expected to happen:

Gardener Node Agent should not delete an entire drop-in directory that contains drop-ins it never created.

How to reproduce it (as minimally and precisely as possible):

See above

Anything else we need to know?:

The garden-linux extension currently deploys a drop-in for containerd that should no longer be deployed going forward. As a result, we are removing the entire unit from status.extensionUnits, as described above.

For this bug to occur, the drop-in must have been deployed for a systemd unit delivered by the OS vendor. It does not happen with systemd units delivered by Gardener directly.

Environment:

oliver-goetz commented 4 days ago

/assign