coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

cloud-config locksmith window creates wrong config #2463

Closed dabeck closed 5 years ago

dabeck commented 6 years ago

Issue Report

Bug

When cloud-config is used to configure a locksmith maintenance window, the generated locksmithd config does not take effect.

#cloud-config
coreos:
  locksmith:
    window-start: "23:00"
    window-length: "2h"

Container Linux Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1745.5.0
VERSION_ID=1745.5.0
BUILD_ID=2018-05-31-0701
PRETTY_NAME="Container Linux by CoreOS 1745.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

OpenStack

Expected Behavior

The created locksmithd config at /run/systemd/system/locksmithd.service.d/20-cloudinit.conf should actually contain:

[Service]
Environment="LOCKSMITHD_REBOOT_WINDOW_START=23:00"
Environment="LOCKSMITHD_REBOOT_WINDOW_LENGTH=2h"

which results in:

core@dev4 ~ $ systemctl status locksmithd
Jun 19 09:08:52 dev4.novalocal systemd[1]: Started Cluster reboot manager.
Jun 19 09:08:52 dev4.novalocal locksmithd[4190]: Reboot window start is "23:00" and length is "2h"
Jun 19 09:08:52 dev4.novalocal locksmithd[4190]: Next window begins at 2018-06-19 23:00:00 +0000 UTC and ends at 2018-06-20 01:00:00 +0
Jun 19 09:08:52 dev4.novalocal locksmithd[4190]: locksmithd starting currentOperation="UPDATE_STATUS_UPDATED_NEED_REBOOT" strategy="reb
Jun 19 09:08:52 dev4.novalocal locksmithd[4190]: Waiting for 13h51m7.099641433s to reboot.

Actual Behavior

The created locksmithd config at /run/systemd/system/locksmithd.service.d/20-cloudinit.conf actually contains:

[Service]
Environment="REBOOT_WINDOW_START=23:00"
Environment="REBOOT_WINDOW_LENGTH=2h"

which results in:

core@dev4 ~ $ systemctl status locksmithd
Jun 19 08:56:52 localhost systemd[1]: Started Cluster reboot manager.
Jun 19 08:56:52 dev4.novalocal locksmithd[813]: No configured reboot window
Jun 19 08:56:52 dev4.novalocal locksmithd[813]: locksmithd starting currentOperation="UPDATE_STATUS_IDLE" strategy="reboot"

Reproduction Steps

  1. Set up a new instance using cloud-config as user-data
  2. Run journalctl -u locksmithd

lucab commented 6 years ago

/cc @rfairley

rfairley commented 6 years ago

Will investigate this week, hopefully starting tomorrow.

rfairley commented 6 years ago

Sorry for the delay in getting back to you.

I reproduced the same behaviour using QEMU on CL version 1814.0.0. There may be a quick fix to change the names of the REBOOT_WINDOW_.* environment variables that cloudinit assigns. Trying this now, will update later today.

For now, a quick workaround is to run the following at startup:

sudo sed -i -e 's/REBOOT_WINDOW_/LOCKSMITHD_REBOOT_WINDOW_/g' /run/systemd/system/locksmithd.service.d/20-cloudinit.conf
sudo systemctl daemon-reload
sudo systemctl restart locksmithd

After running the above commands, systemctl status locksmithd should show the expected output.

dabeck commented 6 years ago

Thank you. Maybe I'll try the workaround tomorrow, but I have no urgent need to get this to work.

rfairley commented 6 years ago

To update:

The REBOOT_WINDOW_* variables are still valid without the LOCKSMITHD_ prefix, so the variable names are not the problem. Therefore, only the following is necessary for the workaround:

(on startup)

sudo systemctl daemon-reload
sudo systemctl restart locksmithd

Rebooting the system after the cloud-config is also a workaround.
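
The restart-on-startup workaround could be automated with a small script. The following is only a minimal sketch under stated assumptions: the helper names and the default 30-second timeout are hypothetical, the drop-in path is the one from the report, and how the script is launched at boot is left open.

```shell
#!/bin/sh
# Sketch only: helper names and the default timeout are assumptions,
# not part of locksmith or coreos-cloudinit.

DROPIN=/run/systemd/system/locksmithd.service.d/20-cloudinit.conf

# Succeeds if the given drop-in file defines a reboot window start,
# with or without the LOCKSMITHD_ prefix.
has_reboot_window() {
    grep -Eq '^Environment="?(LOCKSMITHD_)?REBOOT_WINDOW_START=' "$1" 2>/dev/null
}

# Polls once per second until the drop-in defines a window
# or the number of tries (second argument, default 30) is used up.
wait_for_window() {
    file=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if has_reboot_window "$file"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Reloads unit files first so systemd sees the drop-in, then restarts locksmithd.
apply_window() {
    systemctl daemon-reload && systemctl restart locksmithd
}
```

Run as root, this might be used as wait_for_window "$DROPIN" 60 && apply_window. Reloading before restarting is a deliberate choice here, so that the restart picks up whatever drop-in is on disk at that point.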

dabeck commented 6 years ago

You are right! My bad. Then maybe the cause is that cloud-init runs after locksmithd has already started on first boot?

rfairley commented 6 years ago

Yes. It seems that on first boot there is a race: cloud-init parses the YAML (cloud-config) file to create the 20-cloudinit.conf file, but locksmithd reads the file too soon, before the environment variables are written. This could be because locksmithd starts too soon, or because some signal is needed to tell locksmithd when to read the file. Looking into this.
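
One way to check this hypothesis on an affected machine is to compare the drop-in on disk with the environment the running locksmithd process actually received. This is a rough sketch, not a verified diagnosis: the helper names are hypothetical, and it assumes a systemd recent enough to support the --value option of systemctl show.

```shell
#!/bin/sh
# Sketch only: helper names are assumptions made for illustration.

# Succeeds if a NUL-separated environ file (as found at /proc/<pid>/environ)
# contains a reboot window start variable, with or without the LOCKSMITHD_ prefix.
environ_has_window() {
    tr '\0' '\n' < "$1" | grep -Eq '^(LOCKSMITHD_)?REBOOT_WINDOW_START='
}

# Prints the main PID of locksmithd as reported by systemd.
locksmithd_pid() {
    systemctl show locksmithd --property=MainPID --value
}
```

Run as root, environ_has_window "/proc/$(locksmithd_pid)/environ" reports through its exit status whether the running process has the variable. If 20-cloudinit.conf contains the variables but the process environment does not, that would be consistent with locksmithd having started before the file was written.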

rfairley commented 6 years ago

Just to note on the env vars: only the REBOOT_WINDOW_* variables are also supported without the LOCKSMITHD_ prefix. This is confusing, as the docs only mention the prefixed version: https://github.com/coreos/locksmith/blob/master/README.md#reboot-windows.

dm0- commented 6 years ago

Since it hasn't been mentioned yet: cloud-config has been superseded by Ignition for provisioning, which fixes races such as this. The Container Linux provisioning documentation is here: https://coreos.com/os/docs/latest/provisioning.html

dabeck commented 6 years ago

@dm0- I've already seen this but my configuration with a custom OpenStack provider doesn't allow anything besides cloud-init. At least I didn't get it to work.

dm0- commented 6 years ago

Ignition configuration should be usable through user-data in the same way as cloud-config: https://coreos.com/os/docs/latest/booting-on-openstack.html

If your OpenStack provider is what prevents you from using that configuration format, there's not much we can do, but you could open an Ignition bug if the platform is passing the config and it's still not working.

rfairley commented 6 years ago

Confirmed that the reboot window for locksmithd is configured correctly using Ignition (https://coreos.com/os/docs/latest/update-strategies.html#auto-updates-with-a-maintenance-window).

Using Ignition, rebooting, or restarting locksmithd on startup are all viable workarounds.
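
For the Ignition route, the reboot window would typically be written as a Container Linux Config and then transpiled to Ignition JSON. The sketch below only writes such a config to a file. It assumes the locksmith section syntax described on the update-strategies page linked above, reuses the window values from the original cloud-config, and sets reboot_strategy to "reboot" to match the strategy shown in the logs. The function name and output path are hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an assumption. The YAML keys follow the
# Container Linux Config locksmith section as understood from the linked docs.

# Writes a Container Linux Config with the same reboot window
# as the cloud-config in the original report.
write_locksmith_clc() {
    cat > "$1" <<'EOF'
locksmith:
  reboot_strategy: "reboot"
  window_start: "23:00"
  window_length: "2h"
EOF
}
```

Calling write_locksmith_clc ./locksmith.yaml produces the file. It still has to be converted to an Ignition config with the config transpiler and passed as user-data.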

bgilbert commented 5 years ago

Thank you for reporting this issue. Unfortunately, we don't think we'll end up addressing it in Container Linux.

We're now working on Fedora CoreOS, the successor to Container Linux, and we expect most major development to occur there instead. Meanwhile, Container Linux will be fully maintained into 2020 but won't see many new features. We appreciate your taking the time to report this issue and we're sorry that we won't be able to address it.