coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

systems without RTC could fail to pull Ignition on first boot #1624

Open jdoss opened 11 months ago

jdoss commented 11 months ago

Describe the bug

I am currently trying to get a cluster of Raspberry Pi CM4s to iPXE boot and install FCOS. The problem I am encountering is that the time on the CM4s is incorrect when they boot, because they have no RTC to keep the correct time. My Ignition endpoint uses a Let's Encrypt TLS cert that was issued two days ago, and my CM4s all think it is Sept 27 2023, so they think the certificate is not yet valid. I believe systemd is setting the time to its build date since there is no other time source available.

[  127.030708] ignition[656]: GET error: Get "https://netboot.example.biz/ign/config.ign": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2023-09-27T00:02:03Z is before 2023-12-03T
[  129.860809] ignition[653]: failed to fetch config: unable to fetch resource in time
[  129.868664] ignition[653]: failed to acquire config: unable to fetch resource in time
[  129.877137] systemd[1]: ignition-fetch.service: Main process exited, code=exited, status=1/FAILURE
[  129.886415] ignition[653]: Ignition failed: unable to fetch resource in time
[  129.893634] systemd[1]: ignition-fetch.service: Failed with result 'exit-code'.
[FAILED] Failed to start ignition-fetch.service - Ignition (fetch).
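
The failure in the log above comes down to a validity window check: the client compares its own clock against the certificate's start and end dates. A minimal Python sketch of that comparison, not Ignition's or Go's actual code. The function name is hypothetical, the time of day for the start date is a placeholder because the log line is truncated after the date, and the 90-day lifetime is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def validity_problem(now: datetime, not_before: datetime,
                     not_after: datetime) -> Optional[str]:
    """Return why `now` falls outside the validity window, or None if it is inside."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return None


# Clock value reported by the client in the log above.
boot_clock = datetime(2023, 9, 27, 0, 2, 3, tzinfo=timezone.utc)
# The log is truncated after the date, so midnight is only a placeholder.
not_before = datetime(2023, 12, 3, tzinfo=timezone.utc)
# Assumed certificate lifetime, for illustration only.
not_after = not_before + timedelta(days=90)

print(validity_problem(boot_clock, not_before, not_after))  # not yet valid
```

With a clock that starts months in the past, any recently issued certificate fails this check, no matter how healthy the server is.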

This blog post explains things pretty well and offers up a solution for a system that is already provisioned and online.

This unfortunately doesn't help me out in this case since I need the time set before Ignition runs. I can think of a few ways to fix this issue.

1) Create a custom initramfs and touch /usr/lib/clock-epoch.
2) Add support for a fixrtc-like karg, like Ubuntu has, but with the ability to feed in the time via the karg (which I can set dynamically in my iPXE script endpoint) and have FCOS set the time in the initramfs.
3) Have the initramfs sync time via chronyd before Ignition runs.

Doing number one kind of defeats the purpose of iPXE booting, as I want to use a vanilla FCOS from upstream for my workloads. Number two is a clever workaround and maybe worth exploring. Number three would require adding chronyd to the initramfs and syncing time before anything else starts, which I think is the best solution since having accurate time before Ignition runs is a good thing.

Are there any other options I am not seeing?

Reproduction steps

Do an iPXE install of FCOS on a system that has no RTC and that pulls its Ignition config from an HTTPS endpoint with a fairly new TLS cert.

Expected behavior

Time is synced before Ignition runs, so TLS certificate validation does not fail because of a wrong clock.

Actual behavior

Maintenance mode :(

System details

FCOS Stable 39.20231101.3.0 aarch64

#!ipxe

show unixtime
ntp 0.pool.ntp.org
show unixtime

set ARCH aarch64
set STREAM stable
set VERSION 39.20231101.3.0
set CONFIGURL https://netboot.example.biz/ign/config.ign

set BASEURL https://builds.coreos.fedoraproject.org/prod/streams/${STREAM}/builds/${VERSION}/${ARCH}

kernel ${BASEURL}/fedora-coreos-${VERSION}-live-kernel-${ARCH} initrd=main coreos.live.rootfs_url=${BASEURL}/fedora-coreos-${VERSION}-live-rootfs.${ARCH}.img ignition.firstboot ignition.platform.id=metal ignition.config.url=${CONFIGURL}
initrd --name main ${BASEURL}/fedora-coreos-${VERSION}-live-initramfs.${ARCH}.img

boot

Butane or Ignition config

N/A We don't even make it this far.

Additional information

@dustymabe We are this close to having iPXE working with FCOS on Raspberry Pi 4 hardware

jlebon commented 11 months ago

Did you try systemd.clock-usec?

jdoss commented 11 months ago

@jlebon \o/ my hero! That worked super great!!
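
For anyone landing here later, here is a rough Python sketch of how a boot script endpoint could generate that argument: the value is the current time in microseconds since the Unix epoch. The function name is hypothetical. The parameter spelling is taken as written in this thread; systemd's kernel command line documentation spells it with an underscore (`systemd.clock_usec=`) and, as far as I know, accepts both spellings, but check that against your systemd version.

```python
from datetime import datetime, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def clock_usec_karg(now: Optional[datetime] = None,
                    key: str = "systemd.clock-usec") -> str:
    """Render a kernel argument that carries `now` as microseconds since the epoch."""
    if now is None:
        now = datetime.now(timezone.utc)
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = now - EPOCH
    # Integer arithmetic avoids float rounding on large timestamps.
    usec = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return f"{key}={usec}"


# Append the result to the kernel line the endpoint renders for each request.
print(clock_usec_karg())
```

The clock is then only as accurate as the server that renders the script, plus however long the download and boot take, which is more than enough for a certificate validity check.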

I'd still like to hear people's thoughts on whether syncing time before fetching remote Ignition is a good idea. I think accurate time is a good thing, but maybe there are design reasons why we don't do it.

dustymabe commented 11 months ago

It's not really a problem we encounter often because most systems have an RTC. I don't really think the extra time it would add to the boot OR the added complexity of trying to figure out how to allow a user to specify what ntp servers they wanted to use during the initramfs of the first boot of a system would really be worth the effort.

Note since you are using a RPi4 I recommend using systemd-timesyncd over chrony in the real root because it has a mechanism for getting the clock back to sanity to the timestamp of a file in /var/ earlier in boot than NTP (and thus your journal log timestamps won't be as confusing).
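
The mechanism described above can be pictured as a floor on the clock: if the system time is older than the timestamp of a file saved on a previous boot, jump forward to that timestamp. A minimal Python sketch of the idea, not timesyncd's actual code. The function name is hypothetical and the stamp path is passed in by the caller (to my knowledge timesyncd keeps its file at /var/lib/systemd/timesync/clock, but treat that as an assumption).

```python
import os


def clock_floor(system_time: float, stamp_path: str) -> float:
    """Return the later of the system time and the stamp file's mtime (epoch seconds)."""
    try:
        stamp = os.stat(stamp_path).st_mtime
    except FileNotFoundError:
        # First boot, or stateless system: nothing to fall back on.
        return system_time
    return max(system_time, stamp)
```

This never moves the clock backwards, and it only helps when /var persists across boots, so it does nothing for a first boot or for a system running entirely from RAM.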

dustymabe commented 11 months ago

Another option you have is also to bake the ignition config in the initramfs so no network is needed. See https://coreos.github.io/coreos-installer/customizing-install/#creating-customized-iso-and-pxe-images

jdoss commented 11 months ago

> It's not really a problem we encounter often because most systems have an RTC. I don't really think the extra time it would add to the boot OR the added complexity of trying to figure out how to allow a user to specify what ntp servers they wanted to use during the initramfs of the first boot of a system would really be worth the effort.

I figured as much. I admit this is a rare case where an RTC is not available but not having accurate time can impact systems with an RTC that is set wrong. I was honestly kinda surprised that FCOS didn't sync its time right after it got network access.

> Note since you are using a RPi4 I recommend using systemd-timesyncd over chrony in the real root because it has a mechanism for getting the clock back to sanity to the timestamp of a file in /var/ earlier in boot than NTP (and thus your journal log timestamps won't be as confusing).

As of right now, I am running FCOS directly from RAM and I need to figure out if I want to keep these CM4s stateless or do an install. Thanks for the tip. I will look into it.

> Another option you have is also to bake the ignition config in the initramfs so no network is needed. See https://coreos.github.io/coreos-installer/customizing-install/#creating-customized-iso-and-pxe-images

I thought of that too, but I'd have to automate keeping my own mirror up to date and also the baking of an ignition into the initramfs which is a lot of work.

dustymabe commented 11 months ago

We discussed this in the community meeting today:

12:10:26  dustymabe | !agreed We don't think this issue is a high priority
                      because there aren't many systems that we target that
                      don't have an RTC. As mentioned there are systems with
                      an RTC that is wrong, but in that case it's easy to
                      remedy by setting the RTC to a correct value. We could
                      improve by giving users an ignition.config.checksum option
                      to go along with ignition.config.url, but it's still a
                      workaround and probably not worth the effort.

To illustrate the ignition.config.checksum idea mentioned above further: in a Butane config (which is transpiled to an Ignition config) you can already specify a remote resource like this:

variant: fcos
version: 1.1.0
storage:
  files:
    - path: /opt/file2
      contents:
        source: http://example.com/file2
        compression: gzip
        verification:
          hash: sha512-4ee6a9d20cc0e6c7ee187daffa6822bdef7f4cebe109eff44b235f97e45dc3d7a5bb932efc841192e46618f48a6f4f5bc0d15fd74b1038abf46bf4b4fd409f2e
      mode: 0644

and as part of that you can specify a checksum that allows Ignition to verify the artifact. So if you pull over an insecure medium you can be OK with that, because you also told Ignition what to expect the contents to look like.
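
As a rough Python sketch of what that verification amounts to (not Ignition's actual implementation): split the `<algorithm>-<hex digest>` string, hash the downloaded bytes, and compare. The function name is hypothetical, and the set of accepted algorithms is an assumption.

```python
import hashlib
import hmac

# Assumed set of accepted algorithms, for illustration only.
SUPPORTED = ("sha256", "sha512")


def matches_hash(content: bytes, spec: str) -> bool:
    """Check `content` against a spec of the form '<algorithm>-<hex digest>'."""
    algo, sep, expected = spec.partition("-")
    if not sep or algo not in SUPPORTED:
        raise ValueError(f"unsupported hash spec: {spec!r}")
    actual = hashlib.new(algo, content).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected.lower())


payload = b"example file contents\n"
spec = "sha512-" + hashlib.sha512(payload).hexdigest()
print(matches_hash(payload, spec))         # True
print(matches_hash(payload + b"x", spec))  # False
```

A check like this does not depend on the local clock at all, which is why it would sidestep the certificate validity problem.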

We don't have a mechanism like this today for ignition.config.url, but we could add one. It would let you either ignore certificate verification when pulling over TLS/HTTPS or pull over plain HTTP, and still trust the result.

That is something we could do (or a community member could do), but is probably low priority.

purpleidea commented 9 months ago

When PXE booting and using kickstart with an x86_64 Fedora 38, on an old 64 bit, but pre-UEFI Core 2 Duo machine, I eventually get this error when trying to run the package transaction.

XXX... does not verify ... BAD

Could this be caused by a bad time as well? The RTC definitely has a bad battery. I came across this issue and it all clicked!

dustymabe commented 9 months ago

hey @purpleidea (long time no see).

I think your comment is off topic as this is the Fedora CoreOS issue tracker and you aren't using Fedora CoreOS.

dustymabe commented 9 months ago

but yes, maybe see https://bugzilla.redhat.com/show_bug.cgi?id=2242759 ?

purpleidea commented 9 months ago

> I think your comment is off topic as this is the Fedora CoreOS issue tracker and you aren't using Fedora CoreOS.

Woops, my bad. Thanks for the link. Hope all is well with you. I'll be at FOSDEM or online in the usual places in case you want to catch up.