Update NTP makestep for qemu

xordspar0 commented 1 year ago

In a6ed7b3a1ebe4a97febe3dbfab88222fc5c42f76, the NTP configuration for FCOS VM images running on cloud hosts was updated so that Chrony updates the system time to match NTP time immediately instead of gradually over a long period of time. That change came out of a bug report, but the underlying problem affects a variety of VM environments, not just cloud deployments.

At a minimum, the clock getting out of sync affects Podman's qemu VMs running on laptops that occasionally go to sleep, as detailed in this bug report: https://github.com/containers/podman/issues/11541

Should we make the same change for all qemu images, something like this, in coreos-platform-chrony?

  platform=$(karg ignition.platform.id)
  case "${platform}" in
-     azure|azurestack|aws|gcp) ;;  # OK, this is a platform we know how to support
+     azure|azurestack|aws|gcp|qemu) ;;  # OK, this is a platform we know how to support
      *) exit 0 ;;
  esac

  ...

  (echo "# Generated by $self - do not edit directly"
   sed -e s,'^makestep,#makestep,' -e s,'^pool,#pool,' -e s,'^leapsectz,#leapsectz,' < /etc/chrony.conf
  cat <<EOF

  # Allow the system clock step on any clock update.
  # It will avoid the time resynchronization issue when VMs are resumed from suspend.
  # See https://bugzilla.redhat.com/show_bug.cgi?id=1780165 for more information.
  makestep 1.0 -1

  EOF
  ) > "${confpath}"
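
For readers unfamiliar with the directive, the difference between a stock `makestep 1.0 3` and the generated `makestep 1.0 -1` can be sketched as a small decision rule. This is only an approximation of chrony's documented behavior (step when the offset exceeds the threshold, but only during the first `limit` clock updates, with a negative limit meaning "no limit"). The function name `would_step` and its argument order are made up for illustration and are not part of chrony or of coreos-platform-chrony.

```shell
#!/bin/sh
# Hypothetical helper approximating chrony's `makestep <threshold> <limit>` rule.
# Usage: would_step OFFSET_SECONDS THRESHOLD LIMIT UPDATE_NUMBER
# Exit status 0 means "the clock would be stepped", 1 means "it would be slewed".
would_step() {
    awk -v off="$1" -v thr="$2" -v lim="$3" -v n="$4" 'BEGIN {
        if (off < 0) off = -off              # only the magnitude of the offset matters
        allowed = (lim < 0) || (n <= lim)    # a negative limit disables the update limit
        exit !(allowed && off > thr)
    }'
}

# A VM resumes 45 seconds behind, long after chronyd's first few clock updates.
would_step 45 1.0 3 10  && echo "makestep 1.0 3:  step" || echo "makestep 1.0 3:  slew"
would_step 45 1.0 -1 10 && echo "makestep 1.0 -1: step" || echo "makestep 1.0 -1: slew"
```

Under these assumptions, the first call reports a slew (the three-update window has passed) and the second reports a step, which is the situation a resumed laptop VM ends up in.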

It's not clear to me why qemu's default RTC setting of host doesn't cover this issue, but the fact is that it doesn't (according to my experience, the experience of the people reporting the Podman bug, and others), and NTP seems to be the only reliable way to keep an FCOS VM's clock in sync.

I can't say with certainty that this is the right choice for all VMs, or even all qemu VMs, but it makes sense to me that it should be the default for VMs running on a laptop and possibly in other cases.

jlebon commented 1 year ago

I can't say with certainty that this is the right choice for all VMs, or even all qemu VMs, but it makes sense to me that it should be the default for VMs running on a laptop and possibly in other cases.

Right. The problem is that QEMU as a platform can be used in many different contexts, and there's no easy way to tell from within the guest which context it's being used in (e.g. developer's laptop vs. production). The linked RHBZ in the generator mentions that there could be compatibility and security issues with allowing steps all the time (see https://bugzilla.redhat.com/show_bug.cgi?id=1780165#c6). The platforms where we currently enable this have cloud-managed endpoints that we've agreed to trust.

It's not clear to me why qemu's default RTC setting of host doesn't cover this issue, but the fact is that it doesn't (according to my experience, the experience of the people reporting the Podman bug, and others)

That's interesting. That user post got no replies on the QEMU list, but it sounds like there may be a bug there. Using ptp_kvm would be another way to fix this (related: https://github.com/coreos/fedora-coreos-config/pull/2263), but it may not be available on ARM. The easiest workaround for podman machines would probably be for podman to enable stepping at provisioning time like we do on those cloud platforms, assuming the related concerns are deemed acceptable.
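
As a rough illustration of what "enable stepping at provisioning time" could amount to, the sketch below applies the same makestep rewrite as the generator quoted earlier, but leaves the `pool` and `leapsectz` lines alone since there is no cloud-managed time source to swap in. The function name `enable_stepping` is hypothetical, and this is not what Podman actually does; where the resulting file is written and how chronyd is pointed at it are deliberately left out, since that depends on the provisioning mechanism.

```shell
#!/bin/sh
# Hypothetical sketch: read a chrony.conf on stdin and write a copy on stdout
# whose only active makestep directive allows stepping on every clock update.
enable_stepping() {
    # Comment out any existing makestep directive, as the generator does.
    sed -e 's,^makestep,#makestep,'
    cat <<'EOF'

# Allow the system clock to step on any clock update.
# Assumption: the caller has accepted the concerns raised in
# https://bugzilla.redhat.com/show_bug.cgi?id=1780165#c6
makestep 1.0 -1
EOF
}

# Example with a two-line stand-in for /etc/chrony.conf.
printf 'pool 2.fedora.pool.ntp.org iburst\nmakestep 1.0 3\n' | enable_stepping
```

The design choice here is to keep the rest of the configuration untouched so that the guest still talks to whatever NTP sources it already had.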

xordspar0 commented 1 year ago

The linked RHBZ in the generator mentions that there could be compatibility and security issues with allowing steps all the time

Understood

That's interesting. That user post got no replies on the QEMU list, but it sounds like there may be a bug there.

Yes, this is an XKCD #979 moment. It would be really convenient if the hw clock worked reliably; customizing NTP settings for this use-case wouldn't be necessary.

Using ptp_kvm would be another way to fix this (related: https://github.com/coreos/fedora-coreos-config/pull/2263)

Ha, it's funny that someone else proposed a very similar change to solve a different problem at around the same time.

The easiest workaround for podman machines would probably be for podman to enable stepping at provisioning time like we do on those cloud platforms, assuming the related concerns are deemed acceptable.

I agree, I'll bring this up with Podman.

coreos / fedora-coreos-tracker

Update NTP makestep for qemu #1431