fedora-iot / greenboot

Generic Health Checking Framework for systemd
GNU Lesser General Public License v2.1

01_repository_dns_check.sh fails if system time is incorrect. #90

Open amanning9 opened 1 year ago

amanning9 commented 1 year ago

After having quite a lot of trouble with a clean install of Fedora IoT 37 on a Raspberry Pi 4, I discovered that the repository DNS check was failing because the system time had not yet synced when greenboot-healthcheck.service ran.

I've not entirely got my head around why this might be the case. I've had problems with DNSSEC on a raspberry pi due to the time being wrong before, but I've not yet worked out if this is the problem now!

I fixed this by making greenboot wait for the correct time (see below). I'm not sure if this is the most appropriate fix, but it seems to work to allow me to boot the pi more than once, at least!

Behaviour seen:

After flashing the initial image to a Raspberry Pi, the first boot worked OK, but reported:

Script '01_repository_dns_check.sh' FAILURE (exit code '1'). Continuing...
Boot Status is RED - Health Check FAILURE!
SYSTEM is UNHEALTHY, but bootlader entry count is 1. Manual intervention necessary.

However, attempting to update the system using rpm-ostree (and rebooting to finalise this) resulted in a continuous boot loop. The system appeared to reach the operating system OK, and it was even possible to log in for a few seconds before greenboot rebooted the system. This appeared to happen indefinitely, or at least for about 10 minutes before I got bored!

Workaround:

To fix this, I enabled chrony-wait and made greenboot-healthcheck.service wait for time-sync.target:

systemctl enable chrony-wait.service

systemctl edit greenboot-healthcheck.service and add the following:

[Unit]
After=time-sync.target
Requires=time-sync.target

nullr0ute commented 1 year ago

> I've not entirely got my head around why this might be the case. I've had problems with DNSSEC on a raspberry pi due to the time being wrong before, but I've not yet worked out if this is the problem now!

DNSSEC needs the time to be set correctly so it can validate the signatures it relies on, which carry validity periods. The RPi (all variants) doesn't have an RTC by default, so the time isn't remembered across boots and you end up with a chicken-and-egg problem: DNS doesn't resolve until the time is set, and if the time source is given as an FQDN, it needs DNS to resolve that.
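One quick way to tell whether the clock is the likely culprit is to compare it against a date the system cannot plausibly predate. The following is only a minimal sketch: the function name `clock_is_plausible` and the reference timestamp are assumptions chosen for illustration, not part of greenboot.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a clock reading is believable.

# clock_is_plausible NOW_EPOCH REFERENCE_EPOCH
# Succeeds when NOW_EPOCH is not earlier than REFERENCE_EPOCH
# (both given in seconds since 1970-01-01 00:00:00 UTC).
clock_is_plausible() {
    [ "$1" -ge "$2" ]
}

# Example use. The reference below is 2024-01-01 00:00:00 UTC and is an
# arbitrary choice; pick any date known to be in the past for your image.
#   clock_is_plausible "$(date +%s)" 1704067200 \
#       || echo "system clock looks unset" >&2
```

If the check fails shortly after boot on a board without an RTC, time-dependent validation such as DNSSEC is a reasonable suspect.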

amanning9 commented 1 year ago

To be clear, I'm not certain the problem has anything to do with DNSSEC. It occurs for me on a completely clean Fedora IoT 37 install if I simply run rpm-ostree upgrade --reboot. I've not altered any resolver settings or anything else. DNSSEC was just what occurred to me as a possible reason.

I don't seem to have the chicken-and-egg problem: chrony does set the system time, given enough time. It's just that greenboot kills the system rather prematurely if it's not forced to wait until the system time is set before trying repository_dns_check.
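The idea of waiting for the clock before running a time-sensitive check can also be sketched as a polling loop with a timeout. This is an illustration, not greenboot's own mechanism: `wait_until` and `clock_is_synced` are hypothetical names, the timeout values are arbitrary, and it assumes `timedatectl show` is available (on a chrony system, `chronyc waitsync` may be an alternative).

```shell
#!/bin/sh
# Sketch: poll until a condition holds, or give up after a timeout.

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, or 1 once at least
# TIMEOUT_SECONDS have been spent sleeping between attempts.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Succeeds when systemd reports the clock as synchronized.
# Assumption: timedatectl is present and supports "show".
clock_is_synced() {
    [ "$(timedatectl show --property=NTPSynchronized --value 2>/dev/null)" = "yes" ]
}

# Example use (180 s timeout and 5 s interval are arbitrary):
#   wait_until 180 5 clock_is_synced || echo "clock still unsynchronized" >&2
```

Whether such a wait belongs inside a health check script or in unit ordering is a design choice; the thread itself only confirms the unit-ordering approach.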

pm4rcin commented 1 year ago

> DNSSEC needs the time to be set correctly so it can validate the signatures it relies on, which carry validity periods. The RPi (all variants) doesn't have an RTC by default, so the time isn't remembered across boots and you end up with a chicken-and-egg problem: DNS doesn't resolve until the time is set, and if the time source is given as an FQDN, it needs DNS to resolve that.

That's what caught me out today while I was trying to understand what was going wrong. Do you have a workaround or a solution for that problem? I'm using NTS from here, if that changes anything.

amanning9 commented 1 year ago

My solution was to make the greenboot checks wait for whatever was setting the system time.

In my case, that meant enabling chrony-wait.service (which delays time-sync.target while chrony synchronizes the clock) and then adding a systemd drop-in for greenboot-healthcheck.service so that it waits for time-sync.target to be reached before it starts. Details are at the end of the original issue. This means that if time-sync.target hasn't been reached, the health check won't run and force a reboot.
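For anyone who prefers not to go through the interactive editor, the same drop-in can be written from a script. This is a minimal sketch: the function name `write_time_sync_dropin` and the file name `10-wait-for-time-sync.conf` are made up for illustration (`systemctl edit` would normally create `override.conf` instead), and the unit contents are exactly those from the original issue.

```shell
#!/bin/sh
# Sketch: create the drop-in from the original issue without an editor.

# write_time_sync_dropin DROPIN_DIR
# DROPIN_DIR is the unit's drop-in directory, normally
# /etc/systemd/system/greenboot-healthcheck.service.d
write_time_sync_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    # The file name is a hypothetical choice; any *.conf name works.
    cat > "$dropin_dir/10-wait-for-time-sync.conf" <<'EOF'
[Unit]
After=time-sync.target
Requires=time-sync.target
EOF
}

# Typical use, as root:
#   systemctl enable chrony-wait.service
#   write_time_sync_dropin /etc/systemd/system/greenboot-healthcheck.service.d
#   systemctl daemon-reload
```

The function only writes the file; enabling chrony-wait.service and reloading systemd remain separate steps, as in the comments above.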

However, in my case, I was seeing this on a clean, just-flashed Raspberry Pi without having changed anything about how the system time was set. That was a while ago, though. If I get a chance at some point, I will test-flash a Pi again and see if I can reproduce it now.

pm4rcin commented 1 year ago

I've tried following your steps, but the clock didn't synchronize, probably because of the chicken-and-egg problem mentioned above. In the end, the system restarted after a few minutes because that script failed.

arkaitz-dev commented 1 month ago

I'm having the same problem on an RPi 4 with Fedora IoT, after a fresh install followed by an rpm-ostree upgrade. The only way to make the system reboot into the new desired state was to force it with rpm-ostree. Is there any fix for this? Should I remove the default check packages to avoid this undesired behaviour, or is this "fix" not recommended if Fedora IoT is to work properly?