Open arnegroskurth opened 1 year ago
Your Ignition config in the log looks okay to me. coreos/bugs#2026 was caused by the system timing out waiting for Ignition, but this issue is different; Ignition times out waiting for the newly-created var
partition to show up. It might be a kernel or udev issue. It's suggestive that you don't see this every time.
Is there a previous FCOS version where this didn't occur?
It's hard for me to give a really robust answer to this, but I do not recall seeing boot failures of the VMs that often six months ago. I'm currently testing with 37.20230110.3.1, but it might take a while until this reproduces.
Do you think the 90s timeout is reasonable here, or might it be wise to increase or remove it altogether? The MR for the mentioned issue removed another timeout, which might also be reasonable in this case, since how long this step takes depends pretty much on the environment (correct?).
In that other issue, filesystem creation was known to take a long time, so it didn't make sense to have a short timeout. In this case we're just waiting for a partition reprobe to propagate through udev. It's theoretically possible that a longer timeout would fix this, but I suspect that the device node wouldn't show up no matter how long we waited.
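To make the trade-off concrete, here is a minimal sketch of what "waiting for a device node with a timeout" amounts to. It is not Ignition's actual implementation; the function name `wait_for_device` and the polling interval are hypothetical, and only the 90s default is taken from this thread. If the node never appears because of a kernel or udev problem, raising the timeout only delays the failure.

```python
import os
import time


def wait_for_device(path, timeout=90.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds elapse.

    Hypothetical helper for illustration only; it does not reflect
    how Ignition itself waits for devices.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True   # device node showed up in time
        if time.monotonic() >= deadline:
            return False  # gave up; a longer timeout only helps if the node is merely slow
        time.sleep(interval)
```

For example, `wait_for_device("/dev/disk/by-partlabel/var")` would return `False` after 90 seconds if udev never creates the symlink.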
Moving this to the FCOS tracker on suspicion of a kernel/udev problem.
Unfortunately, this still reproduces in our environment, now using 37.20230218.3.0. It seems to happen about one in ten times a host is provisioned.
Do you have any hints on how to track this down further?
Still reproduces with 38.20230709.3.0
Did you ever find a fix for this problem?
We noticed a backward-compatibility break in XFS, released about two years ago, that causes attempts to format partitions smaller than 300MB to fail. The partitions in the test scenarios of our CI pipeline were below that threshold. Since we increased the size, the failure has not happened again.
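For anyone who wants to check their own config for the same pitfall, a rough sketch of a pre-flight check follows. It assumes an Ignition v3-style config loaded as a dict (`storage.disks[].partitions[]` with `label` and `sizeMiB`, `storage.luks[]`, `storage.filesystems[]`), assumes partitions are referenced via `/dev/disk/by-partlabel/<label>`, and takes the 300 threshold from the comment above, treating MB and MiB as the same for simplicity. The function name is hypothetical, and LUKS header overhead is ignored.

```python
XFS_MIN_MIB = 300  # assumed threshold, taken from the comment above


def undersized_xfs_partitions(config, min_mib=XFS_MIN_MIB):
    """Return labels of partitions below min_mib that back an XFS filesystem."""
    storage = config.get("storage", {})
    # Map each LUKS mapper device back to the device it is created on.
    luks_backing = {
        "/dev/mapper/" + entry["name"]: entry.get("device")
        for entry in storage.get("luks", [])
    }
    # Collect the underlying devices of all XFS filesystems.
    xfs_devices = set()
    for fs in storage.get("filesystems", []):
        if fs.get("format") == "xfs":
            dev = fs.get("device")
            xfs_devices.add(luks_backing.get(dev, dev))
    flagged = []
    for disk in storage.get("disks", []):
        for part in disk.get("partitions", []):
            label = part.get("label")
            size = part.get("sizeMiB") or 0  # 0 or missing: fill remaining space
            by_label = "/dev/disk/by-partlabel/" + str(label)
            if by_label in xfs_devices and 0 < size < min_mib:
                flagged.append(label)
    return flagged
```

A config with a 256 MiB `var` partition, wrapped in LUKS and formatted as XFS, would be flagged; the same config with a 1024 MiB partition would not.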
Okay, thanks for your feedback.
Bug
We get occasional boot failures due to a device timeout on createLuks:
This might be related to https://github.com/coreos/bugs/issues/2026, as it appears to be the same 90s timeout. However, I'm not sure how to increase or deactivate that timeout, since the failing step is not the mounting step but the LUKS volume creation.
Operating System Version
Fedora CoreOS 37.20221211.3.0
Ignition Version
2.14.0
Environment
Running CoreOS as .ova VM template on ESXi / vSphere 7.0.3
Other Information
vm-logs.txt