bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Fail to get ValidLinuxHostname with ipv6 link-local address #3571

Closed gregular closed 10 months ago

gregular commented 10 months ago

Image I'm using: I'm building a custom 1.15.1 metal-dev variant with additional drivers enabled, but I suspect this is a core issue and not specific to my build. I've seen it with builds of previous versions as well, but I believe it has only ever happened on networkd-based builds, not wicked-based ones.

What I expected to happen: On boot, I expect the initial settings, specifically the hostname, to be generated correctly.

What actually happened: sundog[1606]: Error deserializing hashMap to Settings: Error deserializing scaler value: Unable to deserialize into ValidLinuxHostname: 'fe80::dea6:32ff:fea9:513c' must only be [0-9a-z.-], and 1-253 chars long

How to reproduce the problem: It doesn't happen on every initial build/boot, but fairly regularly I can't generate initial settings on a clean build. I just boot and get this error, forward progress halts, and my config settings seem to be corrupted from then on, so I can't boot. Occasionally I can reflash or purge the config directory and beat the race condition, and then I get a hostname based on a valid IPv4 address. I don't currently have IPv6 turned on for this LAN segment, so I've never seen it fail with a non-link-local IPv6 address.
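For illustration, the rule quoted in the error message (only [0-9a-z.-], 1-253 characters) can be sketched as a small standalone check. The function name is hypothetical and this is not the actual ValidLinuxHostname implementation, which may enforce more than the message states; it only shows why an address containing colons is rejected.

```rust
/// Hypothetical stand-in for the ValidLinuxHostname check. The rule is taken
/// only from the error message: characters in [0-9a-z.-], length 1-253.
fn is_valid_linux_hostname(candidate: &str) -> bool {
    // Length is counted in bytes; every allowed character is one ASCII byte.
    let length_ok = (1..=253).contains(&candidate.len());
    let chars_ok = candidate
        .bytes()
        .all(|b| b.is_ascii_digit() || b.is_ascii_lowercase() || b == b'.' || b == b'-');
    length_ok && chars_ok
}
```

Under this rule the link-local address from the log fails because of its colons, while a dotted IPv4 string such as 192.168.100.42 passes.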

gthao313 commented 10 months ago

@gregular Thanks for opening! We are looking at this issue.

gregular commented 10 months ago

Not sure if this is helpful, but on a system on the same network that already has a hostname set (from a prior boot), this is what I get when I run the generator via sheltie:

bash-5.1# netdog generate-hostname
Reverse DNS lookup failed: failed to lookup address information: Temporary failure in name resolution
"fe80::5c:a1ff:feab:1e00"

I have a valid IPv4 address on the main interface, but netdog gives me the IPv6 link-local address here too.

zmrow commented 10 months ago

Thanks for the issue report @gregular !

The hostname is generated on first boot by (as you've found) the setting generator netdog generate-hostname. The way this should work is as follows:

I'm wondering if the link doesn't have a DHCP-vended address at the time we query it, which is why we end up with the link-local address.

You mentioned you're using metal-dev. Can you share the net.toml you're using and how you're running the image (qemu/metal, etc)? Depending on how that is set up, the system may not be waiting for the link to get an address before moving on.

gregular commented 10 months ago

The networking scenario I am using doesn't have a net.toml and is just using the default eth0 interface as defined on the kernel command line. However, what I speculate I'm running into here is a scenario where eth0 doesn't "plug" for an extended period of time (an example would be a USB device that is plugged in later) and so systemd-networkd actually times out and the rest of the system attempts to come up. In that scenario it looks like /var/lib/netdog/current_ip is getting an IPv6 link-local address.

So perhaps this issue is specific to me. I'm still curious, though: if an interface never acquires an IPv4 address at all (say, on an IPv6-only network segment), shouldn't Bottlerocket still handle it? Whether the address is link-local or a routable IPv6 address, the regex for ValidLinuxHostname doesn't appear to allow IPv6 addresses. The hostname generation algorithm should be able to generate a fallback from an IPv6 address, as it does for an IPv4 address.
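As a rough sketch of the distinction discussed here, the snippet below detects an IPv6 link-local address (fe80::/10) and prefers any other candidate when one exists. Both function names and the selection policy are assumptions made for illustration; this is not how netdog chooses the address it writes to /var/lib/netdog/current_ip.

```rust
use std::net::IpAddr;

/// Returns true when `addr` is an IPv6 link-local unicast address (fe80::/10).
/// The prefix is checked by hand so the sketch does not depend on newer std helpers.
fn is_ipv6_link_local(addr: &IpAddr) -> bool {
    match addr {
        IpAddr::V6(v6) => (v6.segments()[0] & 0xffc0) == 0xfe80,
        IpAddr::V4(_) => false,
    }
}

/// Hypothetical policy: use the first address that is not IPv6 link-local,
/// and fall back to the first address of any kind if nothing better exists.
fn pick_hostname_address(candidates: &[IpAddr]) -> Option<IpAddr> {
    candidates
        .iter()
        .find(|addr| !is_ipv6_link_local(addr))
        .or_else(|| candidates.first())
        .copied()
}
```

With only a link-local address available, the fallback still returns it, so the hostname generator would still need to turn that address into something the hostname rule accepts.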

gregular commented 10 months ago

OK, I think I have chased this down to the timeout issue described above. The easiest way to reproduce it is to allow a link to come up but keep dhcpd turned off on the network until systemd-networkd times out (I think the other way would be to let DHCPv6 complete but DHCPv4 fail). In that case netdog grabs the IPv6 link-local address and writes it to /var/lib/netdog/current_ip, and then the system won't boot past hostname generation, even though DHCP might succeed later.

I am going to work around this issue with this patch:

diff --git a/sources/api/netdog/src/cli/generate_hostname.rs b/sources/api/netdog/src/cli/generate_hostname.rs
index ddbd8f6c..91a0d77f 100644
--- a/sources/api/netdog/src/cli/generate_hostname.rs
+++ b/sources/api/netdog/src/cli/generate_hostname.rs
@@ -58,7 +58,7 @@ pub(crate) async fn run() -> Result<()> {
         hostname
     }
     // If no hostname has been determined we return the IP address of the host.
-    .unwrap_or(ip_string);
+    .unwrap_or(ip_string.replace(".","-").replace("::","-").replace(":","-"));

     // sundog expects JSON-serialized output
     print_json(hostname)

This seems like a bug in netdog to me. As a nice side effect, my hostnames now go from all being 192 to something better, like 192-168-100-42.
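To make the effect of the patch easier to see in isolation, here is the same replacement chain pulled out into a free function. The function name is hypothetical; the real code applies the chain inline to ip_string inside generate_hostname.rs.

```rust
/// Same replacement chain as the patch above, extracted for illustration.
/// Dots, "::" runs, and single colons all become hyphens.
fn hostname_from_ip_string(ip_string: &str) -> String {
    ip_string
        .replace(".", "-")
        .replace("::", "-")
        .replace(":", "-")
}
```

Applied to fe80::dea6:32ff:fea9:513c this yields fe80-dea6-32ff-fea9-513c, and 192.168.100.42 becomes 192-168-100-42. One edge case worth noting: an address that starts with "::" (for example ::1) produces a leading hyphen, which the character rule quoted in the error message permits but which may not be desirable in a hostname.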

zmrow commented 10 months ago

@gregular I agree with you, and the patch seems reasonable, though I might argue for leaving the dots "." rather than replacing them with dashes "-". Replacing the colons ":" is the right thing to do, however.

Would you be interested in contributing this fix? If not, I'm happy to integrate something similar.

gregular commented 10 months ago

Sure, I'll spin up a pull request with a test case shortly and see how things go. The reason I like replacing the "." in the IP address is an earlier change that truncates the hostname from the full IP to just the prefix if it is not resolvable. As mentioned, I have a bunch of machines on the network that all come up with the hostname 192 because my IPv4 prefix is 192.
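The interaction described here can be modeled with a short sketch, assuming the truncation keeps only the text before the first dot (as one would when shortening a fully qualified name). The function name is hypothetical and the actual truncation logic in netdog may differ.

```rust
/// Hypothetical model of the truncation: keep only the first dot-separated label.
fn first_label(name: &str) -> &str {
    // `split` always yields at least one item, so the fallback is never taken.
    name.split('.').next().unwrap_or(name)
}
```

Under that assumption a dotted fallback such as 192.168.100.42 collapses to 192, while the dashed form 192-168-100-42 contains no dots and survives intact, which is why every host on a 192.x network would otherwise end up with the same name.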