Open Pluies opened 10 months ago
Update: the Instance Connect endpoint didn't work out.
Also, I didn't mention it in the original report, but in case it helps, we're provisioning m7g.8xlarge instances.
I've managed to get some logs 🥳 I took snapshots from the EBS volumes of the broken instance, recreated volumes from them, attached them to a new instance, and we're off to the races. 👍
The systemd unit that failed isn't very helpful log-wise:
[ec2-user@ip-10-0-155-70 vol1]$ sudo journalctl --root /vol1/ -u systemd-networkd-wait-online.service
Jan 23 10:47:53 localhost systemd[1]: Starting Wait for Network to be Configured...
Jan 23 10:49:53 localhost systemd-networkd-wait-online[2582]: Timeout occurred while waiting for network connectivity.
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 10:49:53 localhost systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Jan 23 10:49:53 localhost systemd[1]: Failed to start Wait for Network to be Configured.
I'm attaching the full journalctl output from boot: journalctl.log
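The journal excerpt above is short, but the gap between its first and second timestamps is worth a quick check. A minimal sketch, assuming the year 2024 (taken from elsewhere in the thread, since the short journal format omits it):

```python
from datetime import datetime

# Timestamps copied from the systemd-networkd-wait-online journal excerpt.
started = "Jan 23 10:47:53"
timed_out = "Jan 23 10:49:53"


def parse(ts: str) -> datetime:
    # Assumption: the year is 2024; the journal's short format does not print it.
    return datetime.strptime(f"2024 {ts}", "%Y %b %d %H:%M:%S")


elapsed = (parse(timed_out) - parse(started)).total_seconds()
print(elapsed)  # 120.0
```

The 120 seconds match the documented default of the `--timeout` option of systemd-networkd-wait-online, which suggests the unit waited out its full timeout without the network ever coming up, not that it hit an early error.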
Hello @Pluies, thanks for cutting this issue! These failures are interesting and we are going to take a look. Thanks for attaching the journalctl.log, which has some useful errors to dive into:
Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, qid 2, index 13. 127850 msecs since last interrupt, 127850 msecs since last napi execution, napi scheduled: 0
We'll update the issue when we know more. In the meantime, can you confirm how many instances you have seen this happen on and when it first happened? Did you ever see this type of failure on previous Bottlerocket releases?
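For anyone scanning their own journals for the same ENA message, here is a small sketch that pulls the numeric fields out of the kernel line quoted above. The regular expression is written against that single line only, so treat it as an assumption about the message format, not a description of the driver. The year 2024 is also an assumption carried over from the thread.

```python
import re
from datetime import datetime, timedelta

line = (
    "Jan 23 10:50:01 localhost kernel: ena 0000:00:05.0 eth0: TX hasn't completed, "
    "qid 2, index 13. 127850 msecs since last interrupt, "
    "127850 msecs since last napi execution, napi scheduled: 0"
)

# Pattern derived from the one quoted line; other driver versions may word it differently.
PATTERN = re.compile(
    r"ena (?P<pci>\S+) (?P<iface>\S+): TX hasn't completed, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<since_irq_ms>\d+) msecs since last interrupt, "
    r"(?P<since_napi_ms>\d+) msecs since last napi execution, "
    r"napi scheduled: (?P<napi_scheduled>\d+)"
)


def parse_ena_tx_stall(text):
    """Return the fields of an ENA 'TX hasn't completed' line, or None if it doesn't match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("qid", "index", "since_irq_ms", "since_napi_ms", "napi_scheduled"):
        fields[key] = int(fields[key])
    return fields


stall = parse_ena_tx_stall(line)
print(stall["iface"], stall["qid"], stall["since_irq_ms"] / 1000)  # eth0 2 127.85

# Back-date the last interrupt from the log timestamp (second resolution only).
logged_at = datetime(2024, 1, 23, 10, 50, 1)
last_irq = logged_at - timedelta(milliseconds=stall["since_irq_ms"])
print(last_irq.time())  # 10:47:53.150000
```

The back-dated time lands at roughly 10:47:53, about when the wait-online unit started in the journal excerpt earlier in the thread. That coincidence alone does not establish a cause, but it is consistent with the interface never receiving an interrupt after it came up.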
Hi @yeazelm , thanks for having a look! :)
In the meantime, can you confirm how many instances you have seen this happen on and when it first happened?
Sure! This has happened 7 times in the past 24hrs:
{"level":"DEBUG","time":"2024-01-23T11:02:44.635Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-qshsb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T10:46:45.896Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-6n8nn","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T09:26:41.677Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-4k6tm","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T04:36:35.005Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-rsdc7","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T02:15:55.465Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bbnpz","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-23T00:28:45.271Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-bfllb","nodepool":"trino","ttl":"15m0s"}
{"level":"DEBUG","time":"2024-01-22T21:38:18.128Z","logger":"controller.nodeclaim.lifecycle","message":"terminating due to registration ttl","commit":"50f60fb","nodeclaim":"trino-jzzzh","nodepool":"trino","ttl":"15m0s"}
In the same timespan we provisioned 1028 machines.
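To put the frequency in perspective, a quick back-of-the-envelope calculation from the numbers above. It assumes the 1028 provisioned machines include the 7 failed ones, which the thread does not state explicitly.

```python
# NodeClaim names copied from the seven Karpenter log lines above.
failed_nodeclaims = [
    "trino-qshsb",
    "trino-6n8nn",
    "trino-4k6tm",
    "trino-rsdc7",
    "trino-bbnpz",
    "trino-bfllb",
    "trino-jzzzh",
]
provisioned = 1028  # machines provisioned in the same timespan

failures = len(failed_nodeclaims)
rate = failures / provisioned
print(f"{failures}/{provisioned} = {rate:.2%}")  # 7/1028 = 0.68%
```

So fewer than 1 in 100 launches were affected, which fits the intermittent behaviour described in the report.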
Did you ever see this type of failure on previous Bottlerocket releases?
Looking back, similar errors have happened in the past at much lower rates and we didn't notice it. I can't confirm whether this was the exact same issue, but we've had the same "terminating due to registration ttl" Karpenter message...
That's it for the Oct 23rd - Jan 23rd period (we keep 90 days of logs).
Thanks for the quick update @Pluies ! That frequency is super helpful in trying to find what might be the cause.
I'm not sure whether this is just a result of the above or a contributing factor, but we're seeing the following error message in the AWS dashboard:
Instance status checks
Instance reachability check failed
Check failure at
2024/01/23 11:50 GMT+1 (1 day)
I'm now going to kill the node above (it's not cheap 😅), which should be fine given we have the volumes + logs.
The issue hasn't reappeared for the past 24h.
@Pluies can you please tell me the region you are using?
@shaharitzko we're running in eu-west-1 (Ireland), this specific instance was in euw1-az3
@Pluies Could you cut a ticket to AWS support and include the instance ids of the failures you know about and refer to this issue? I think that might help us dig in more.
Hi @yeazelm , I just opened a ticket with the machine IDs from yesterday, hopefully this will shed some light on the issue!
Update from AWS:
Just as a precaution I have opened an internal investigation into the Bottleneck AMI related network error with the EKS service team and they are looking into a possible kernel bug. Please let me know if you see any recurrences of the same problem.
We've not seen this happen again since Jan 26th fwiw, I'll update this thread with results of the ticket above. 👍
Just a quick update. I did hear from that internal investigation that they are still investigating, but I wanted to confirm with @Pluies if you are still experiencing this or if it stopped after Jan 26th?
The issue happens much less often, but we have seen it again on:
Same issue on EKS 1.27, Bottlerocket version 1.23.0.
Userdata:
[settings]
[settings.kubernetes]
api-server = '<<>>'
cluster-certificate = '<<>>'
cluster-name = '<<>>'
kube-api-qps = 30
image-gc-high-threshold-percent = 80
image-gc-low-threshold-percent = 70
[settings.kubernetes.node-labels]
dedicated = 'pipeline-default'
'karpenter.sh/capacity-type' = 'on-demand'
'karpenter.sh/provisioner-name' = 'pipeline-default'
[settings.kubernetes.node-taints]
pipeline-default = [':NoSchedule']
[settings.kubernetes.eviction-hard]
'memory.available' = '10%'
[settings.kubernetes.system-reserved]
cpu = '100m'
ephemeral-storage = '1Gi'
memory = '100Mi'
Logs:
Booting `Bottlerocket OS 1.23.0'
Welcome to Bottlerocket OS 1.23.0 (aws-k8s-1.27)!
[ OK ] Created slice Slice /system/modprobe.
[ OK ] Created slice User and Session Slice.
Expecting device /dev/disk/by-partlabel/BOTTLEROCKET-DATA...
Expecting device /dev/disk/by-partlabel/BOTTLEROCKET-PRIVATE...
Expecting device /dev/disk/by-partu…8df-28b8-485c-9d19-362263b5944c...
Expecting device /dev/disk/by-partu…874-417d-4e26-a764-7885f22007ea...
[ OK ] Reached target Path Units.
[ OK ] Reached target Slice Units.
[ OK ] Reached target Swaps.
[ OK ] Listening on Journal Audit Socket.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on udev Control Socket.
[ OK ] Listening on udev Kernel Socket.
Mounting Huge Pages File System...
Mounting POSIX Message Queue File System...
Mounting CNI Configuration Directory (/etc/cni)...
Mounting Kernel Debug File System...
Mounting Kernel Trace File System...
Mounting Temporary Directory /tmp...
Starting Load audit rules...
Starting Checks and marks if boot has ever succeeded before...
Starting Create List of Static Device Nodes...
Starting Load Kernel Module configfs...
Starting Load Kernel Module drm...
Starting Load Kernel Module efi_pstore...
Starting Load Kernel Module fuse...
Starting Prepare Boot Directory (/boot)...
Starting Copy SELinux policy files...
Starting Journal Service...
Starting Load Kernel Modules...
Starting Generate network units from Kernel command line...
Starting Remount Root and Kernel File Systems...
Starting Coldplug All udev Devices...
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Mounted CNI Configuration Directory (/etc/cni).
[ OK ] Mounted Kernel Debug File System.
[ OK ] Mounted Kernel Trace File System.
[ OK ] Mounted Temporary Directory /tmp.
[ OK ] Finished Load audit rules.
[ OK ] Finished Checks and marks if boot has ever succeeded before.
[ OK ] Finished Create List of Static Device Nodes.
[ OK ] Finished Load Kernel Module configfs.
[ OK ] Finished Load Kernel Module efi_pstore.
[ OK ] Finished Load Kernel Module fuse.
[ OK ] Finished Generate network units from Kernel command line.
[ OK ] Finished Remount Root and Kernel File Systems.
14:13:42 [INFO] Mounting /dev/nvme0n1p3 in /boot
Mounting FUSE Control File System...
Mounting Kernel Configuration File System...
[ OK ] Finished Coldplug All udev Devices.
[ OK ] Finished Prepare Boot Directory (/boot).
[ OK ] Finished Copy SELinux policy files.
[ OK ] Mounted FUSE Control File System.
[ OK ] Mounted Kernel Configuration File System.
Mounting Containerd Configuration Directory (/etc/containerd)...
Mounting Host containers Configurat…irectory (/etc/host-containers)...
Mounting Kubernetes PKI private dir…y (/etc/kubernetes/pki/private)...
Mounting AWS configuration directory (/root/.aws)...
Mounting Ephemeral netdog configuration directory...
Starting Create System Users...
[ OK ] Started Journal Service.
[ OK ] Finished Load Kernel Module drm.
[ OK ] Finished Load Kernel Modules.
[ OK ] Mounted Containerd Configuration Directory (/etc/containerd).
[ OK ] Mounted Host containers Configurati… Directory (/etc/host-containers).
[ OK ] Mounted Kubernetes PKI private dire…ory (/etc/kubernetes/pki/private).
[ OK ] Mounted AWS configuration directory (/root/.aws).
[ OK ] Mounted Ephemeral netdog configuration directory.
[ OK ] Finished Create System Users.
Starting Apply Kernel Variables...
Starting Create Static Device Nodes in /dev...
[ OK ] Finished Apply Kernel Variables.
[ OK ] Finished Create Static Device Nodes in /dev.
[ OK ] Reached target Preparation for Local File Systems.
Starting Rule-based Manager for Device Events and Files...
[ OK ] Started Rule-based Manager for Device Events and Files.
[* ] (1 of 4) A start job is running for…-a764-7885f22007ea (2s / 1min 30s)
[...]
[ *] (3 of 4) A start job is running for…362263b5944c (1min 29s / 1min 30s)
[ TIME ] Timed out waiting for device /dev/disk/by-partlabel/BOTTLEROCKET-DATA.
[DEPEND] Dependency failed for Local Directory (/local).
[DEPEND] Dependency failed for Mask Local Var Directory (/local/var).
[DEPEND] Dependency failed for Var Directory (/var).
[DEPEND] Dependency failed for User Login Management.
[DEPEND] Dependency failed for Bootstrap Commands.
[DEPEND] Dependency failed for Bottlerocket initial configuration complete.
[DEPEND] Dependency failed for Isolates configured.target.
[DEPEND] Dependency failed for D-Bus System Message Bus.
[DEPEND] Dependency failed for wicked DHCPv6 supplicant service.
[DEPEND] Dependency failed for wicked network management service daemon.
[DEPEND] Dependency failed for wicked DHCPv4 supplicant service.
[DEPEND] Dependency failed for wicked network nanny service.
[DEPEND] Dependency failed for Load/Save Random Seed.
[DEPEND] Dependency failed for Platform Persistent Storage Archival.
[DEPEND] Dependency failed for Private Directory (/var/lib/bottlerocket).
[DEPEND] Dependency failed for Prepare Var Directory (/var).
[DEPEND] Dependency failed for CNI Plugin Directory (/opt/cni).
[DEPEND] Dependency failed for Flush Journal to Persistent Storage.
[DEPEND] Dependency failed for Kernel Modules (Read-Write).
[DEPEND] Dependency failed for Kernel Development Sources (Read-Write).
[DEPEND] Dependency failed for Prepare Kubelet Directory (/var/lib/kubelet).
[DEPEND] Dependency failed for Kernel Development Sources (Read-Only).
[DEPEND] Dependency failed for Basic System.
[DEPEND] Dependency failed for Prepare Conta…d Directory (/var/lib/containerd).
[DEPEND] Dependency failed for CSI Helper Directory (/opt/csi).
[DEPEND] Dependency failed for Mnt Directory (/mnt).
[DEPEND] Dependency failed for Mask Local Mnt Directory (/local/mnt).
[DEPEND] Dependency failed for Opt Directory (/opt).
[DEPEND] Dependency failed for Mask Local Opt Directory (/local/opt).
[DEPEND] Dependency failed for Prepare Opt Directory (/opt).
[DEPEND] Dependency failed for Resize Data Partition.
[ TIME ] Timed out waiting for device /dev/d…4e8df-28b8-485c-9d19-362263b5944c.
[ TIME ] Timed out waiting for device /dev/d…40874-417d-4e26-a764-7885f22007ea.
[ OK ] Reached target First Boot Complete.
Starting Prepare Local Filesystem (/local)...
Starting Repart fallback data partition...
Starting Repart preferred data partition...
[FAILED] Failed to start Prepare Local Filesystem (/local).
See 'systemctl status prepare-local-fs.service' for details.
[ OK ] Reached target Local File Systems.
Mounting License files...
Starting Commit a transient machine-id on disk...
Starting Create Volatile Files and Directories...
[ OK ] Finished Repart fallback data partition.
[ OK ] Finished Repart preferred data partition.
[ OK ] Finished Commit a transient machine-id on disk.
[ OK ] Mounted License files.
[ OK ] Finished Create Volatile Files and Directories.
Starting Rebuild Dynamic Linker Cache...
[ OK ] Finished Rebuild Dynamic Linker Cache.
Starting Update is Completed...
[ OK ] Finished Update is Completed.
[ OK ] Reached target System Initialization.
[ OK ] Started Scheduled Metricdog Pings.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timer Units.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Socket Units.
Starting ACPI event daemon...
Starting Generate network configuration...
Starting Disable kexec load syscalls...
Starting Bottlerocket data store migrator...
[ OK ] Started ACPI event daemon.
[ OK ] Finished Disable kexec load syscalls.
[ 91.790761] netdog[454]: Failed to write primary interface to '/var/lib/netdog/primary_interface': No such file or directory (os error 2)
[FAILED] Failed to start Generate network configuration.
See 'systemctl status generate-network-config.service' for details.
[DEPEND] Dependency failed for Preparation for Network.
[ 91.795630] migrator[457]: Data store does not exist at given path, exiting (/var/lib/bottlerocket/datastore/current)
Starting wicked managed network interfaces...
[ OK ] Finished Bottlerocket data store migrator.
Starting Call signpost to mark the …r all required targets are met....
Starting Datastore creator...
[ 91.809841] storewolf[477]: Unable to create datastore: Unable to create directory at '/var/lib/bottlerocket/datastore': Read-only file system (os error 30)
[FAILED] Failed to start Datastore creator.
See 'systemctl status storewolf.service' for details.
[DEPEND] Dependency failed for Applies settings to create config files.
[DEPEND] Dependency failed for Send signal to CloudFormation Stack.
[DEPEND] Dependency failed for Sets the hostname.
[DEPEND] Dependency failed for Bottlerocket userdata configuration system.
[DEPEND] Dependency failed for User-specified setting generators.
[DEPEND] Dependency failed for Generate additional settings for Kubernetes.
[DEPEND] Dependency failed for Bottlerocket API server.
[FAILED] Failed to start wicked managed network interfaces.
See 'systemctl status wicked.service' for details.
[ OK ] Reached target Network.
[ OK ] Reached target Network is Online.
[ OK ] Finished Call signpost to mark the …ter all required targets are met..
@elebiodaslingshot your error looks to be a different one from this issue:
[ TIME ] Timed out waiting for device /dev/d…4e8df-28b8-485c-9d19-362263b5944c.
[ TIME ] Timed out waiting for device /dev/d…40874-417d-4e26-a764-7885f22007ea.
This looks to be an issue with the second EBS volume, the data volume, being available on boot. Can you check if there were issues with that volume and if it is occurring more than once, can you cut a new issue to keep these two separate? Thanks!
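One way to tell the two failure modes apart when triaging console output is to group the non-OK lines by their systemd status tag. A minimal sketch follows; the `summarize` helper and the tag list are illustrative assumptions and not part of any Bottlerocket tooling.

```python
import re

# Status tags as they appear in the console logs in this thread;
# whitespace inside the brackets varies, so it is matched loosely.
TAG = re.compile(r"^\[\s*(OK|FAILED|DEPEND|TIME)\s*\]\s*(.*)$")


def summarize(console_lines):
    """Group console messages by status tag, skipping units that started fine."""
    problems = {"TIME": [], "FAILED": [], "DEPEND": []}
    for raw in console_lines:
        m = TAG.match(raw.strip())
        if m and m.group(1) != "OK":
            problems[m.group(1)].append(m.group(2))
    return problems


# A few lines copied from the console log posted above.
sample = [
    "[ OK ] Reached target Swaps.",
    "[ TIME ] Timed out waiting for device /dev/disk/by-partlabel/BOTTLEROCKET-DATA.",
    "[DEPEND] Dependency failed for Local Directory (/local).",
    "[FAILED] Failed to start Prepare Local Filesystem (/local).",
]

report = summarize(sample)
print(report["TIME"])
```

A non-empty `TIME` list naming the BOTTLEROCKET-DATA device points at the data volume problem described here, whereas the original report shows a `FAILED` entry for "Wait for Network to be Configured" with no device timeouts.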
Hey folks,
We've been running our EKS cluster with Bottlerocket nodes provisioned by Karpenter for a while now, with no issue whatsoever - it's been rock solid :)
However, last night we hit an issue with Pods staying Pending for too long. We investigated and realised Karpenter correctly created a NodeClaim and started up an EC2 instance, but this new instance failed to boot and join the cluster. After 15 minutes, Karpenter realised something was wrong, deleted that node and tried again with a new node.
We thought it was only a transient issue at first, but it happened again this morning, at which point I grabbed an instance and enabled termination protection to keep it alive for debugging.
The startup logs show:
The main difference I can see between this and a healthy instance is the "[FAILED] Failed to start Wait for Network to be Configured." log line, which doesn't appear on a healthy instance.
I'm unsure how to debug this further. I'm currently trying to get an Instance Connect endpoint in that subnet, but I'm not sure it'll work given the instance doesn't seem to have a working network. I can keep it for a few days if there's any debugging I can run to help pinpoint the root cause here :)
Image I'm using: ami-07d1105485cff7781 (bottlerocket-aws-k8s-1.28-aarch64-v1.18.0-7452c37e), EKS 1.28, Karpenter 0.32
What I expected to happen: instance starts up and joins the EKS cluster
What actually happened: instance fails to start up, and doesn't join the cluster
How to reproduce the problem: I can't reproduce this reliably. It started happening on 2024-01-22 around 10pm UTC and has happened 7 times since, but we've also reliably provisioned hundreds of similar instances in the same subnets with the same AMI in this timeframe, so it's only intermittent.