canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Master node crashed upon reboot #2997

Closed RyzeNGrind closed 1 year ago

RyzeNGrind commented 2 years ago

Attached Inspection Report inspection-report-20220322_023344.tar.gz

balchua commented 2 years ago

It looks like some corruption happened to the dqlite data files perhaps due to an ungraceful shutdown.

Mar 22 02:28:37 calm-fox microk8s.daemon-k8s-dqlite[50658]: time="2022-03-22T02:28:37Z" level=fatal msg="Failed to start server: start node: raft_start(): io: load closed segment 0000000010466626-0000000010467281: entries batch 594 starting at byte 7573496: entries count 7237124267461211499 in preamble is too high\n"

I completely forgot how to resolve this particular issue. I vaguely recall that deleting some files and restarting dqlite will suffice. @ktsakalozos or @mathieubordere @neoaggelos may be able to help.
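The fatal log line itself names the corrupt segment file. As a hedged sketch, the filename can be pulled out of that message like this (the LOGLINE variable is copied from the journal message above; in practice it would come from journalctl for the snap.microk8s.daemon-k8s-dqlite unit):

```shell
# Extract the offending segment name from the dqlite fatal message.
LOGLINE='level=fatal msg="Failed to start server: start node: raft_start(): io: load closed segment 0000000010466626-0000000010467281: entries batch 594 starting at byte 7573496: entries count 7237124267461211499 in preamble is too high"'

# Segment files are named <first-index>-<last-index>; grab that token.
SEGMENT=$(printf '%s' "$LOGLINE" | grep -o 'closed segment [0-9]*-[0-9]*' | awk '{print $3}')
echo "$SEGMENT"   # prints 0000000010466626-0000000010467281
```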

MathieuBordere commented 2 years ago

Hi, can you run ls -alh /var/snap/microk8s/current/var/kubernetes/backend? Most of the time removing the offending segment file will get you up and running again. Please make sure to stop microk8s and make a backup of /var/snap/microk8s/current/var/kubernetes/backend before removing any files.

RyzeNGrind commented 2 years ago
ryzengrind@calm-fox:~$ ls -alh /var/snap/microk8s/current/var/kubernetes/backend
total 176M
drwxrwx--- 2 root microk8s  12K Mar 23 19:02 .
drwxr-xr-x 3 root root     4.0K Mar  6 04:39 ..
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:09 0000000010466626-0000000010467281
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:10 0000000010467282-0000000010467920
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:11 0000000010467921-0000000010468589
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:11 0000000010468590-0000000010469272
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:12 0000000010469273-0000000010469926
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:13 0000000010469927-0000000010470586
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 0000000010470587-0000000010471239
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 0000000010471240-0000000010471847
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:15 0000000010471848-0000000010472221
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 0000000010472222-0000000010472579
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:17 0000000010472580-0000000010473062
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 0000000010473063-0000000010473735
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 0000000010473736-0000000010474397
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 0000000010474398-0000000010475050
-rw-rw---- 1 root microk8s 1.9K Mar  6 04:39 cluster.crt
-rw-rw---- 1 root microk8s 3.2K Mar  6 04:39 cluster.key
-rw-rw---- 1 root microk8s  205 Mar 18 05:19 cluster.yaml
-rw-rw-r-- 1 root microk8s    2 Mar 23 19:01 failure-domain
-rw-rw---- 1 root microk8s   62 Mar  6 04:46 info.yaml
srw-rw---- 1 root microk8s    0 Mar 18 05:03 kine.sock:12379
-rw-rw---- 1 root microk8s   68 Mar 18 05:03 localnode.yaml
-rw-rw---- 1 root microk8s   32 Mar 18 05:03 metadata1
-rw-rw---- 1 root microk8s   32 Mar 18 05:03 metadata2
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 open-19
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 open-20
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 open-21
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 open-22
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 open-23
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 open-24
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 open-25
-rw-rw---- 1 root microk8s 4.0M Mar 18 05:19 snapshot-6-10473895-1089566
-rw-rw---- 1 root microk8s  136 Mar 18 05:19 snapshot-6-10473895-1089566.meta
-rw-rw---- 1 root microk8s 4.0M Mar 18 05:19 snapshot-6-10474919-1124378
-rw-rw---- 1 root microk8s  136 Mar 18 05:19 snapshot-6-10474919-1124378.meta

Thank you for the prompt response. How do I find the offending file segment? Doesn't seem obvious to me here unless I missed something.

MathieuBordere commented 2 years ago

can you stop microk8s, make a backup of that folder, remove 0000000010466626-0000000010467281 and try to restart?
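The steps above can be sketched as a shell sequence. This is a hedged sketch, not an official recovery procedure: the path and segment name are taken from this thread, and the removal should only happen after the backup succeeds.

```shell
# Sketch of the recovery steps described above (run as root / via sudo).
BACKEND=/var/snap/microk8s/current/var/kubernetes/backend
SEGMENT=0000000010466626-0000000010467281   # the file named in the fatal log

sudo snap stop microk8s                               # stop all MicroK8s services
sudo cp -a "$BACKEND" "$BACKEND.bak-$(date +%Y%m%d)"  # back up the whole folder first
sudo rm "$BACKEND/$SEGMENT"                           # remove only the corrupt segment
sudo snap start microk8s
microk8s status --wait-ready                          # confirm the node comes back
```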

RyzeNGrind commented 2 years ago

Hey, sorry to reopen this, but the same issue happened again. I think this is the root cause, though I'm not entirely sure why it happened. I stepped away from my computer to attend to some other things, and when I came back this is what I saw on my terminal.

Would you advise I repeat previous steps? Also anything I can do proactively to prevent this from happening again?

[Screenshot 2022-03-26 004230: terminal output]

When I looked up this error, Ask Ubuntu tells me that I need to update the firmware of my SSD, but I am pretty sure I have already activated UAS for the USB-C NVMe enclosure.


Could this perhaps be related to the extra kernel modules used on Ubuntu 21.10 to ease disk pressure from the Kubernetes usage?

Would appreciate any insights.

ktsakalozos commented 2 years ago

There are some interesting messages in dmesg as well:

...
[11699.243951] sd 1:0:0:0: [sda] tag#18 CDB: Write(10) 2a 00 01 0b 55 90 00 00 08 00
[11699.259926] scsi host1: uas_eh_device_reset_handler start
[11699.388640] usb 3-2: reset SuperSpeed USB device number 3 using xhci-hcd
[11699.411336] scsi host1: uas_eh_device_reset_handler success
[11735.596149] sd 1:0:0:0: [sda] tag#8 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD 
[11735.596167] sd 1:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 1d 17 b0 28 00 00 f8 00
[11750.956237] sd 1:0:0:0: [sda] tag#26 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD 
[11750.956252] sd 1:0:0:0: [sda] tag#26 CDB: Write(10) 2a 00 20 94 44 28 00 00 08 00
[11750.956262] sd 1:0:0:0: [sda] tag#25 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD 
[11750.956267] sd 1:0:0:0: [sda] tag#25 CDB: Write(10) 2a 00 20 94 31 68 00 00 08 00
[11750.956274] sd 1:0:0:0: [sda] tag#24 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD 
[11750.956279] sd 1:0:0:0: [sda] tag#24 CDB: Write(10) 2a 00 20 94 0c 70 00 00 08 00
[11750.956286] sd 1:0:0:0: [sda] tag#23 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD 
....

Maybe [1] is related.

[1] https://forums.raspberrypi.com/viewtopic.php?t=303604
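The linked thread discusses forcing a flaky USB enclosure from the uas driver back to plain usb-storage via a kernel quirk. A hedged sketch of building that quirk string (the lsusb line below is an example ASMedia bridge, not necessarily this enclosure; read the real ID from your own lsusb output):

```shell
# Build the usb-storage quirk string from an lsusb line.
LSUSB_LINE='Bus 003 Device 003: ID 174c:55aa ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge'

# Pull out the vendor:product ID and append the "u" flag (ignore UAS).
ID=$(printf '%s' "$LSUSB_LINE" | grep -o 'ID [0-9a-f]*:[0-9a-f]*' | awk '{print $2}')
echo "usb-storage.quirks=${ID}:u"   # prints usb-storage.quirks=174c:55aa:u
```

Adding that parameter to the kernel command line (e.g. /boot/firmware/cmdline.txt on a Raspberry Pi, or via GRUB on Ubuntu) and rebooting should make dmesg show the disk bound to usb-storage instead of uas, at some cost in throughput.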

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.