Closed: RyzeNGrind closed this issue 1 year ago
It looks like some corruption happened to the dqlite data files, perhaps due to an ungraceful shutdown.
Mar 22 02:28:37 calm-fox microk8s.daemon-k8s-dqlite[50658]: time="2022-03-22T02:28:37Z" level=fatal msg="Failed to start server: start node: raft_start(): io: load closed segment 0000000010466626-0000000010467281: entries batch 594 starting at byte 7573496: entries count 7237124267461211499 in preamble is too high\n"
I completely forgot how to resolve this particular issue. I vaguely recall that deleting some files and restarting dqlite will suffice. @ktsakalozos, @mathieubordere, or @neoaggelos may be able to help.
Hi, can you run ls -alh /var/snap/microk8s/current/var/kubernetes/backend?
Most of the time, removing the offending segment file will get you up and running again. Please make sure to stop MicroK8s and make a backup of /var/snap/microk8s/current/var/kubernetes/backend before removing any files.
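The steps above can be sketched as a few shell commands. This is an illustrative sketch, not an official MicroK8s procedure: the segment name comes from the fatal log earlier in this thread, and a temp directory stands in for the real backend path so the sketch can be dry-run safely.

```shell
# Illustrative recovery sketch. On a real node BACKEND would be
# /var/snap/microk8s/current/var/kubernetes/backend; a temp dir is used
# here so the commands are safe to dry-run.
BACKEND=$(mktemp -d)
touch "$BACKEND/0000000010466626-0000000010467281"   # the segment named in the fatal log

# 1. Stop MicroK8s before touching any files:
#      microk8s stop
# 2. Back up the whole backend directory:
BACKUP=$(mktemp -d)/backend-backup
cp -a "$BACKEND" "$BACKUP"
# 3. Remove only the segment named in the log line:
rm "$BACKEND/0000000010466626-0000000010467281"
# 4. Restart and watch the logs:
#      microk8s start
#      journalctl -u snap.microk8s.daemon-k8s-dqlite -f
echo "removed segment; backup kept at $BACKUP"
```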
ryzengrind@calm-fox:~$ ls -alh /var/snap/microk8s/current/var/kubernetes/backend
total 176M
drwxrwx--- 2 root microk8s 12K Mar 23 19:02 .
drwxr-xr-x 3 root root 4.0K Mar 6 04:39 ..
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:09 0000000010466626-0000000010467281
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:10 0000000010467282-0000000010467920
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:11 0000000010467921-0000000010468589
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:11 0000000010468590-0000000010469272
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:12 0000000010469273-0000000010469926
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:13 0000000010469927-0000000010470586
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 0000000010470587-0000000010471239
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 0000000010471240-0000000010471847
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:15 0000000010471848-0000000010472221
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 0000000010472222-0000000010472579
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:17 0000000010472580-0000000010473062
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 0000000010473063-0000000010473735
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 0000000010473736-0000000010474397
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 0000000010474398-0000000010475050
-rw-rw---- 1 root microk8s 1.9K Mar 6 04:39 cluster.crt
-rw-rw---- 1 root microk8s 3.2K Mar 6 04:39 cluster.key
-rw-rw---- 1 root microk8s 205 Mar 18 05:19 cluster.yaml
-rw-rw-r-- 1 root microk8s 2 Mar 23 19:01 failure-domain
-rw-rw---- 1 root microk8s 62 Mar 6 04:46 info.yaml
srw-rw---- 1 root microk8s 0 Mar 18 05:03 kine.sock:12379
-rw-rw---- 1 root microk8s 68 Mar 18 05:03 localnode.yaml
-rw-rw---- 1 root microk8s 32 Mar 18 05:03 metadata1
-rw-rw---- 1 root microk8s 32 Mar 18 05:03 metadata2
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:19 open-19
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 open-20
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:14 open-21
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 open-22
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:16 open-23
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 open-24
-rw-rw---- 1 root microk8s 8.0M Mar 18 05:18 open-25
-rw-rw---- 1 root microk8s 4.0M Mar 18 05:19 snapshot-6-10473895-1089566
-rw-rw---- 1 root microk8s 136 Mar 18 05:19 snapshot-6-10473895-1089566.meta
-rw-rw---- 1 root microk8s 4.0M Mar 18 05:19 snapshot-6-10474919-1124378
-rw-rw---- 1 root microk8s 136 Mar 18 05:19 snapshot-6-10474919-1124378.meta
Thank you for the prompt response. How do I find the offending segment file? It doesn't seem obvious to me here, unless I missed something.
Can you stop microk8s, make a backup of that folder, remove 0000000010466626-0000000010467281, and try to restart?
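For completeness: the segment name in that suggestion is taken verbatim from the fatal log message. A small sketch of pulling it out automatically; the LOG variable here stands in for what you would get from journalctl -u snap.microk8s.daemon-k8s-dqlite on a real node.

```shell
# Extract the offending closed-segment name from the dqlite fatal message.
# LOG is the exact message from this thread; on a node, pipe journalctl output instead.
LOG='level=fatal msg="Failed to start server: start node: raft_start(): io: load closed segment 0000000010466626-0000000010467281: entries batch 594 starting at byte 7573496: entries count 7237124267461211499 in preamble is too high"'
SEGMENT=$(echo "$LOG" | grep -oE 'load closed segment [0-9]+-[0-9]+' | awk '{print $4}')
echo "$SEGMENT"   # prints 0000000010466626-0000000010467281
```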
Hey, sorry to reopen this, but the same issue happened again. I think this is the root cause, though I'm not entirely sure why it happened. I stepped away from my computer to attend to some other things, and when I came back this is what I saw on my terminal.
Would you advise that I repeat the previous steps? Also, is there anything I can do proactively to prevent this from happening again?
When I looked up this error, Ask Ubuntu tells me I need to update the firmware of my SSD, but I am pretty sure I have already activated UAS for the USB-C NVMe enclosure.
Could this perhaps be related to the extra kernel modules used on Ubuntu 21.10 to ease disk pressure from the Kubernetes usage?
Would appreciate any insights.
There are some interesting messages in dmesg as well:
...
[11699.243951] sd 1:0:0:0: [sda] tag#18 CDB: Write(10) 2a 00 01 0b 55 90 00 00 08 00
[11699.259926] scsi host1: uas_eh_device_reset_handler start
[11699.388640] usb 3-2: reset SuperSpeed USB device number 3 using xhci-hcd
[11699.411336] scsi host1: uas_eh_device_reset_handler success
[11735.596149] sd 1:0:0:0: [sda] tag#8 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
[11735.596167] sd 1:0:0:0: [sda] tag#8 CDB: Write(10) 2a 00 1d 17 b0 28 00 00 f8 00
[11750.956237] sd 1:0:0:0: [sda] tag#26 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD
[11750.956252] sd 1:0:0:0: [sda] tag#26 CDB: Write(10) 2a 00 20 94 44 28 00 00 08 00
[11750.956262] sd 1:0:0:0: [sda] tag#25 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD
[11750.956267] sd 1:0:0:0: [sda] tag#25 CDB: Write(10) 2a 00 20 94 31 68 00 00 08 00
[11750.956274] sd 1:0:0:0: [sda] tag#24 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD
[11750.956279] sd 1:0:0:0: [sda] tag#24 CDB: Write(10) 2a 00 20 94 0c 70 00 00 08 00
[11750.956286] sd 1:0:0:0: [sda] tag#23 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD
....
Maybe [1] is related.
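The uas_eh_abort_handler / device-reset messages above are a known symptom of flaky UAS behaviour with some USB enclosures. One commonly suggested mitigation, offered here only as a hypothetical sketch (the VID:PID below is a placeholder, not this user's enclosure), is to force the kernel back to plain usb-storage for that device via a quirk:

```shell
# Hypothetical mitigation sketch, not verified against this exact enclosure.
# First find the enclosure's vendor:product ID:
#   lsusb
# Then add a usbcore quirk to the kernel command line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash usbcore.quirks=174c:55aa:u"
#   (174c:55aa is a placeholder VID:PID; the 'u' flag disables UAS for that device)
# Finally regenerate the grub config and reboot:
#   sudo update-grub && sudo reboot
```

Trading UAS for usb-storage costs some throughput but can stop the write aborts that corrupt dqlite segments mid-flush.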
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Attached inspection report: inspection-report-20220322_023344.tar.gz