eug48 opened this issue 4 years ago
I've left one worker node in the stuck state in case that's useful for troubleshooting, and have now come across a well-known issue: pods running on that NotReady node are stuck as Terminating. With Prometheus this is a problem because the StatefulSet will not start another instance until that pod is manually force-deleted. Just mentioning it here as this is another reason why I think microk8s is not ready for auto-refreshes.
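For anyone hitting the same StatefulSet behaviour, a minimal sketch of the manual force-delete, assuming a Prometheus pod in a monitoring namespace (both names are hypothetical):

microk8s kubectl get pods -n monitoring | grep Terminating
# Force-remove the stuck pod so the StatefulSet controller can create a replacement
microk8s kubectl delete pod prometheus-k8s-0 -n monitoring --grace-period=0 --force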
Thank you for reporting this @eug48. I opened an issue/topic with the snap team at [1].
One note here is that you cannot hold snap refreshes forever (sudo snap set system refresh.hold=2050-01-01T15:04:05Z does not work). You can defer refreshes for up to 90 days, I think. If you want to block refreshes you need to set up a Snap Store Proxy [2].
[1] https://forum.snapcraft.io/t/snap-refresh-breaks-microk8s-cluster/15906
[2] https://docs.ubuntu.com/snap-store-proxy/en/
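For reference, a minimal sketch of those refresh controls (the timer value is illustrative; at the time snapd silently capped any hold at its maximum window):

snap refresh --time                                       # show the refresh timer and any active hold
sudo snap set system refresh.timer=fri5,23:00-01:00       # confine refreshes to a maintenance window
sudo snap set system refresh.hold=2050-01-01T15:04:05Z    # accepted without error, but only honoured up to snapd's cap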
Thanks very much for raising that @ktsakalozos and correcting my incorrect assumption that refresh.hold would work long-term. I was misled by the lead sentence "Use refresh.hold to delay snap refreshes until a defined time and date." in the snap docs (1) and by the command running without an error or warning. Another lesson not to skim documentation.
@eug48 sorry for the trouble and thanks for the report. The fact that it hangs during the copy-data phase is curious. I think you mentioned you have one node in the bad state? It looks like the data in /var/snap/microk8s/ never even started to be copied, i.e. the new snap's data dir did not even get created. Is this correct?
@mvo5 yes, /snap/microk8s/1254 got created but in /var/snap/microk8s/ there is only 1176.
Upon further investigation I've probably found the cause. I've been trying out rook-ceph and there is still a volume mounted with it:
/dev/rbd0 on /var/snap/microk8s/common/var/lib/kubelet/pods/4dbf852e-f740-4a9f-b72d-de1b50120983/volumes/kubernetes.io~csi/pvc-149ca422-8e37-48f4-b087-98cd31d06c43/mount type ext4 (rw,relatime,stripe=1024,data=ordered)
However, trying to ls some sub-directories within it hangs forever and dmesg is full of errors like libceph: mon0 10.152.183.195:6789 socket error on read. So Ceph is failing to connect to its service, which runs inside the cluster, but flanneld has been stopped.
sync also hangs forever, and sure enough it looks like snapd has launched a sync on which it is presumably waiting.
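A sketch of the clean-up that presumably unblocks snapd here, reusing the mount path from the output above; a lazy/forced unmount of a dead Ceph mount is a blunt instrument, so this is an assumption rather than a verified fix:

# List mounts under the microk8s data directory that snapd would have to traverse
mount | grep /var/snap/microk8s/common/var/lib/kubelet
# Lazily/forcefully unmount the dead Ceph-backed mount so sync and the data copy can proceed
sudo umount -f -l /var/snap/microk8s/common/var/lib/kubelet/pods/4dbf852e-f740-4a9f-b72d-de1b50120983/volumes/kubernetes.io~csi/pvc-149ca422-8e37-48f4-b087-98cd31d06c43/mount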
So this is already a complex and therefore rather brittle setup, and adding snap auto-refreshes to the mix makes failure much more likely. An option to turn them off permanently, so that users can upgrade manually and fix these kinds of problems, would be great for production use.
For anyone reading this with the same issue, snapd currently doesn't allow disabling auto-updates indefinitely, apart from the workaround suggested in a forum thread that hosts the broader umbrella discussion about this: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/268
TL;DR: snap download foo; snap install foo.snap --dangerous, replacing foo with the application in question.
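Applied to microk8s it could look roughly like this (a sketch; the channel is illustrative, microk8s needs --classic, and a --dangerous sideload is unasserted and not auto-refreshed):

snap download microk8s --channel=1.20/stable      # produces e.g. microk8s_2074.snap plus an .assert file
sudo snap install ./microk8s_*.snap --dangerous --classic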
Personally, I think it would be of great help if snapd offered more comprehensive and cooperative options, such as a (web)hook fired on a new snap version so that an external system could handle refreshes manually, one node at a time (for example by draining a node first, refreshing it, running some canary checks on it, then doing the same for the rest one by one, reverting everything on the first error, plus an automatic email describing the failed upgrade). That would still be within the ethos of keeping snaps up to date, but the system needs to trust sysadmins to make it so.
I also have problems with my microk8s cluster that may be related to this issue. I am experiencing regular service failures that start almost exactly at 2am, at an interval I have not yet worked out (every few days). The affected service is a VerneMQ MQTT server that uses MariaDB for authentication; after that (unknown) event, authentication no longer works. The event could relate to snap activity, as I discovered some microk8s restarts triggered by snap around that time. I have also seen this behaviour with other services. My assumption is that the failure may be related to persistent network connections that fail at the k8s level without the applications noticing. After rescheduling the corresponding pods everything works fine again.
I would also like to disable auto-refresh for microk8s to investigate the problem further and to prove my assumption. Does anyone have any other ideas?
Hi @skobow, is it possible you were following the latest/edge channel? What do you get from snap list | grep microk8s?
Hi @ktsakalozos, I am using the 1.19/stable channel, which currently installs v1.19.2.
Nothing got released on 1.19/stable. You could attach the microk8s.inspect tarball so we can take a look.
Find the tarball attached.
The reason for my assumption is the output of snap changes, which shows:
ID Status Spawn Ready Summary
166 Done today at 02:51 CEST today at 02:51 CEST Running service command for snap "microk8s"
167 Done today at 02:51 CEST today at 02:52 CEST Running service command for snap "microk8s"
168 Done today at 02:52 CEST today at 02:52 CEST Running service command for snap "microk8s"
169 Done today at 02:52 CEST today at 02:52 CEST Running service command for snap "microk8s"
170 Done today at 02:52 CEST today at 02:52 CEST Running service command for snap "microk8s"
171 Done today at 04:35 CEST today at 04:35 CEST Auto-refresh snaps "core", "snapd"
172 Done today at 04:51 CEST today at 04:51 CEST Running service command for snap "microk8s"
173 Done today at 04:51 CEST today at 04:52 CEST Running service command for snap "microk8s"
174 Done today at 04:52 CEST today at 04:52 CEST Running service command for snap "microk8s"
175 Done today at 04:52 CEST today at 04:52 CEST Running service command for snap "microk8s"
176 Done today at 04:52 CEST today at 04:52 CEST Running service command for snap "microk8s"
177 Done today at 05:08 CEST today at 05:08 CEST Running service command for snap "microk8s"
178 Done today at 05:08 CEST today at 05:08 CEST Running service command for snap "microk8s"
179 Done today at 05:08 CEST today at 05:08 CEST Running service command for snap "microk8s"
180 Done today at 05:08 CEST today at 05:08 CEST Running service command for snap "microk8s"
181 Done today at 05:08 CEST today at 05:08 CEST Running service command for snap "microk8s"
182 Error today at 10:27 CEST today at 10:27 CEST Change configuration of "core" snap
183 Done today at 10:29 CEST today at 10:29 CEST Change configuration of "core" snap
snap change 166 then shows:
Status Spawn Ready Summary
Done today at 02:51 CEST today at 02:51 CEST restart of [microk8s.daemon-etcd]
The timestamps match when the service stopped working. Even though there might not be any updates, something happens anyway. Could that be related? inspection-report-20201014_103630.tar.gz
Hi! FYI: exactly the same thing happened tonight at the same time. @ktsakalozos what are these service commands and why are they run?
I am not sure why snapd decides to restart MicroK8s. Could you attach the snapd log (journalctl -u snapd -n 3000)? If we do not see anything there we may need to ask over at https://forum.snapcraft.io/.
@ktsakalozos Any news on this topic?
@skobow in the snapd.log I see these failures:
Oct 16 13:04:32 k8s-master snapd[803]: storehelpers.go:551: cannot refresh: snap has no updates available: "core", "core18", "lxd", "microk8s", "snapd"
Oct 16 13:04:32 k8s-master snapd[803]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.snapcraft.io/api/v1/snaps/sections"
If you do not know what might be causing this we will go to https://forum.snapcraft.io/ and ask there.
Hello, I believe I'm running into the same issue here as well.
$ snap changes
ID Status Spawn Ready Summary
198 Doing today at 16:59 UTC - Auto-refresh snap "microk8s"
$ snap tasks 198
Status Spawn Ready Summary
Done today at 16:59 UTC today at 16:59 UTC Ensure prerequisites for "microk8s" are available
Done today at 16:59 UTC today at 17:04 UTC Download snap "microk8s" (2074) from channel "1.20/stable"
Done today at 16:59 UTC today at 17:04 UTC Fetch and check assertions for snap "microk8s" (2074)
Done today at 16:59 UTC today at 17:04 UTC Mount snap "microk8s" (2074)
Done today at 16:59 UTC today at 17:04 UTC Run pre-refresh hook of "microk8s" snap if present
Done today at 16:59 UTC today at 17:06 UTC Stop snap "microk8s" services
Done today at 16:59 UTC today at 17:06 UTC Remove aliases for snap "microk8s"
Done today at 16:59 UTC today at 17:07 UTC Make current revision for snap "microk8s" unavailable
Doing today at 16:59 UTC - Copy snap "microk8s" data
Do today at 16:59 UTC - Setup snap "microk8s" (2074) security profiles
Do today at 16:59 UTC - Make snap "microk8s" (2074) available to the system
Do today at 16:59 UTC - Automatically connect eligible plugs and slots of snap "microk8s"
Do today at 16:59 UTC - Set automatic aliases for snap "microk8s"
Do today at 16:59 UTC - Setup snap "microk8s" aliases
Do today at 16:59 UTC - Run post-refresh hook of "microk8s" snap if present
Do today at 16:59 UTC - Start snap "microk8s" (2074) services
Do today at 16:59 UTC - Remove data for snap "microk8s" (1910)
Do today at 16:59 UTC - Remove snap "microk8s" (1910) from the system
Do today at 16:59 UTC - Clean up "microk8s" (2074) install
Do today at 16:59 UTC - Run configure hook of "microk8s" snap if present
Do today at 16:59 UTC - Run health check of "microk8s" snap
Doing today at 16:59 UTC - Consider re-refresh of "microk8s"
It appears whenever snap decides to auto-refresh, microk8s hangs on the copy step and never completes (taking the cluster down).
The only things that seemed to be effective were rebooting the machine or, as I recently discovered, aborting the auto-refresh:
$ sudo snap abort 198
... wait ...
$ sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
$ snap changes
ID Status Spawn Ready Summary
198 Abort today at 16:59 UTC - Auto-refresh snap "microk8s"
$ snap list
Name Version Rev Tracking Publisher Notes
core 16-2.49 10859 latest/stable canonical✓ core
core18 20210128 1988 latest/stable canonical✓ base
docker 19.03.13 796 latest/stable canonical✓ -
helm3 3.1.2 5 latest/stable terraform-snap -
lxd 4.12 19766 latest/stable/… canonical✓ -
microk8s v1.20.2 2035 1.20/stable canonical✓ disabled,classic
snapd 2.49 11107 latest/stable canonical✓ snapd
$ sudo killall snapd
However, eventually the auto-refresh happens again...
~$ snap changes
ID Status Spawn Ready Summary
198 Undone today at 16:59 UTC today at 19:09 UTC Auto-refresh snap "microk8s"
199 Doing today at 19:14 UTC - Auto-refresh snap "microk8s"
Reading this thread gave me the idea to look for unusual mounts that were lingering... and while I wasn't able to find any references to libceph, I did see that the NFS mounts from the nfs-provisioner running in my cluster were erroring out.
dmesg
[2465660.413978] nfs: server 10.152.183.19 not responding, timed out
[2465667.326132] nfs: server 10.152.183.19 not responding, timed out
[2465677.686357] nfs: server 10.152.183.19 not responding, timed out
mounts (partial)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/data on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/0 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/log on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/cert on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/2 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/init.d on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/3 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
Not sure if this is the actual cause but thought I'd share in case it was helpful to anyone. Did anyone find a resolution to this problem?
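For anyone in the same spot, a sketch of the stopgap implied by the commands above: abort the in-flight change, hold refreshes for as long as snapd allows, and restart the services (the change ID and the 60-day window are illustrative):

sudo snap abort 199                                                  # change ID taken from `snap changes`
sudo snap set system refresh.hold="$(date -d '+60 days' --iso-8601=seconds)"
sudo snap start microk8s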
I experienced the same issue. Pods remain in terminating states. New ones are created but fail to run due to a connectivity issue that I can't reproduce outside of the pod. Removing deployments and services before recreating them does not help. I had to reinstall the whole cluster after resetting all nodes.
Name Version Rev Tracking Publisher Notes
core 16-2.49 10859 latest/stable canonical✓ core
core18 20210128 1988 latest/stable canonical✓ base
lxd 4.0.5 19188 4.0/stable/… canonical✓ -
microk8s v1.20.4 2074 latest/stable canonical✓ classic
snapd 2.49 11107 latest/stable canonical✓ snapd
Mar 17 11:30:12 api-bhs snapd[47810]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.sna>
Mar 17 11:30:24 api-bhs snapd[47810]: main.go:155: Exiting on terminated signal.
Came across a probably related problem: snap refreshed microk8s and took the cluster down - all pods were then in a "sandbox changed" state. The same goes for node reboots, by the way. A microk8s restart usually helps, but sometimes I have to start from scratch to get it working again.
It would be great to tackle one of those two. Happy to provide any log, as I can easily reproduce the sandbox-issue.
I'd be careful with "reliable production-ready Kubernetes distribution" (from https://ubuntu.com/blog/introduction-to-microk8s-part-1-2) until then :)
I also came across this issue today, also running rook-ceph in a 3 node cluster. The Rook/Ceph cluster works perfectly fine otherwise.
Great, thanks snap autorefresh, you have crashed my entire cluster with this:
/snap/microk8s/2338/bin/dqlite: symbol lookup error: /snap/microk8s/2338/bin/dqlite: undefined symbol: sqlite3_system_errno
In case it's helpful to anyone here, I was able to permanently disable the auto-refresh by disabling the snapd service: sudo systemctl stop snapd.service, then sudo systemctl mask snapd.service to disable; sudo systemctl unmask snapd.service and sudo systemctl start snapd.service to re-enable.
Since doing this, I haven't had any stability issues with my cluster at all. This is my temporary fix until I have time to migrate my cluster to k3s or something that actually works.
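A consolidated sketch of that workaround; masking snapd.socket as well is an addition of mine, on the assumption that socket activation could otherwise bring the daemon back:

sudo systemctl stop snapd.service snapd.socket
sudo systemctl mask snapd.service snapd.socket
# ...and to re-enable later:
sudo systemctl unmask snapd.service snapd.socket
sudo systemctl start snapd.socket snapd.service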
None of these are permanent fixes; snap will never implement disabling auto-updates, and this will always be a problem.
I just suggest not touching microk8s at all: use it only for development purposes and ban it from all production use.
The Kubernetes project ships a few releases every month [1]. These releases include security, bug, and regression fixes. Every production-grade Kubernetes distribution should have a mechanism to release such fixes, even before they are released upstream. For MicroK8s this mechanism is snaps. Snaps allow us to keep your Kubernetes infrastructure up to date not only with fresh Kubernetes binaries but also with updates and fixes to the integrations with the underlying system and the Kubernetes ecosystem.
If you do not want to take the risk of automated refreshes you have at least two options: control when refreshes happen via the snap refresh settings [3], or set up a Snap Store Proxy [2].
[1] https://github.com/kubernetes/kubernetes/releases
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://snapcraft.io/docs/keeping-snaps-up-to-date
@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.
@ShadowJonathan, I am not sure why you mention only security and in quotes. Any update that breaks the cluster is defeating its own purpose.
For anyone who wants to contribute back to this project, we would be grateful if you could run non-production clusters on the candidate channels of the track you follow, for example 1.20/candidate. Normally, the stable channel gets updated with what is on candidate after about a week. Having candidate releases well tested in a large variety of setups would be great.
I am not sure why you mention only security and in quotes.
Your point was that security is paramount and absolute, and that it should be the excuse that makes this problem okay. It's not; it's an excuse that only exacerbates this problem, and the problems with snap on servers in general.
Snaps are fine for user apps; those can deal with being restarted, crashing, and shutting down again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something updates; they want complete control over their systems, and snap's auto-update feature is a complete insult to that.
Any update that breaks the cluster is defeating its own purpose.
I'm glad you agree, then? I'd rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it comes down to my own oversight and my own fault (at least then I can tune things to my own schedule and my own system). With auto-update, and even the update window, that control is taken away from me: now I have to scramble to make sure the eventual update will not mess with my system, and then do it manually, safely, and in a controlled way to make sure it does not destroy the data (which it did for me: 1.2 TB of scraping data, all corrupted because Docker didn't want to close within 30 seconds, after which it got SIGKILLed).
As a sysadmin, I control a developer's software: when, where, and how it runs. The developer doesn't control my system unless I allow it, and even then only on my own terms.
Snap violates this principle, and that's why I'm incredibly displeased with it.
@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.
I'm thinking the same... but they keep saying it is production-ready... really???
@lfdominguez branding
But if microk8s moved away from snap... or used another method, like a self-contained executable (as k3s or k0s do), I think that would be better; you would escape the insane snap auto-refresh....
It looks like an option in snapd to disable updates for a given package would satisfy most people complaining here (me included).
Strangely, issues have been disabled on https://github.com/snapcore/snapd. I wonder where an issue could be created to discuss new options for snapd.
Unfortunately, it looks like the people working on the project have a very strong opinion on the auto-refresh feature, so I wonder if it is worth spending more time: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/4
Perhaps it's time to look at other solutions like k3s, as already mentioned.
It looks like an option in snapd to disable updates for a given package would satisfy most people complaining here (me included).
Strangely, issues have been disabled on https://github.com/snapcore/snapd. I wonder where an issue could be created to discuss new options for snapd.
Unfortunately, it looks like the people working on the project have a very strong opinion on the auto-refresh feature, so I wonder if it is worth spending more time: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/4
Perhaps it's time to look at other solutions like k3s, as already mentioned.
Disabling the issues while calling themselves community-driven... it's a shame. But well, my cluster is already up with 50 nodes, so I'll go with the 127.0.0.1 thing.... In the future, if I get some free time, I will fork the snapd project and patch in an option to disable auto-refresh....
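Presumably the "127.0.0.1 thing" means pointing the store endpoint at localhost in /etc/hosts so snapd cannot reach it; that is an assumption on my part, and a blunt one, since it also blocks manual refreshes until reverted:

echo "127.0.0.1 api.snapcraft.io" | sudo tee -a /etc/hosts    # assumed workaround: make the store unreachable
sudo sed -i '/api.snapcraft.io/d' /etc/hosts                  # revert when a manual refresh is wanted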
disable updates for a given package would satisfy most people who are complaining here
You are given this option. You can block, schedule or postpone refreshes. You are even given the option to try an "offline" manual deployment that never updates. Although this may suit your needs, it is very hard from the Kubernetes distribution perspective to call this a good practice.
@ktsakalozos As it was explained, scheduling and postponing are not an option. We want full control, meaning we would like to disable auto-updates but keep the possibility to trigger an update manually when we want.
Blocking would be great, but setting up a snap store proxy looks overly complex and quite low-level.
You are even given the option to try an "offline" manual deployment that never updates.
Thanks for pointing this out. It looks like a solution, although it seems we first need to download a snap file locally before installing it.
Really, really... I have yet to read any real technical impediment that the developers rely on to justify not enabling that functionality. All I read is arguments about whether it is better for the user or not. Please, not all users are dumb; there are many who manage technical services, and not having that option is like telling them "You don't know what you are doing, let us deal with it".
You are given this option. You can block, schedule or postpone refreshes.
Block? False. Point me to exactly where in your comment you said that, and keep in mind I don't consider the snap store proxy to be a solution here.
@lfdominguez I have the same feeling. Even if I understand the point of enabling auto-refresh by default, doing everything possible to obscure and prevent disabling the feature is really sad.
And this happened again with the 1.19.14 release:
/snap/microk8s/2408/bin/dqlite: symbol lookup error: /snap/microk8s/2408/bin/dqlite: undefined symbol: sqlite3_system_errno
I was so happy with microk8s so far, but this is just bad.
@lalinsky where do you see this error? Is this in some daemon logs, or do you get it when you execute a command?
Could you attach a microk8s.inspect tarball?
@lalinsky I see it now, it is in microk8s.status. I just reverted the release on the 1.19 stable channel. If you snap refresh microk8s on your nodes you should get 2339.
@ktsakalozos It's happening again now with 2483, same error as before
/snap/microk8s/2483/bin/dqlite: symbol lookup error: /snap/microk8s/2483/bin/dqlite: undefined symbol: sqlite3_system_errno
@JeongJuhyeon could you please sudo snap refresh microk8s?
Today I experienced a crash of a PRODUCTION microk8s 3-node "HA" cluster. It just auto-updated to 1.21.5! As a programmer and admin, I cannot comprehend what the people deciding how to package crucial services have in mind when they choose such a fundamentally broken tool as snap??? Why does Ubuntu use it at all, when it is hardly suitable even for desktop apps and not suitable for services at all??? What if some medical outfit buys the advertising of "highly available" and people die because it auto-updates??? They should drop snap for anything aside from desktop apps, and better yet drop it altogether and use .deb packages, proven over the years...
@ktsakalozos That's one of the first things we tried, but it did not help. Not sure about now; it's a staging server and it's after hours.
@vazir Absolutely, it has now become a priority for us to migrate away from microk8s. It's a bit irresponsible that the docs/GitHub page does not have a big warning saying "microk8s should not be used in any kind of situation where multiple people depend on it", given that this is obviously a complete deal-breaker for any kind of serious usage. Going by this thread, this has been an issue from day 1 and is never going to change given snap's unwillingness to make auto-refresh easy to disable. The only way any kind of auto-update system is reasonable for serious applications is if you're Microsoft, with complete control over every line of update code you ship as well as gigantic resources to test every possible setup and scenario. And even then they still let you disable it on systems meant for serious use cases.
I suspect the people behind snap even scoff at us, as it seems they do not allow delaying auto-refresh for more than 60 days.
# snap set system refresh.hold=2050-01-01T15:04:05Z
# snap refresh --time
timer: 00:00~24:00/4
last: today at 10:22 UTC
hold: in 60 days, at 10:22 UTC
next: today at 17:25 UTC (but held)
@JeongJuhyeon please try to refresh.
@vazir, @JeongJuhyeon if you want to take control over the updates you can either schedule them at a convenient time or install the Snap Store Proxy mentioned in a previous comment. If you are certain you will not need updates in the future you can perform an offline deployment that will never update; we describe the few commands you need to run in the MicroK8s docs. You do however understand that MicroK8s cannot stop shipping patch releases and updates.
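Presumably along these lines (a sketch, not the exact commands from the MicroK8s docs; the channel is illustrative). A snap installed from a local file does not track a channel, so it is not auto-refreshed:

snap download microk8s --channel=1.21/stable     # produces microk8s_<rev>.snap and microk8s_<rev>.assert
sudo snap ack microk8s_*.assert                  # import the store assertions for the downloaded revision
sudo snap install microk8s_*.snap --classic      # install from the local file; refresh later by repeating these steps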
@vazir I am interested to know how did the cluster crash and what did you have to do to recover it? Thank you.
@ktsakalozos I'm still awaiting an answer to my question.
@vazir I am interested to know how did the cluster crash and what did you have to do to recover it? Thank you.
I had to restart all 3 nodes to bring it up again.
you do however understand that MicroK8s cannot stop shipping patch releases and updates.
For the rest of what you mentioned, you probably understand that everything you suggested is a hack, and totally unsuitable for a distribution? Nobody suggests stopping the distribution of patches, but do you understand that NO ONE in their right mind auto-updates critical services like databases, Kubernetes clusters, and similar? You deploy highly resilient, highly available, highly redundant stuff like k8s, which on its own ATTEMPTS to be resistant to failures and tries to keep itself alive... and then you effectively "kill -9" it from the outside??? What is the point of developing microk8s at all if you totally ruin its redundancy and resiliency simply by mischoosing the distribution tool?
@ktsakalozos I'm sure you understand that a simultaneous update of the package on any number of nodes WILL lead to a service stop, as ALL the nodes restart the service at the same time? In Kubernetes you (normally) do not just restart a node; you drain it, making sure other nodes take over the pods, and then upgrade/maintain it. So you just CANNOT distribute patches the way you do. Yes, you just cannot. It may sound a little strong, but can you imagine when lives depend on the stuff you do, or mass transport, or... whatever, where multiple people are affected? You DO call your microk8s SUPER reliable, so ANY upgrade, even a minor one, must be done by the admin, and the admin might then choose to implement auto-upgrade, because the admin will do it right: draining the node, and so on.
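The per-node flow being described would look roughly like this (a sketch; the node name and channel are hypothetical, and the exact drain flags depend on the workloads):

microk8s kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data   # move pods off the node first
sudo snap refresh microk8s --channel=1.21/stable                           # run on node-1 itself, at a time of the admin's choosing
microk8s kubectl uncordon node-1                                           # readmit the node, then repeat one node at a time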
How is there still no option to disable auto-updates and only do updates manually? My production cluster went down today because snap killed a node. I had to work after hours to bring it back up. When Ceph runs on a cluster, it needs very manual and very gentle intervention to bring down a node gracefully. The way snap killed the node made connectivity errors pop up everywhere, and I had to hard-reboot every single node in the cluster to get back full control. After a year and a half, it's unbelievable that we still can't disable auto-updates!
This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes - it looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd. For a production-ready Kubernetes distribution I really think this is a far from acceptable default. Perhaps until snapd allows disabling auto-refreshes the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. Also, a Kubernetes-native integration with snapd refreshes could be considered (e.g. a Prometheus/Grafana dashboard/alert) to prompt manual updates - presumably one node at a time to begin with. Otherwise microk8s is working rather well, so thank you very much.
More details about the outage:
microk8s is disabled..
Data copy appears hung
There doesn't seem to be much to copy anyway:
Starting microk8s fails
Fails to abort..
snapd service hangs when trying to stop it...
have to resort to manually stopping the process
finally change is undone..
Nothing much in snapd logs except for a polkit error - unsure if related: