canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

snap auto-refresh breaks cluster #1022

Open eug48 opened 4 years ago

eug48 commented 4 years ago

This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes - it looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd. For a production-ready Kubernetes distribution I really think this is a far from acceptable default. Perhaps, until snapd allows disabling auto-refreshes, the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. A Kubernetes-native integration with snapd refreshes could also be considered (e.g. a prometheus/grafana dashboard/alert) to prompt manual updates - presumably one node at a time to begin with.

Otherwise microk8s is working rather well so thank you very much.

More details about the outage:

kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3

microk8s is disabled..

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  disabled,classic
root@wk3:/home# snap changes microk8s
ID   Status  Spawn                Ready  Summary
20   Doing   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

Data copy appears hung

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:56 AEDT  Ensure prerequisites for "microk8s" are available
Done    today at 09:56 AEDT  today at 09:56 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 09:56 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Mount snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Run pre-refresh hook of "microk8s" snap if present
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Done    today at 09:56 AEDT  today at 09:57 AEDT  Remove aliases for snap "microk8s"
Done    today at 09:56 AEDT  today at 09:57 AEDT  Make current revision for snap "microk8s" unavailable
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
Do      today at 09:56 AEDT  -                    Make snap "microk8s" (1254) available to the system
Do      today at 09:56 AEDT  -                    Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 09:56 AEDT  -                    Set automatic aliases for snap "microk8s"
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" aliases
Do      today at 09:56 AEDT  -                    Run post-refresh hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Start snap "microk8s" (1254) services
Do      today at 09:56 AEDT  -                    Clean up "microk8s" (1254) install
Do      today at 09:56 AEDT  -                    Run configure hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Run health check of "microk8s" snap
Doing   today at 09:56 AEDT  -                    Consider re-refresh of "microk8s"

There doesn't seem to be much to copy anyway:

root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M    /var/lib/snapd/
74G /var/snap/
2.0G    /snap

root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M    /snap/microk8s/1176
737M    /snap/microk8s/1254

root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K    /var/snap/microk8s/1176
74G /var/snap/microk8s/common

Starting microk8s fails

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

Fails to abort..

root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID   Status  Spawn                Ready  Summary
20   Abort   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

snapd service hangs when trying to stop it...

root@wk2 ~# systemctl stop snapd.service
(hangs)

have to resort to manually stopping the process

killall snapd

finally the change is undone..

root@wk3:/home# snap changes
ID   Status  Spawn                Ready                Summary
20   Undone  today at 09:56 AEDT  today at 10:41 AEDT  Auto-refresh snap "microk8s"

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 10:41 AEDT  Ensure prerequisites for "microk8s" are available
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 10:41 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Mount snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Run pre-refresh hook of "microk8s" snap if present
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Stop snap "microk8s" services
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Remove aliases for snap "microk8s"
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Make current revision for snap "microk8s" unavailable
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Copy snap "microk8s" data
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" (1254) security profiles
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Make snap "microk8s" (1254) available to the system
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Automatically connect eligible plugs and slots of snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Set automatic aliases for snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" aliases
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run post-refresh hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Start snap "microk8s" (1254) services
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Clean up "microk8s" (1254) install
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run configure hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run health check of "microk8s" snap
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Consider re-refresh of "microk8s"

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  classic

Nothing much in snapd logs except for a polkit error - unsure if related:

root@wk3:/home# journalctl -b -u snapd.service

...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.

eug48 commented 4 years ago

I've left one worker node in the stuck state in case that's useful for troubleshooting, and have now come across a well-known issue: pods running on that NotReady node are stuck as Terminating. With Prometheus this is a problem because the StatefulSet will not start another instance until that pod is manually force-deleted. Just mentioning it here as this is another reason why I think microk8s is not ready for auto-refreshes.
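
For reference, the manual force-delete mentioned above is typically a single kubectl command (pod and namespace names are placeholders, not taken from the report):

# forcibly remove a pod stuck in Terminating on a NotReady node
kubectl delete pod <stuck-pod> -n <namespace> --grace-period=0 --force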

ktsakalozos commented 4 years ago

Thank you for reporting this @eug48. I opened an issue/topic with the snap team at [1].

One note here is that you cannot hold snap refreshes forever (sudo snap set system refresh.hold=2050-01-01T15:04:05Z does not work). You can defer refreshes for up to 90 days, I think. If you want to block refreshes you need to set up a snap store proxy [2].

[1] https://forum.snapcraft.io/t/snap-refresh-breaks-microk8s-cluster/15906
[2] https://docs.ubuntu.com/snap-store-proxy/en/
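
For anyone who wants to check how long a hold actually sticks, a minimal sketch (assuming GNU date; the cap on the hold is enforced by snapd itself, not by these commands):

# request a hold, then ask snapd what it actually scheduled
sudo snap set system refresh.hold="$(date --date='60 days' +%Y-%m-%dT%H:%M:%S%:z)"
snap refresh --time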

eug48 commented 4 years ago

Thanks very much for raising that @ktsakalozos and correcting my incorrect assumption that refresh.hold would work long-term. I was misled by the lead sentence "Use refresh.hold to delay snap refreshes until a defined time and date." in the snap docs (1) and by the command running without an error/warning. Another lesson not to skim documentation..

mvo5 commented 4 years ago

@eug48 sorry for the trouble and thanks for the report. The fact that it hangs during the copy-data phase is curious. I think you mentioned you have one node in the bad state?

It looks like the data in /var/snap/microk8s/ did not even start to get copied, i.e. the new snap's data dir did not even get created - is this correct?

eug48 commented 4 years ago

@mvo5 yes, /snap/microk8s/1254 got created but in /var/snap/microk8s/ there is only 1176.

Upon further investigation I've probably found the cause. I've been trying out rook-ceph and there is still a volume mounted with it:

/dev/rbd0 on /var/snap/microk8s/common/var/lib/kubelet/pods/4dbf852e-f740-4a9f-b72d-de1b50120983/volumes/kubernetes.io~csi/pvc-149ca422-8e37-48f4-b087-98cd31d06c43/mount type ext4 (rw,relatime,stripe=1024,data=ordered)

However, trying to ls some sub-directories within it hangs forever, and dmesg is full of errors like libceph: mon0 10.152.183.195:6789 socket error on read. So Ceph is failing to connect to its service, which runs inside the cluster, but flanneld has been stopped.

sync also hangs forever, and sure enough it looks like snapd has launched a sync on which it is presumably waiting.

So this is already a complex and therefore rather brittle set-up, and I think having snap auto-refreshes added to the mix makes failure much more likely. Having an option for it to be turned off permanently so that users can upgrade manually and fix these kinds of problems would be great for production use.
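
A rough way to check for this condition before a refresh (just a diagnostic sketch, not anything microk8s ships):

# list kubelet volume mounts that live under the snap's data directory
grep /var/snap/microk8s /proc/mounts

# a command that never returns on one of these paths points at a dead network filesystem
timeout 5 ls /var/snap/microk8s/common/var/lib/kubelet/pods || echo "kubelet pod dir appears hung"

# look for libceph / nfs timeout messages
dmesg | tail -n 100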

ShadowJonathan commented 4 years ago

For anyone reading this with the same issue: snapd currently doesn't allow any indefinite auto-update disabling, other than the suggestion below from a forum thread that hosts a bigger umbrella discussion about this: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/268

TL;DR: snap download foo ; snap install foo.snap --dangerous, replacing foo with the application in question.
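
Applied to microk8s that might look like the sketch below (channel and filename are illustrative; --dangerous installs the local file without store assertions, so snapd will not auto-refresh it, and --classic is needed because microk8s uses classic confinement):

snap download microk8s --channel=1.17/stable
sudo snap install ./microk8s_*.snap --dangerous --classic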

Personally, I think more comprehensive and cooperative options would be of great help: for example, (web)hooks on a new snap version, letting an external system handle refreshes one by one (draining a node before refreshing it, running some canary checks on it, then doing the same for the rest one by one, reverting entirely on the first error, plus an automatic email describing the failed upgrade). That would still be within the ethos of keeping snaps up to date, but the system has to trust sysadmins enough to make it possible.

skobow commented 4 years ago

I also have problems with my microk8s cluster that may be related to this issue. I am experiencing regular service failures that start almost exactly at 2am, at an interval I have not yet pinned down (every few days). The exact service is a VerneMQ MQTT server that uses MariaDB for authentication. The result is that authentication stops working after that (unknown) event has happened. The event could be related to snap activity, as I discovered snap restarting microk8s around that time. I have also seen this behaviour with other services. My assumption is that the failure may be related to persistent network connections that fail at the k8s level without the applications noticing. After rescheduling the corresponding pods everything works fine again.

I also would like to disable auto-refresh for microk8s to investigate the problem further and to prove my assumption. Does anyone have any other ideas?

ktsakalozos commented 4 years ago

Hi @skobow, is it possible you were following the latest/edge channel? What do you get from snap list | grep microk8s?

skobow commented 4 years ago

Hi @ktsakalozos, I am using the 1.19/stable channel, which currently installs v1.19.2.

ktsakalozos commented 4 years ago

Nothing got released on 1.19/stable. You could attach the microk8s.inspect tarball so we can take a look.

skobow commented 4 years ago

Find the tarball attached. The reason for my assumption is the output of snap changes, which shows:

ID   Status  Spawn                Ready                Summary
166  Done    today at 02:51 CEST  today at 02:51 CEST  Running service command for snap "microk8s"
167  Done    today at 02:51 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
168  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
169  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
170  Done    today at 02:52 CEST  today at 02:52 CEST  Running service command for snap "microk8s"
171  Done    today at 04:35 CEST  today at 04:35 CEST  Auto-refresh snaps "core", "snapd"
172  Done    today at 04:51 CEST  today at 04:51 CEST  Running service command for snap "microk8s"
173  Done    today at 04:51 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
174  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
175  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
176  Done    today at 04:52 CEST  today at 04:52 CEST  Running service command for snap "microk8s"
177  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
178  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
179  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
180  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
181  Done    today at 05:08 CEST  today at 05:08 CEST  Running service command for snap "microk8s"
182  Error   today at 10:27 CEST  today at 10:27 CEST  Change configuration of "core" snap
183  Done    today at 10:29 CEST  today at 10:29 CEST  Change configuration of "core" snap

snap change 166 then shows:

Status  Spawn                Ready                Summary
Done    today at 02:51 CEST  today at 02:51 CEST  restart of [microk8s.daemon-etcd]

The timestamps fit with when the service stops working. Even though there might not be any updates, something happens anyway. Could that be related? inspection-report-20201014_103630.tar.gz

skobow commented 4 years ago

Hi! FYI: exactly the same happened tonight at the same time. @ktsakalozos what are these service commands and why are they run?

ktsakalozos commented 4 years ago

I am not sure why snapd decides to restart MicroK8s. Could you attach the snapd log (journalctl -u snapd -n 3000)? If we do not see anything there we may need to ask over at https://forum.snapcraft.io/

skobow commented 4 years ago

@ktsakalozos find the log attached!

snapd.log

skobow commented 4 years ago

@ktsakalozos Any news on this topic?

ktsakalozos commented 4 years ago

@skobow in the snapd.log I see these failures:

Oct 16 13:04:32 k8s-master snapd[803]: storehelpers.go:551: cannot refresh: snap has no updates available: "core", "core18", "lxd", "microk8s", "snapd"
Oct 16 13:04:32 k8s-master snapd[803]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.snapcraft.io/api/v1/snaps/sections"

If you do not know what might be causing this we will go to https://forum.snapcraft.io/ and ask there.

pw10n commented 3 years ago

Hello. I believe I'm running into the same issue here as well.

$ snap changes
ID   Status  Spawn               Ready  Summary
198  Doing   today at 16:59 UTC  -      Auto-refresh snap "microk8s"
$ snap tasks 198
Status  Spawn               Ready               Summary
Done    today at 16:59 UTC  today at 16:59 UTC  Ensure prerequisites for "microk8s" are available
Done    today at 16:59 UTC  today at 17:04 UTC  Download snap "microk8s" (2074) from channel "1.20/stable"
Done    today at 16:59 UTC  today at 17:04 UTC  Fetch and check assertions for snap "microk8s" (2074)
Done    today at 16:59 UTC  today at 17:04 UTC  Mount snap "microk8s" (2074)
Done    today at 16:59 UTC  today at 17:04 UTC  Run pre-refresh hook of "microk8s" snap if present
Done    today at 16:59 UTC  today at 17:06 UTC  Stop snap "microk8s" services
Done    today at 16:59 UTC  today at 17:06 UTC  Remove aliases for snap "microk8s"
Done    today at 16:59 UTC  today at 17:07 UTC  Make current revision for snap "microk8s" unavailable
Doing   today at 16:59 UTC  -                   Copy snap "microk8s" data
Do      today at 16:59 UTC  -                   Setup snap "microk8s" (2074) security profiles
Do      today at 16:59 UTC  -                   Make snap "microk8s" (2074) available to the system
Do      today at 16:59 UTC  -                   Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 16:59 UTC  -                   Set automatic aliases for snap "microk8s"
Do      today at 16:59 UTC  -                   Setup snap "microk8s" aliases
Do      today at 16:59 UTC  -                   Run post-refresh hook of "microk8s" snap if present
Do      today at 16:59 UTC  -                   Start snap "microk8s" (2074) services
Do      today at 16:59 UTC  -                   Remove data for snap "microk8s" (1910)
Do      today at 16:59 UTC  -                   Remove snap "microk8s" (1910) from the system
Do      today at 16:59 UTC  -                   Clean up "microk8s" (2074) install
Do      today at 16:59 UTC  -                   Run configure hook of "microk8s" snap if present
Do      today at 16:59 UTC  -                   Run health check of "microk8s" snap
Doing   today at 16:59 UTC  -                   Consider re-refresh of "microk8s"

It appears whenever snap decides to auto-refresh, microk8s hangs on the copy step and never completes (taking the cluster down).

The only things that seemed to be effective were rebooting the machine or (as I recently discovered) aborting the auto-refresh:

$ sudo snap abort 198

... wait ...

$ sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

$ snap changes
ID   Status  Spawn               Ready  Summary
198  Abort   today at 16:59 UTC  -      Auto-refresh snap "microk8s"

$ snap list
Name      Version   Rev    Tracking         Publisher       Notes
core      16-2.49   10859  latest/stable    canonical✓      core
core18    20210128  1988   latest/stable    canonical✓      base
docker    19.03.13  796    latest/stable    canonical✓      -
helm3     3.1.2     5      latest/stable    terraform-snap  -
lxd       4.12      19766  latest/stable/…  canonical✓      -
microk8s  v1.20.2   2035   1.20/stable      canonical✓      disabled,classic
snapd     2.49      11107  latest/stable    canonical✓      snapd

$ sudo killall snapd

However, eventually the auto-refresh happens again...

~$ snap changes
ID   Status  Spawn               Ready               Summary
198  Undone  today at 16:59 UTC  today at 19:09 UTC  Auto-refresh snap "microk8s"
199  Doing   today at 19:14 UTC  -                   Auto-refresh snap "microk8s"

Reading this thread gave me the idea to look for unusual mounts that were lingering... and while I wasn't able to find references to libceph, I did see that the NFS mounts from the NFS provisioner running in my cluster were erroring out.

dmesg
[2465660.413978] nfs: server 10.152.183.19 not responding, timed out
[2465667.326132] nfs: server 10.152.183.19 not responding, timed out
[2465677.686357] nfs: server 10.152.183.19 not responding, timed out

mounts (partial)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/data on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/0 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/log on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/cert on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/2 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)
10.152.183.19:/export/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/init.d on /var/snap/microk8s/common/var/lib/kubelet/pods/0af536f4-3349-4860-a73c-1c62ce6fede7/volume-subpaths/pvc-375dbf2c-5c54-4e64-a649-6059c8926017/unifi/3 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.5.0.10,local_lock=none,addr=10.152.183.19)

Not sure if this is the actual cause but thought I'd share in case it was helpful to anyone. Did anyone find a resolution to this problem?
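
If one of the mounts above is confirmed dead (for example, ls on it hangs), a lazy/forced unmount is one way to clear it so the copy step can proceed - this is only a sketch, the path is a placeholder based on the listing above, and it can hide data loss, so use it with care:

# detach a dead NFS mount out from under the stuck refresh
sudo umount -f -l /var/snap/microk8s/common/var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<pvc>/unifi/0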

lpellegr commented 3 years ago

I experienced the same issue. Pods remain in terminating states. New ones are created but fail to run due to a connectivity issue that I can't reproduce outside of the pod. Removing deployments and services before recreating them does not help. I had to reinstall the whole cluster after resetting all nodes.

Name      Version   Rev    Tracking       Publisher   Notes
core      16-2.49   10859  latest/stable  canonical✓  core
core18    20210128  1988   latest/stable  canonical✓  base
lxd       4.0.5     19188  4.0/stable/…   canonical✓  -
microk8s  v1.20.4   2074   latest/stable  canonical✓  classic
snapd     2.49      11107  latest/stable  canonical✓  snapd
Mar 17 11:30:12 api-bhs snapd[47810]: stateengine.go:150: state ensure error: cannot sections: got unexpected HTTP status code 403 via GET to "https://api.sna>
Mar 17 11:30:24 api-bhs snapd[47810]: main.go:155: Exiting on terminated signal.

t-o-o-m commented 3 years ago

Came across a probably related problem - snap refreshed microk8s and took the cluster down - all pods are then in a "sandbox changed" state. The same goes for node reboots, btw. A microk8s restart usually helps, but sometimes I have to start from scratch to get it to work again.

  1. one can't decide when updates should be performed (I know about the possibility of postponing or periodic scheduling, but I'd rather set a single point in time for potential downtimes). It might be possible with --devmode, but then I'd have to switch to a dev/edge channel
  2. a simple update (and reboot, in my case) takes the cluster down, with manual work needed to get it up and running again

It would be great to tackle one of those two. Happy to provide any logs, as I can easily reproduce the sandbox issue.

I'd be careful with "reliable production-ready Kubernetes distribution" (from https://ubuntu.com/blog/introduction-to-microk8s-part-1-2) until then :)

andrew-landsverk-win commented 3 years ago

I also came across this issue today, also running rook-ceph in a 3-node cluster. The Rook/Ceph cluster works perfectly fine otherwise.

lfdominguez commented 3 years ago

Great, thanks snap autorefresh, you have crashed my entire cluster with this: /snap/microk8s/2338/bin/dqlite: symbol lookup error: /snap/microk8s/2338/bin/dqlite: undefined symbol: sqlite3_system_errno

pw10n commented 3 years ago

In case it's helpful to anyone here, I was able to permanently disable the auto-refresh by disabling the snapd service.

To disable:

sudo systemctl stop snapd.service
sudo systemctl mask snapd.service

To re-enable:

sudo systemctl unmask snapd.service
sudo systemctl start snapd.service

Since doing this, I haven't had any stability issues with my cluster at all. This is my temporary fix until I have time to migrate my cluster to k3s or something that actually works.
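
One note on this workaround: snapd is also socket-activated via snapd.socket, so a slightly more thorough variant (an assumption, not something reported as tested above) masks the socket unit as well, which stops systemd from repeatedly trying to start the masked service:

# stop and mask both the daemon and its activation socket
sudo systemctl stop snapd.service snapd.socket
sudo systemctl mask snapd.service snapd.socket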

ShadowJonathan commented 3 years ago

None of these are permanent fixes; snap will never implement disabling auto-updates, and this will keep being a problem.

I just suggest not touching microk8s at all: use it only for development purposes, and ban it from all production purposes.

ktsakalozos commented 3 years ago

The Kubernetes project ships a few releases every month [1]. These releases include security, bug and regression fixes. Every production-grade Kubernetes distribution should have a mechanism to ship such fixes, even before they are released upstream. For MicroK8s this mechanism is snaps. Snaps allow us to keep your Kubernetes infrastructure up to date, not only with fresh Kubernetes binaries but also with updates/fixes to the integrations with the underlying system and the Kubernetes ecosystem.

If you do not want to take the risk of automated refreshes you have at least two options: control when refreshes happen using snapd's refresh settings [3], or gate updates behind a Snap Store Proxy [2].

[1] https://github.com/kubernetes/kubernetes/releases
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://snapcraft.io/docs/keeping-snaps-up-to-date

ShadowJonathan commented 3 years ago

@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.

ktsakalozos commented 3 years ago

@ShadowJonathan, I am not sure why you mention only security and in quotes. Any update that breaks the cluster is defeating its own purpose.

For anyone who wants to contribute back to this project, we would be grateful if you could run non-production clusters with the candidate channels of the track you follow, for example 1.20/candidate. Normally, the stable channel gets updated with what is on candidate after about a week. Having candidate releases well tested in a large variety of setups would be great.
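
Switching a test node onto a candidate channel is a one-liner, for example (the track is illustrative):

# follow the candidate channel of the 1.20 track on a non-production node
sudo snap refresh microk8s --channel=1.20/candidate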

ShadowJonathan commented 3 years ago

I am not sure why you mention only security and in quotes.

Your point was that security is paramount and absolute, and that it should be the excuse that makes this problem okay. It's not; it's an excuse that only exacerbates this problem, and the problem with snap on servers in general.

Snaps are fine for user apps; those can deal with being restarted, crashing, and shutting down, again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something updates; they want complete control over their systems, and snap's auto-updating feature is a complete insult to that.

Any update that breaks the cluster is defeating its own purpose.

I'm glad you agree, then? I'd rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it's down to my own oversight and my own fault (at least then I can tune it to my own schedule and my own system). With auto-update, and even with the update window, that control is taken away from me, as now I have to scramble to make sure the eventual update will not fuck with my system, and then to do it manually, safely, and in a controlled way to make sure it does not fuck over the data. (Which it did for me: 1.2TB of scraping data, all corrupted because docker didn't want to close within 30 seconds, after which it got SIGKILLd.)

As a sysadmin, I control a developer's software: when, where, and how. The developer doesn't control my system unless I tell it to. And even then, only on my own terms.

Snaps violated this principle, and that's why I'm incredibly displeased with them.

lfdominguez commented 3 years ago

@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose.

I'm thinking the same.... but they keep saying it is production ready... really???

ShadowJonathan commented 3 years ago

@lfdominguez branding

lfdominguez commented 3 years ago

But if microk8s moved away from snap.... or used another method, like a self-contained executable (like k3s or k0s), I think that would be better - you would get away from the insane snap auto-refresh....

lpellegr commented 3 years ago

It looks like an option in snapd to disable updates for a given package would satisfy most people who are complaining here (me included).

Strangely, issues have been disabled on https://github.com/snapcore/snapd. I wonder where an issue could be created to discuss new options to snapd.

Unfortunately, it looks like people working on the project have a very strong opinion on the auto-refresh feature so I wonder if it is worth spending more time: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/4

Perhaps, it's time to look at other solutions like k3s as it was already mentioned.

lfdominguez commented 3 years ago

It looks like an option in snapd to disable updates for a given package would satisfy most people who are complaining here (me included).

Strangely, issues have been disabled on https://github.com/snapcore/snapd. I wonder where an issue could be created to discuss new options to snapd.

Unfortunately, it looks like people working on the project have a very strong opinion on the auto-refresh feature so I wonder if it is worth spending more time: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/4

Perhaps, it's time to look at other solutions like k3s as it was already mentioned.

Disabling the issues and calling themselves community-driven.... it is a shame, but well, my cluster is already UP with 50 nodes, so I will go with the 127.0.0.1 thing.... in the future, if I get some free time, I will fork the snapd project and patch in an auto-refresh option....

ktsakalozos commented 3 years ago

disable updates for a given package would satisfy most people who are complaining here

You are given this option. You can block, schedule or postpone refreshes. You are even given the option to try an "offline" manual deployment that never updates. Although this may suit your needs, it is very hard from the Kubernetes distribution perspective to call this a good practice.

lpellegr commented 3 years ago

@ktsakalozos As was explained, scheduling and postponing are not an option. We want full control, meaning we would like to disable auto-updates but keep the possibility of triggering an update manually when we want.

Blocking would be great, but setting up a snap store proxy looks overly complex and quite low-level.

You are even given the option to try an "offline" manual deployment that never updates.

Thanks for pointing this out. This looks like a solution, although it seems we need to first download a snap file locally before installing it.

lfdominguez commented 3 years ago

Really, really... I have yet to read any real technical impediment that the developers are relying on to not enable that functionality. All I read is arguments about whether it is better for the user or not. Please - not all users are dumb; there are many who manage technical services, and not having that option is like telling them "You don't know what you are doing, let us deal with it".

ShadowJonathan commented 3 years ago

You are given this option. You can block, schedule or postpone refreshes.

Block

False - point me to where exactly in your comment you said that, and keep in mind I don't consider the snap store proxy to be a solution to this.

lpellegr commented 3 years ago

@lfdominguez I have the same feeling. Even if I understand the point of enabling auto-refresh by default, doing everything possible to prevent people from disabling the feature, and keeping it obscure, is really sad.

lalinsky commented 3 years ago

And this happened again with the 1.19.14 release:

/snap/microk8s/2408/bin/dqlite: symbol lookup error: /snap/microk8s/2408/bin/dqlite: undefined symbol: sqlite3_system_errno

I was so happy with microk8s so far, but this is just bad.

ktsakalozos commented 3 years ago

@lalinsky where do you see this error? Is this on some daemon logs or do you get this when you execute a command?

Could you attach a microk8s.inspect tarball?

ktsakalozos commented 3 years ago

@lalinsky I see it now, it shows up in microk8s.status. I just reverted the release on the 1.19 stable channel. If you snap refresh microk8s on your nodes you should get revision 2339.

JeongJuhyeon commented 3 years ago

@ktsakalozos It's happening again now with 2483, same error as before: /snap/microk8s/2483/bin/dqlite: symbol lookup error: /snap/microk8s/2483/bin/dqlite: undefined symbol: sqlite3_system_errno

ktsakalozos commented 3 years ago

@JeongJuhyeon could you please run sudo snap refresh microk8s?

vazir commented 3 years ago

Today I experienced a crash of a PRODUCTION microk8s 3-node "HA" cluster. It just auto-updated to 1.21.5! As a programmer and admin, my mind cannot comprehend what the people deciding on packaging for crucial services have in mind when they choose a tool as broken-by-design as snap??? Why does UBUNTU use it at all, when it is hardly suitable even for desktop apps and not suitable for services at all??? What if some medical outfit bought into the "highly available" advertising and people died because it auto-updates??? They should drop snap for anything aside from desktop apps - better yet, drop it altogether and use the proven-by-years .deb ...

JeongJuhyeon commented 3 years ago

@ktsakalozos It's one of the first things we tried, but it did not help. Not sure about now; it's a staging server and it's after hours now.

@vazir Absolutely - it has now become a priority for us to migrate away from microk8s. It's a bit irresponsible that the docs/GitHub page does not have a big warning saying "microk8s should not be used in any kind of situation where multiple people depend on it", given this is obviously a complete deal-breaker for any kind of serious usage. Going by this thread, this has been an issue from day 1 and is never going to change with snap's unwillingness to make auto-refresh easy to disable. The only way any kind of auto-update system is reasonable for serious applications is if you're Microsoft and have complete control over every line of update code you ship, as well as gigantic resources to test every possible setup and scenario. And even then they still let you disable it on systems meant for serious use cases.

vazir commented 3 years ago

I suspect the people behind snap even scoff at us, as it seems they do not allow delaying auto-refresh by more than 60 days.

# snap set system refresh.hold=2050-01-01T15:04:05Z
# snap refresh --time
timer: 00:00~24:00/4
last: today at 10:22 UTC
hold: in 60 days, at 10:22 UTC
next: today at 17:25 UTC (but held)

ktsakalozos commented 3 years ago

@JeongJuhyeon please try to refresh.

@vazir, @JeongJuhyeon if you want to take control over the updates you can either schedule them at a convenient time or install the snap store proxy mentioned in a previous comment. If you are certain you will not need updates in the future you can perform an offline deployment that will never update; we describe the few commands you need to run in the MicroK8s docs. You do however understand that MicroK8s cannot stop shipping patch releases and updates.
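
As a concrete example of the scheduling option, refreshes can be pinned to a maintenance window with the refresh.timer setting (the window below is illustrative, not a recommendation):

# only allow refreshes on Saturdays between 03:00 and 05:00, then confirm the schedule
sudo snap set system refresh.timer=sat,03:00-05:00
snap refresh --time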

@vazir I am interested to know how the cluster crashed and what you had to do to recover it. Thank you.

ShadowJonathan commented 3 years ago

@ktsakalozos I'm still awaiting an answer to my question

vazir commented 3 years ago

@vazir I am interested to know how the cluster crashed and what you had to do to recover it. Thank you.

I had to restart all 3 nodes to bring it up again.

you do however understand that MicroK8s cannot stop shipping patch releases and updates.

For the rest of what you mentioned, you probably understand that everything you suggested is a hack, and totally unsuitable for a distribution? Nobody is suggesting you stop distributing patches, but do you understand that NO ONE in their right mind auto-updates critical services like databases, Kubernetes clusters, and similar stuff? You deploy something highly resilient, highly available, highly redundant like k8s, which on its own ATTEMPTS to be resistant to failures, trying to keep itself alive... AND YOU then "kill -9" it from the outside??? What is the whole point of developing microk8s at all if you totally ruin its redundancy and resiliency simply by mischoosing the distribution tool?

vazir commented 3 years ago

@ktsakalozos I'm sure you do understand that a simultaneous update of the package on any number of nodes WILL lead to a service stop, as ALL the nodes restart the service at the same time? For Kubernetes, you (normally) do not just restart a node: you drain it, make sure the other nodes pick up the pods, and then upgrade/maintain it. So you just CANNOT distribute patches the way you do. Yes, you just cannot. It may sound a little strong, but can you imagine when lives depend on the stuff you run, or mass transport, or... whatever, where multiple people are affected? You DO call microk8s SUPER reliable, so ANY upgrade, even a minor one, must be done by the admin, and the admin might then choose to implement auto-upgrade, because the admin will do it right - draining the node, and so on.

Dart-Alex commented 3 years ago

How is there still no option to disable auto-updates and only do updates manually? My production cluster went down today because snap killed a node. I had to work after hours to bring it back up. When Ceph runs on a cluster, it needs very manual and very gentle intervention to bring a node down gracefully. The way snap killed the node made connectivity errors pop up everywhere, and I had to hard-reboot every single node in the cluster to get back full control. After a year and a half, it's unbelievable that we still can't disable auto-updates!