canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

microk8s crashes with "FAIL: Service snap.microk8s.daemon-apiserver is not running" #1598

Closed - raohammad closed this 1 year ago

raohammad commented 3 years ago

Hello, I have installed a 3-node microk8s cluster. Everything works great for a couple of days, but then it crashes for no evident reason with the apiserver reported as FAILed.

Below is the microk8s inspect output; the tarball inspection-report-20200925_103006.tar.gz is attached.

Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
 **FAIL:  Service snap.microk8s.daemon-apiserver is not running**
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-control-plane-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

Building the report tarball
  Report tarball is at /var/snap/microk8s/1719/inspection-report-20200925_103006.tar.gz

This is not the first time it has happened. My attempt to deploy a small production cluster based on microk8s is blocked because of this problem in the test environment.

balchua commented 3 years ago

@raohammad Thank you for reporting this.
May I know if kube-apiserver crashed on all nodes, or just on one of them? Thanks!

jeanlucmongrain commented 3 years ago

Same problem here. I investigated a bit by running sudo /usr/bin/snap run microk8s.daemon-apiserver manually. Here is the output:

Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0926 17:47:00.424759 2403934 server.go:647] external host was not specified, using 192.168.1.28
W0926 17:47:00.424979 2403934 authentication.go:484] AnonymousAuth is not allowed with the AlwaysAllow authorizer. Resetting AnonymousAuth to false. You should use a different authorizer
Segmentation fault (core dumped)

So, the apiserver is segfaulting. That is very annoying, as the binary is stripped and I can't get a Go stack trace.

@raohammad please run that command; maybe you have the same problem.
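
(For anyone investigating a similar crash, two generic ways to pull more detail out of the failing service; the coredumpctl part assumes systemd-coredump is installed:)

# recent logs from the failing apiserver service
sudo journalctl -u snap.microk8s.daemon-apiserver -n 200 --no-pager

# if systemd-coredump is installed, list and inspect the core dump left by the segfault
coredumpctl list
coredumpctl info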

balchua commented 3 years ago

@bclermont are you running multiple nodes?
Thanks!

jeanlucmongrain commented 3 years ago

are you running multiple nodes?

yes, 2, but I stopped the non-master node to test and it still happens

jeanlucmongrain commented 3 years ago

I'm building k8s 1.19.2 API server from source to get the stack trace

balchua commented 3 years ago

Thanks @bclermont, do you mind uploading the inspect tarball from the failing node? All nodes on 1.19 will have the apiserver.

Cc @ktsakalozos @freeekanayaka - is there any other information you would need from the users? Thanks.

jeanlucmongrain commented 3 years ago

I'm building k8s 1.19.2 API server from source to get the stack trace

I just tried using stock k8s, but it looks like it's different from the one shipped with microk8s :(

do you mind uploading the inspect tarball from the failing node

there is a lot of info in there that I'd rather not expose. Is there anything in that tarball you need more than the rest?

jeanlucmongrain commented 3 years ago

I just realized something: snap upgraded microk8s from 1667 to 1710 immediately before the problem appeared.

As I can't get a binary with debug symbols for now, I'm trying to downgrade...

raohammad commented 3 years ago

@raohammad Thank you for reporting this. May I know if kube-apiserver crashed on all nodes, or just on one of them? Thanks!

Out of three nodes, two report 'FAIL' and on one it is running. Please let me know if the inspect tarball from the other nodes would be of interest.

balchua commented 3 years ago

@bclermont MicroK8s embeds dqlite into the apiserver in order to achieve HA without etcd. That's the only difference from upstream.

If you are running a long-lived cluster, we recommend sticking to a specific Kubernetes version channel, e.g. 1.18/stable.

Moving from one minor Kubernetes version or channel to another usually introduces breaking changes.
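
(Pinning to a channel is a single snap command; a rough example, using the 1.18/stable channel mentioned above:)

# list the channels microk8s publishes
snap info microk8s

# track a specific release channel instead of latest
sudo snap refresh microk8s --channel=1.18/stable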

freeekanayaka commented 3 years ago

Thanks @bclermont, do you mind uploading the inspect tarball from the failing node? All nodes on 1.19 will have the apiserver.

Cc @ktsakalozos @freeekanayaka - is there any other information you would need from the users? Thanks.

I think it'd be useful to build libdqlite and libraft by passing --debug to ./configure, as discussed with @ktsakalozos. That might provide a bit more info about the crash.
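
(A rough sketch of such a build, assuming the usual autotools layout of the canonical/raft and canonical/dqlite repositories; the exact debug flag is the one suggested above, so verify it with ./configure --help:)

git clone https://github.com/canonical/raft && cd raft
autoreconf -i
./configure --debug    # flag name as suggested above
make && sudo make install
cd .. && git clone https://github.com/canonical/dqlite && cd dqlite
autoreconf -i
./configure --debug
make && sudo make install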

shadowmodder commented 3 years ago

+1 on this issue; unable to bring up the apiserver due to a segmentation fault.

jeanlucmongrain commented 3 years ago

unable to bring up the apiserver due to a segmentation fault.

@shadowmodder you should try downgrading to the previous release

I think it'd be useful to build libdqlite and libraft by passing --debug to ./configure, as discussed with @ktsakalozos. That might provide a bit more info about the crash.

I needed to bring that cluster back ASAP, so I followed @balchua's advice and reinstalled the cluster with 1.18/stable.

balchua commented 3 years ago

@bclermont you can use 1.19 multi-node by disabling ha-cluster, using the command microk8s disable ha-cluster before joining the nodes. This will behave like 1.18, where etcd and flannel are used instead of dqlite and calico, of course minus the high availability.
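
(Roughly, the sequence on a fresh install would look like the following; the join string is whatever microk8s add-node prints, shown here as a placeholder:)

# on every node, before joining
microk8s disable ha-cluster

# on the first node, generate a join token
microk8s add-node

# on each additional node, paste the join command printed above, e.g.
microk8s join <first-node-ip>:25000/<token>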

jeanlucmongrain commented 3 years ago

@balchua well, it's too late; I already downgraded to 1.18. I just meant that I can't run a debug-enabled version of the binary anymore.

raohammad commented 3 years ago

@bclermont you can use 1.19 multi-node by disabling ha-cluster, using the command microk8s disable ha-cluster before joining the nodes. This will behave like 1.18, where etcd and flannel are used instead of dqlite and calico, of course minus the high availability.

@balchua

node1:~$ microk8s disable ha-cluster
Addon ha-cluster is already disabled.
ubuntu@node1:~$ microk8s stop
Stopped.
ubuntu@node1:~$ microk8s start
Started.
ubuntu@node1:~$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
ubuntu@node1:~$ microk8s inspect
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
 FAIL:  Service snap.microk8s.daemon-apiserver is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-control-plane-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

Building the report tarball
  Report tarball is at /var/snap/microk8s/1719/inspection-report-20200927_095354.tar.gz

balchua commented 3 years ago

@raohammad I'm sorry about this one. The apiserver is still failing.

ktsakalozos commented 3 years ago

@raohammad, @bclermont, @VaticanUK we are actively working on the issue. Here is a summary of what we have done so far.

Thank you @balchua, @freeekanayaka, @devec0 for your efforts.

The above fixes/enhancements are available from the latest/edge and 1.19/edge channels. Any feedback you could give us on those channels would be much appreciated.

digitalrayne commented 3 years ago

Updated to latest/edge and will monitor. The refresh went well. My cluster had fallen apart again over the weekend on 1724; I've now refreshed all nodes to 1730 and will be sure to provide any feedback here and on #1578. Thanks again @freeekanayaka and @ktsakalozos for all of your hard work on this; dqlite is a great alternative to etcd (especially on smaller systems) and I can see the stability of the technology growing before my eyes :)

VaticanUK commented 3 years ago

Looking good so far! (screenshot attached)

Thanks all, i'll update with any further issues if they happen!

chris-sanders commented 3 years ago

I've got a 4-node cluster that was running the 1710 revision and has triggered this on 2 of the nodes. I've attached an inspection report: inspection-report-20200928_185745.tar.gz

I've tried to refresh to 1.19/edge, which fails with the following.

# snap refresh microk8s --channel=1.19/edge
error: cannot perform the following tasks:
- Run configure hook of "microk8s" snap if present (run hook "configure":
-----
++ date +%s
+ start_timer=1601322854
+ timeout=120
+ KUBECTL='/snap/microk8s/1732/kubectl --kubeconfig=/var/snap/microk8s/1732/credentials/client.config'
+ sleep 5
++ date +%s
+ now=1601322866
+ [[ 1601322866 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322871
+ [[ 1601322871 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322876
+ [[ 1601322876 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322881
+ [[ 1601322881 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322886
+ [[ 1601322886 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322892
+ [[ 1601322892 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322897
+ [[ 1601322897 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322902
+ [[ 1601322902 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322907
+ [[ 1601322907 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322912
+ [[ 1601322912 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322917
+ [[ 1601322917 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322922
+ [[ 1601322922 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322927
+ [[ 1601322927 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322932
+ [[ 1601322932 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322937
+ [[ 1601322937 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322943
+ [[ 1601322943 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322948
+ [[ 1601322948 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322953
+ [[ 1601322953 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322958
+ [[ 1601322958 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322963
+ [[ 1601322963 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322968
+ [[ 1601322968 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322973
+ [[ 1601322973 > 1601322974 ]]
+ sleep 5
++ date +%s
+ now=1601322978
+ [[ 1601322978 > 1601322974 ]]
+ break
+ /snap/microk8s/1732/kubectl --kubeconfig=/var/snap/microk8s/1732/credentials/client.config apply -f /var/snap/microk8s/1732/args/cni-network/cni.yaml
The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?
-----)

I'm considering kicking the nodes out of the cluster and re-joining at this point, but I'd like to see if there's anything else I can do while I'm in this state. Maybe I'll try to get one node re-joined and leave the other broken for further troubleshooting.
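
(For reference, the usual remove/re-join dance is roughly the following, with <hostname> standing in for the broken node's name as shown by kubectl get node:)

# on the departing (broken) node
microk8s leave

# on one of the healthy nodes, drop the stale membership
microk8s remove-node <hostname>

# then re-join as usual: add-node on a healthy node, join on the wiped node
microk8s add-node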

digitalrayne commented 3 years ago

You might want to check out your dqlite data first, and see if you can restore it to a running state before refreshing. In the past I've also found you can't jump too many revisions forward without triggering that behaviour. When I originally switched from 1.19 to latest, I had to re-bootstrap after a microk8s reset, but that was also before I worked out how to fix up the dqlite data in #1578 - so you could give the linked docs a try, including some of @freeekanayaka's advice as well, and see if that gets your cluster running again before trying to refresh. IIRC, 16443 is the apiserver port and won't come up until the datastore (dqlite in our case) is running.
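
(A minimal way to snapshot the dqlite state before experimenting, assuming the default snap paths; the backend directory is the one @freeekanayaka points to later in this thread:)

# stop services so the files are quiescent, copy the datastore aside, then restart
microk8s stop
sudo cp -a /var/snap/microk8s/current/var/kubernetes/backend ~/dqlite-backend-backup
microk8s start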

chris-sanders commented 3 years ago

So far I have:

At this point inspect is not reporting that the apiserver is down; however, I'm still on the 1710 revision, and the node doesn't actually come online.

I have a copy of the bad /backend folder, but I've kind of run out of ideas at this point. If there's more I can do, point me in the right direction. I'm willing to leave this node down/out for a little while to see if we can trace down what's going on.

chris-sanders commented 3 years ago

It was pointed out that I didn't stop the 'good' node before making the backend backup. I repeated the backend backup after draining the node. I also added -v 50 (larger is more verbose?) to the api-server args.
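
(For reference, a sketch of raising apiserver verbosity on a node, assuming the default snap layout; kube-apiserver log levels normally run from 0 to around 10, so --v=4 or so is usually plenty:)

# append a verbosity flag to the apiserver args file
echo '--v=4' | sudo tee -a /var/snap/microk8s/current/args/kube-apiserver

# restart so the new flag is picked up, then follow the logs
microk8s stop && microk8s start
sudo journalctl -u snap.microk8s.daemon-apiserver -f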

After the restore, microk8s status still isn't happy, and I now get more logging. Here's the microk8s.daemon-apiserver log. https://pastebin.ubuntu.com/p/Zf3xSKk359/

That seems to be repeating and I expect it will eventually just stop retrying and fail.

raohammad commented 3 years ago

@raohammad, @bclermont, @VaticanUK we are actively working on the issue. Here is a summary of what we have done so far.

Thank you @balchua, @freeekanayaka, @devec0 for your efforts.

The above fixes/enhancements are available from the latest/edge and 1.19/edge channels. Any feedback you could give us on those channels would be much appreciated.

I have upgraded the cluster to latest/edge v1.19.2-34+340f0ec18a2657; so far so good. Normally it would happen within 2 to 3 days, so hopefully we can be certain by the end of this week.

chris-sanders commented 3 years ago

I'm not sure there's much more I'm getting out of the node I have crashing right now. I'm going to proceed to wipe it and re-add it to the cluster. I'll roll the cluster and move to /edge to see if I get a reproducer. I got it on 2 nodes at the same time but didn't see this for ~20 days, so I'm fairly sure it's due to an unplanned power outage and hard reboot, although the machine is running on an NVMe with guaranteed write protection and only K8s seems to have noticed.

VaticanUK commented 3 years ago

I just had a node break and go unavailable again, but it came back up.

I noticed it went down, so I drained it to try to get my pods to reschedule on a different node, but then as soon as it was drained it came back to Ready, so I uncordoned it and it's all happy again!

logs etc attached: inspection-report-20200930_200024.tar.gz

mfpnca commented 3 years ago

Same issues and steps as Chris. But as VaticanUK has done, I was able to drain the offending node and re-join it with success (so far). The node was up for about 13 days, and may have become an issue when the host went down for maintenance. This is one node in a four-node cluster on v1.19.2-34+1b3fa60b402c1c.

ktsakalozos commented 3 years ago

In case you follow the latest/edge channel you may observe service restarts as a result of code changes in the master branch of this repository (reflected in snap refreshes). These restarts should not take longer than a few minutes. If a node does not recover from such a restart we will need to look into it. Also, if you believe that these restarts affect the "experiment" we conduct here, I can create a channel branch we could follow.

VaticanUK commented 3 years ago

@ktsakalozos that could be it in my case, but I only noticed it on one node. Maybe it happened on the other nodes when I wasn't watching?

Aaron-Ritter commented 3 years ago

It is probably good practice, especially for systems which need to be more reliable, to modify the snap automatic refresh schedule, e.g.: sudo snap set system refresh.timer=sun,00:00~01:00
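
(The resulting schedule can be confirmed with snap itself, e.g.:)

# set a weekly refresh window, then check what snapd will actually do
sudo snap set system refresh.timer=sun,00:00~01:00
snap refresh --time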

VaticanUK commented 3 years ago

I'll do that now, should at least prevent confusion in the future! :)

VaticanUK commented 3 years ago

This evening I noticed that my cluster is totally unreachable via the k8s API. I logged into my first node and ran microk8s status. The response took quite a long time and came back: microk8s is not running. Use microk8s inspect for a deeper inspection.

microk8s inspect took an absolute age to complete. Nothing obviously wrong in the console output (though status would suggest otherwise!), and the tarball is attached:

inspection-report-20201003_213248.tar.gz

microk8s inspect for my 4th node (a hot backup, which never seems to have been brought into play...) completed fairly quickly:

inspection-report-20201003_215022.tar.gz

the inspect command on my other two nodes is still running (it must have been running for about 30 mins now...)

One very odd thing I did notice: ssh'ing into my nodes caused an SSH host key validation failure on all 4 nodes. I'm fairly certain that I've ssh'd into them from the laptop I'm using to write this since I set them up, but I can't think of any reason why the host key on all 4 nodes would have changed? It could be totally unrelated, but I figured it may be worth mentioning.

Also worth mentioning that although kubectl can't talk to the cluster and I can't access some services, some other services do still seem to be running: I have mosquitto and node-red running on there, and although I can't access mosquitto, some of my node-red flows which act upon topics in mosquitto seem to be running correctly...

I'll comment again later (or tomorrow!) with the logs from the other two nodes

VaticanUK commented 3 years ago

Here is the tarball from my second node - it looks like it took about 2.5 hours to complete:

inspection-report-20201004_004907.tar.gz

the command still hasn't completed on my third node, so I've kicked it off again

balchua commented 3 years ago

Thanks @VaticanUK for providing valuable info. I did a quick check on this inspection-report-20201003_215022.tar.gz and noticed that it doesn't have the same specs as the other 2; the memory seems to be around 1G.

Meanwhile, in the last one, inspection-report-20201004_004907.tar.gz, the available memory is almost used up.

It could just be an effect of what's happening.

VaticanUK commented 3 years ago

hmm, memory usage is quite high on some nodes (memory usage screenshots attached for nodes 1-4).

and yes, I have 2 nodes with 4GB, plus 1 node and my hot backup with 2GB. I thought K8s could handle this and would take memory pressure into account when scheduling pods?

Given that at least one of the nodes (node 2) is only at around 50% memory usage, if there was an issue with memory usage on node 1 that caused problems with it being the master, I would expect the HA machinery to move the master role to node 2? Is that not the case? Does the node have to be totally offline for this to happen?

balchua commented 3 years ago

I don't know if the HA control plane components are CPU or memory aware and smart enough to decide which nodes will be made part of the voting set.

Looking at the screenshots you provided, the load average of nodes 1 and 3 is extremely high too. Do you know which process is using up most of the CPU and memory?

Thanks again!

VaticanUK commented 3 years ago

node1 - seems to be kube-apiserver for both (screenshot attached)

node2 - the same (screenshot attached)

node3 - memory is kube-apiserver, CPU is kubelet (screenshot attached)

I won't bother posting node4 since it's not really doing anything (being a hot backup)

VaticanUK commented 3 years ago

To update a bit further on this: I rebooted nodes 1 and 2 (which took far longer than I expected!) and when they both came back up, everything came back up fine.

ktsakalozos commented 3 years ago

@VaticanUK @freeekanayaka node 4 seems to be failing to start its apiserver and dqlite with an error in starting raft:

Oct 02 11:29:54 k8snode4 microk8s.daemon-apiserver[3603097]: W1002 11:29:54.429640 3603097 authentication.go:484] AnonymousAuth is not allowed with the AlwaysAllow authorizer. Resetting AnonymousAuth to false. You should use a different authorizer
Oct 02 11:29:54 k8snode4 microk8s.daemon-apiserver[3603097]: Error: start node: raft_start(): io: load closed segment 0000000000003726-0000000000003822: found 96 entries (expected 97)
Oct 02 11:29:54 k8snode4 systemd[1]: snap.microk8s.daemon-apiserver.service: Main process exited, code=exited, status=1/FAILURE

@freeekanayaka is there a way to fix this (e.g. drop some of the corrupted segments) or should we do a reinstall of that node (microk8s leave and microk8s remove-node)?

freeekanayaka commented 3 years ago

@VaticanUK @freeekanayaka node 4 seems to be failing to start its apiserver and dqlite with an error in starting raft:

Oct 02 11:29:54 k8snode4 microk8s.daemon-apiserver[3603097]: W1002 11:29:54.429640 3603097 authentication.go:484] AnonymousAuth is not allowed with the AlwaysAllow authorizer. Resetting AnonymousAuth to false. You should use a different authorizer
Oct 02 11:29:54 k8snode4 microk8s.daemon-apiserver[3603097]: Error: start node: raft_start(): io: load closed segment 0000000000003726-0000000000003822: found 96 entries (expected 97)
Oct 02 11:29:54 k8snode4 systemd[1]: snap.microk8s.daemon-apiserver.service: Main process exited, code=exited, status=1/FAILURE

@freeekanayaka is there a way to fix this (e.g. drop some of the corrupted segments) or should we do a reinstall of that node (microk8s leave and microk8s remove-node)?

Yes, for this particular bug, you should be able to fix it by deleting the offending segment file 0000000000003726-0000000000003822 and all the ones that follow (i.e. the ones with greater numbers).

VaticanUK commented 3 years ago

thanks @freeekanayaka - where do I find those files please?

VaticanUK commented 3 years ago

nm, found them :)

freeekanayaka commented 3 years ago

For reference, they should be under /var/snap/microk8s/current/var/kubernetes/backend.
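
(Putting the two comments together, a rough recovery sequence on the affected node, using the segment name from the error above as an example and taking a backup first:)

microk8s stop
sudo cp -a /var/snap/microk8s/current/var/kubernetes/backend ~/backend-backup
cd /var/snap/microk8s/current/var/kubernetes/backend
ls -1                                         # find the segment named in the error and any with higher numbers
sudo rm 0000000000003726-0000000000003822     # plus every later segment
microk8s start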

VaticanUK commented 3 years ago

That worked, thanks @freeekanayaka - it left me with nodes 1, 2 and 4 working (node 3 hasn't been Ready since the problems I had over the weekend).

So this afternoon I tried getting node 3 back up.

First I removed it from the cluster, then added it again, and it stayed unhappy. Next I removed it from the cluster, uninstalled microk8s, cleared down journalctl to make it easier to see issues, reinstalled microk8s and added it back to the cluster.

That seemed to work, but then the apiserver and kubectl started taking a long time to respond to anything, node 3 never became Ready, and node 4 became NotReady.

So I kicked off microk8s inspect on each node.

Inspect took a while on node 1 (below), but a few hours later it still hasn't completed on any other node. node 1 - inspection-report-20201005_141401.tar.gz

At some point during the time inspect was running on node 1, node 1 totally stopped responding via the API and kubectl (which I think uses the API anyway?), so I kicked off another inspect in case the first one missed that last bit: node 1 - inspection-report-20201005_142630.tar.gz

My guess is that if I reboot the nodes, they'll (mostly!) come back up fine, but I'll wait until inspect has finished on all 4 nodes so that I can upload them here (just in case there is something useful there!)

VaticanUK commented 3 years ago

inspect has finished on nodes 2 and 3, but 4 is still going...
node 2 - inspection-report-20201005_170456.tar.gz
node 3 - inspection-report-20201005_162436.tar.gz

VaticanUK commented 3 years ago

Rebooted the nodes; they didn't come back up.

microk8s status said microk8s wasn't running. microk8s inspect said everything looked fine. Looking in the logs, there wasn't really anything to see except:

Oct 05 22:12:16 k8snode1 microk8s.daemon-apiserver[2336]: I1005 22:12:16.964595    2336 server.go:647] external host was not specified, using 192.168.8.8
Oct 05 22:13:52 k8snode1 microk8s.daemon-apiserver[2336]: Error: context deadline exceeded
Oct 05 22:13:52 k8snode1 systemd[1]: snap.microk8s.daemon-apiserver.service: Main process exited, code=exited, status=1/FAILURE

I left it overnight, thinking I'd probably need to rebuild it today, but this morning it seems almost happy.

ubuntu@k8snode1:~$ microk8s kubectl get node
NAME       STATUS     ROLES    AGE     VERSION
k8snode4   NotReady   <none>   20h     v1.19.2-34+867242ed63887f
k8snode3   Ready      <none>   18h     v1.19.2-34+a6fde53360a5d7
k8snode1   Ready      <none>   7d17h   v1.19.2-34+a6fde53360a5d7
k8snode2   Ready      <none>   7d17h   v1.19.2-34+a6fde53360a5d7

ktsakalozos commented 3 years ago

@VaticanUK the node2 and node3 logs attached above show the apiserver as not very healthy. @freeekanayaka there are a few stack traces you may want to look at. However, what I think is happening is that your setup is running out of memory and the OS is killing processes. Would you be interested in adding some swap so the OS can offload unused pages? Normally you do not add swap on k8s nodes because it "confuses" pod scheduling.
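
(A generic way to add a small swap file on Ubuntu for such an experiment; whether the kubelet tolerates swap depends on its --fail-swap-on setting, which on MicroK8s can be adjusted in /var/snap/microk8s/current/args/kubelet if needed:)

# create and enable a 2G swap file
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# make it persistent across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab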

VaticanUK commented 3 years ago

Sure; I thought k8s wouldn't work with swap enabled, but maybe it's just not recommended, from what you say? Is there anything special I need to do, or just re-enable it?

I'm not so convinced it is running out of memory. It may have been previously, but after the earlier discussion around memory I added resource limits to all of my deployments, with the total maximum being 3.6Gi (so it should fit on either node 1 or 2 on their own, overheads aside!). I haven't added resource limits to the deployments created by microk8s, but I was keeping an eye on things yesterday and the memory usage never looked too high.

This is what it looks like now (screenshot attached).

and it never strayed very far from that yesterday. Even when I noticed things were starting to look unhealthy, the memory usage looked not far off that. You could still be right though, so I'm happy to try and investigate that possibility further :)

VaticanUK commented 3 years ago

Thinking further, it's also interesting to note that things started going screwy yesterday when I was trying to get node 3 back into the cluster. I'm going to try to get node 4 happy again later, so I'll see if something similar happens when I do that.