canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

HA cluster becomes unavailable #4252

Open derritter88 opened 11 months ago

derritter88 commented 11 months ago

Summary

If I reboot my server or shut it down without running microk8s stop, my HA cluster becomes unavailable. In general, my setup consists of four identical VMs and a Raspberry Pi 4.

I shut down two VMs, leaving three MicroK8s nodes online and available. To keep the cluster available, I had to restart one of the powered-off VMs, run microk8s stop on it, and then power it off again afterwards.

What Should Happen Instead?

If a VM, server, or anything else becomes unavailable without microk8s stop, the cluster itself should recognise this and remain available.

Reproduction Steps

  1. Run five MicroK8s nodes in an HA configuration.
  2. Power down two of them without running microk8s stop.
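
For reference, a minimal sketch of such a setup (assuming Ubuntu hosts with snapd available; <first-node-ip> and <token> are placeholders):

# on every node
sudo snap install microk8s --classic
# on the first node: prints a 'microk8s join ...' command containing a token
microk8s add-node
# on each additional node, run the printed join command, e.g.:
microk8s join <first-node-ip>:25000/<token>
# with three or more nodes joined, HA is enabled automatically; verify with:
microk8s status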

Are you interested in contributing with a fix?

Yes

neoaggelos commented 11 months ago

Hi @derritter88

Thank you for creating this issue. I wonder if it would be possible to provide some more information about the circumstances under which this occurs.

For some background, a 5-node dqlite cluster consists of the following node types:

  - 3 voter nodes, which participate in the raft consensus
  - 2 standby nodes

The standby nodes do not participate in consensus, but rather only stream the raft logs. When one voter goes down, one of the standby nodes will typically take its place.

I wonder if the problem you experienced stems from a scenario where 2 voters go down simultaneously, so the remaining nodes do not have quorum to promote a different node. I also think that, for consistency's sake, this is a valid approach (e.g. to cover cases of network segmentation).

Could you check the role of the nodes when you are testing the failure scenario? An easy way would be to check the contents of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml, which includes the role of each node.
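
For example (a quick check; the path assumes the default snap install, and the role values follow the dqlite convention of 0 = voter, 1 = standby, 2 = spare):

# run on any node
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml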

I believe one solution to this would be to tell dqlite to use 5 voters instead of 3. There is a configuration option for this in dqlite, but there is no configuration option for MicroK8s atm. Definitely something for us to consider.

Can you let me know if the above helps, @derritter88? Thanks! Happy to discuss any questions/ideas you might have.

derritter88 commented 11 months ago

Hello @neoaggelos, thanks for your reply!

I wonder if the problem you experienced stems from a scenario where 2 voters go down simultaneously, so the remaining nodes do not have quorum to promote a different node. I also think that, for consistency's sake, this is a valid approach (e.g. to cover cases of network segmentation).

Yes, this can totally happen. My physical layout is: 2x Proxmox servers, each with 2x Ubuntu 22.04 MicroK8s VMs, plus 1x Raspberry Pi 4 with 8 GB RAM.

If two voters were on the same Proxmox server and that server went down, the master would still be there, but without those voters.

The content of /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml would be:

- Address: 192.168.20.80:19001
  ID: 3297041220608546238
  Role: 0
- Address: 192.168.20.81:19001
  ID: 8332254559105726244
  Role: 0
- Address: 192.168.20.82:19001
  ID: 15845298156184356509
  Role: 1
- Address: 192.168.20.83:19001
  ID: 3377626138330688231
  Role: 1
- Address: 192.168.20.84:19001
  ID: 4183241747773521660
  Role: 0

Additionally, microk8s status says:

  datastore master nodes: 192.168.20.80:19001 192.168.20.81:19001 192.168.20.84:19001
  datastore standby nodes: 192.168.20.82:19001 192.168.20.83:19001

This would indicate that if IPs .80 & .81 (= two VMs on the same Proxmox) become unavailable, then the whole cluster will stop.

I believe one solution to this would be to tell dqlite to use 5 voters instead of 3. There is a configuration option for this in dqlite, but there is no configuration option for MicroK8s atm. Definitely something for us to consider.

Are there any guidelines on how to set this up within dqlite?

neoaggelos commented 11 months ago

dqlite accepts a WithVoters() option, as shown in https://github.com/canonical/go-dqlite/blob/beebd0121cfa366ebf3cbb9cf9e807af812aa38e/app/options.go#L103. Though I think this needs code support and cannot be set manually on a running cluster.

An alternative option would be to configure a failure domain for each of your nodes, then dqlite can prioritize nodes in different failure domains as voters.

Our documentation for this is a bit unclear at the moment. In general, this is a per-node configuration and could look like this (assuming ssh access is available). The failure domain would be an arbitrary integer value (use the same value for nodes in the same failure domain). Example:

# set failure domain of first two VMs to '100'
echo failure-domain=100 | ssh 192.168.20.80 sudo tee /var/snap/microk8s/current/var/kubernetes/backend/ha-conf
echo failure-domain=100 | ssh 192.168.20.81 sudo tee /var/snap/microk8s/current/var/kubernetes/backend/ha-conf
# set failure domain of other two VMs to '101'
echo failure-domain=101 | ssh 192.168.20.82 sudo tee /var/snap/microk8s/current/var/kubernetes/backend/ha-conf
echo failure-domain=101 | ssh 192.168.20.83 sudo tee /var/snap/microk8s/current/var/kubernetes/backend/ha-conf
# set failure domain of rpi to '102'
echo failure-domain=102 | ssh 192.168.20.84 sudo tee /var/snap/microk8s/current/var/kubernetes/backend/ha-conf

Then, restart the MicroK8s services on each node (one by one):

ssh 192.168.20.80 sudo systemctl restart snap.microk8s.daemon-kubelite
ssh 192.168.20.80 sudo systemctl restart snap.microk8s.daemon-k8s-dqlite
# .... (repeat for the remaining nodes)

In the service logs, you should see the failure domain of each node.
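
For example, on the first node (a sketch; the exact wording of the log lines may differ):

# confirm the setting was written
ssh 192.168.20.80 sudo cat /var/snap/microk8s/current/var/kubernetes/backend/ha-conf
# search the dqlite service logs for the failure domain
ssh 192.168.20.80 sudo journalctl -u snap.microk8s.daemon-k8s-dqlite | grep -i failure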

Support for these kinds of setups is not too easy atm, as you might understand, but that is something we could maybe work on for MicroK8s 1.29.

derritter88 commented 11 months ago

Thanks for the information. I have added failure domain information per node.

The only issue: sudo systemctl restart microk8s.daemon-k8s-dqlite does not work, as I have installed it via snap (the unit name needs the snap. prefix, as in your commands above).

derritter88 commented 11 months ago

I tried a reboot, but after the reboot the setting from my earlier command was gone: echo 100 | sudo tee /var/snap/microk8s/current/var/kubernetes/backend/failure-domain

Now the value is back to 1.

neoaggelos commented 11 months ago

Hi @derritter88

You are right, apologies. I have updated the commands above. See also https://microk8s.io/docs/high-availability for more details around this.

derritter88 commented 11 months ago

Hello @neoaggelos,

Thanks, this works. May I suggest that it might be a good approach for any node that is not currently acting as a master to become a kind of witness? That way, if two masters go down at once with one remaining, two new masters would be elected.

neoaggelos commented 11 months ago

May I suggest that it might be a good approach for any node that is not currently acting as a master to become a kind of witness? That way, if two masters go down at once with one remaining, two new masters would be elected.

This is the tricky part, and it could be catastrophic in cases where there is network segmentation between the nodes. Imagine the scenario where you have nodes A, B, C (voters) and D, E (standby). If for some reason nodes B and C become segmented from A, D and E, then you lose consistency:

  - The B, C segment still holds 2 of the 3 original voters, so it has quorum and keeps accepting writes.
  - With the suggested behaviour, the A, D, E segment would promote D and E to voters, also reach quorum, and keep accepting writes as well.

When the network issue resolves, the cluster would be in a broken state, and that is not acceptable. The solution would be to start by configuring 5 voters, so that it is clear which of the ADE or BC segments has quorum.

Sorry, it's late afternoon of a long day, hope the above makes sense.

derritter88 commented 11 months ago

Okay, this is understandable.

So in general, what would be the proper way to proceed if I need to shut down or reboot one of my physical servers? So far I have only drained the nodes within Kubernetes, but didn't stop MicroK8s itself.
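
I assume something like the following (a sketch based on the commands discussed in this thread, not an official recommendation; <node-name> is a placeholder):

# from a node that stays up: move workloads off the node going down
microk8s kubectl drain <node-name> --ignore-daemonsets
# on the node going down: stop MicroK8s cleanly before the reboot/shutdown
microk8s stop
# ... reboot or shut down the physical server ...
# after it is back: start MicroK8s and allow scheduling again
microk8s start
microk8s kubectl uncordon <node-name>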

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.