canonical / microceph

Ceph for a one-rack cluster and appliances
https://snapcraft.io/microceph
GNU Affero General Public License v3.0

It's not possible to rebuild a cluster after node failure #350

Open john-terrell opened 1 month ago

john-terrell commented 1 month ago

Issue report

I am testing MicroCeph on a three-node cluster. After removing a node (to simulate a failure) and rebuilding it, the node cannot rejoin the cluster. There is no way to remove the failed node's OSDs, because microceph disk remove attempts to contact the node that failed. Without removing the OSDs, the failed node cannot be removed from the cluster with microceph cluster remove.

What version of MicroCeph are you using?

18.2.0+snap71f71782c5

What are the steps to reproduce this issue?

  1. Install MicroCeph on three nodes and form a cluster.
  2. Remove one of the nodes to simulate a node failure.
  3. Try to remove the failed node from MicroCeph. This fails, because removing its OSDs tries to contact the failed node.
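
The steps above can be sketched as a command sequence. This is only an illustration of the scenario, not a transcript from the affected cluster: the hostnames node1 to node3, the device /dev/sdb, the OSD number 3 and the token placeholder are all hypothetical, and the script defaults to a dry run that prints each command instead of executing it. The subcommands themselves (cluster bootstrap, cluster add, cluster join, disk add, disk list, disk remove, cluster remove) are the standard MicroCeph CLI ones as far as I know.

```shell
#!/bin/sh
# Reproduction sketch for the steps above. node1..node3, /dev/sdb, the OSD
# number and the token are hypothetical placeholders, not values from the report.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1, on node1: install, bootstrap, and issue join tokens for the others.
setup_first_node() {
    run sudo snap install microceph
    run sudo microceph cluster bootstrap
    run sudo microceph cluster add node2   # prints a join token for node2
    run sudo microceph cluster add node3   # prints a join token for node3
}

# Step 1, on node2 and node3: join with the token, then add a disk as an OSD.
join_node() {
    token="$1"
    run sudo snap install microceph
    run sudo microceph cluster join "$token"
    run sudo microceph disk add /dev/sdb
}

# Step 3, on a surviving node, after node3 has been taken away (step 2).
remove_failed_node() {
    failed_node="$1"
    osd_id="$2"
    run sudo microceph disk list
    run sudo microceph disk remove "$osd_id"          # reported to fail here
    run sudo microceph cluster remove "$failed_node"  # blocked while the OSD remains
}

setup_first_node
join_node "<token-for-node2>"
remove_failed_node node3 3
```

With the default dry run this just prints the planned commands; setting DRY_RUN=0 would execute them, which only makes sense on disposable test machines.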

What happens (observed behaviour)?

The rebuilt node cannot rejoin the cluster because MicroCeph thinks the node already exists. …
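
To see what the cluster still records about the failed node, something like the following could be run on a surviving member. It is a hedged sketch: the function name inspect_membership is made up for illustration, the script only prints the commands unless DRY_RUN is set to something other than 1, and no particular output is implied.

```shell
#!/bin/sh
# Inspection sketch, to be run on a surviving node. Read-only commands only.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

inspect_membership() {
    run sudo microceph cluster list    # members MicroCeph still knows about
    run sudo microceph status          # services and disks per node
    run sudo microceph disk list       # OSDs still registered with MicroCeph
    run sudo microceph.ceph osd tree   # Ceph's own view of hosts and OSDs
}

inspect_membership
```

If the failed node still shows up in the member list, that would be consistent with the rejoin being refused.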

What were you expecting to happen?

Relevant logs, error output, etc.


Additional comments.