canonical / microceph

MicroCeph is snap-deployed Ceph with built-in clustering
https://snapcraft.io/microceph
GNU Affero General Public License v3.0

can not join additional node to cluster #444

Open daniva6 opened 1 month ago

daniva6 commented 1 month ago

Issue report

What version of MicroCeph are you using?

18.2.4 reef (stable)


I had to forcefully remove a node due to a hardware failure. Afterwards I wanted to join a new node with the previous node's name and IP address, but unfortunately it ran into a timeout. Now I'm getting the following error: Error: failed to record mon db entries: failed to record mon host: This "config" entry already exists. I then tried to add a new node with a new IP address and name, but unfortunately got the same result.
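For context, the forced removal mentioned above is typically done with something like the following sketch (the node name is a hypothetical placeholder, and the --force flag is assumed to be what "forcefully" refers to here):

# remove an unreachable node from the MicroCeph cluster (hostname is hypothetical)
$ sudo microceph cluster remove deadnode --force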

What are the steps to reproduce this issue ?

  1. Install a new node with 'sudo snap install microceph --channel reef/stable'
  2. Create a token on a cluster node with 'sudo microceph cluster add <new-node-name>'
  3. Join the cluster on the new node with 'sudo microceph cluster join <token>'
  4. Error: failed to record mon db entries: failed to record mon host: This "config" entry already exists

What happens (observed behaviour) ?

Error: failed to record mon db entries: failed to record mon host: This "config" entry already exists …

What were you expecting to happen ?

new node should have joined the cluster …

Additional comments.

I've realized that in the ceph.conf entry 'mon host = ' only the IPs of the new hosts are present; the IPs of the two currently operating nodes are missing …
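For reference, the generated entry can be checked directly in the snap's conf directory (the same path that comes up later in this thread):

# inspect the generated mon host line (path is the snap's standard conf location)
$ grep 'mon host' /var/snap/microceph/current/conf/ceph.conf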

daniva6 commented 1 month ago

I get the following timeout error when trying to add a new node: 'Error: Post "http://control.socket/cluster/control": context deadline exceeded'
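One way to gather more context on this kind of timeout is the snap's service log (standard snapd tooling, not MicroCeph-specific):

# tail the MicroCeph daemon logs via snapd
$ sudo snap logs microceph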

nickwales commented 1 month ago

Did you get anywhere with this issue? I have the same problem.

daniva6 commented 1 month ago

> Did you get anywhere with this issue? I have the same problem.

Hi Nick, unfortunately not. If someone knows how to update the 'mon host =' statement with the correct IPs, that might solve the problem.

UtkarshBhatthere commented 1 month ago

The mon host = statement is populated from the mon.host.$hostname config entries in MicroCeph's internal dqlite table. With a little bit of SQL magic you can remove stale entries and insert new ones in the table, from which MicroCeph will repopulate the conf file (within a few minutes). See config entry 3 below (marked with *) and the ceph.conf file.

$ sudo microceph cluster sql "select * from config"
+----+----------------------+------------------------------------------+
| id |         key          |                  value                   |
+----+----------------------+------------------------------------------+
| 1  | fsid                 | a307994e-03ed-4122-9ca3-3bb289af9665     |
| 2  | keyring.client.admin | AQAsCBpnPTElMRAA1CRvvsceRWWdm8f/SByOJw== |
| 3* | mon.host.workbook    | 192.168.29.152                           |
| 4  | public_network       | 192.168.29.152/24                        |
+----+----------------------+------------------------------------------+
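For illustration, a hedged sketch of the delete/insert step against the table above (the 'deadnode' hostname and the node1/IP pair are hypothetical placeholders; use your own mon.host.* keys and addresses):

# drop the stale entry for the removed node
$ sudo microceph cluster sql "delete from config where key = 'mon.host.deadnode'"
# insert the entry for the correct node; assumes the id column is auto-assigned
$ sudo microceph cluster sql "insert into config (key, value) values ('mon.host.node1', '192.168.29.151')"

MicroCeph should then repopulate ceph.conf from the table, as described above.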

$ pwd
/var/snap/microceph/current/conf
$ cat ceph.conf 
# # Generated by MicroCeph, DO NOT EDIT.
[global]
run dir = /var/snap/microceph/current/run
fsid = a307994e-03ed-4122-9ca3-3bb289af9665
mon host = 192.168.29.152
public_network = 192.168.29.152/24
auth allow insecure global id reclaim = false
ms bind ipv4 = true
ms bind ipv6 = false

UtkarshBhatthere commented 1 month ago

@daniva6 @nickwales can you please try the above method (read: hack) and see if that solves it for you?

UtkarshBhatthere commented 1 month ago

@sabaini #446

daniva6 commented 1 month ago

@UtkarshBhatthere I applied the hack and was able to remove the non-existing nodes. Unfortunately, trying to add a new node now results in a timeout error. The microceph commands seem to work (cluster list, status, disk list), but the ceph command hangs, and the node cannot bring up the OSDs anymore.

Is there a possibility to extract data from the disks? Or to import an OSD into another cluster?

UtkarshBhatthere commented 2 weeks ago

Hey @daniva6, can you please provide a bit more information:

  1. sudo microceph cluster sql "select * from config"
  2. ceph mon dump
  3. Hostnames and IP addresses of the member nodes, to compare what goes where
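For reference, on a MicroCeph node these would typically be run as follows (microceph.ceph is the snap-namespaced way to reach the ceph CLI; a plain ceph alias may also be available):

$ sudo microceph cluster sql "select * from config"
$ sudo microceph.ceph mon dump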

daniva6 commented 2 weeks ago

@UtkarshBhatthere I did set up a new Ceph cluster and had to delete the old one for space reasons - this was a good test for my backups :)