Hi @fzhan thanks for the bug report.
Some questions:
Could you share the output of sudo ceph -s, i.e. the ceph status command? Thanks in advance!
Hi @sabaini, I will try to provide as much as possible. The disks were added with microceph disk add /dev/sdb --wipe and they were added successfully. For ceph -s:
root@node1:/home/admin# ceph -s
  cluster:
    id:     44be5699-79bc-465e-9bd4-07afced519e2
    health: HEALTH_WARN
            Degraded data redundancy: 10094/560481 objects degraded (1.801%), 3 pgs degraded, 3 pgs undersized
            1 pgs not deep-scrubbed in time
            2 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum node1,node3,node4,node2 (age 3h)
    mgr: node2(active, since 6d), standbys: node3, node1, node5
    mds: 1/1 daemons up, 2 standby
    osd: 5 osds: 5 up (since 6d), 5 in (since 6d)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 186.83k objects, 21 GiB
    usage:   87 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     10094/560481 objects degraded (1.801%)
             78 active+clean
             3  active+undersized+degraded

  io:
    client: 764 KiB/s rd, 357 KiB/s wr, 2 op/s rd, 21 op/s wr
Both node 1 and node 2 are missing; here are the logs:
root@node1:/home/admin# tail /var/snap/microceph/common/logs/ceph-osd.24.log
2024-08-19T09:44:54.463+1000 7fa0c06b28c0 1 mClockScheduler: set_osd_capacity_params_from_config: osd_bandwidth_cost_per_io: 499321.90 bytes/io, osd_bandwidth_capacity_per_shard 31457280.00 bytes/second
2024-08-19T09:44:54.463+1000 7fa0c06b28c0 0 osd.24:4.OSDShard using op scheduler mClockScheduler
2024-08-19T09:44:54.495+1000 7fa0c06b28c0 -1 bluestore(/var/lib/ceph/osd/ceph-24/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-08-19T09:44:54.527+1000 7fa0c06b28c0 -1 bluestore(/var/lib/ceph/osd/ceph-24/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-08-19T09:44:54.543+1000 7fa0c06b28c0 1 bdev(0x560a85e91500 /var/lib/ceph/osd/ceph-24/block) open path /var/lib/ceph/osd/ceph-24/block
2024-08-19T09:44:54.543+1000 7fa0c06b28c0 1 bdev(0x560a85e91500 /var/lib/ceph/osd/ceph-24/block) open size 536870912000 (0x7d00000000, 500 GiB) block_size 4096 (4 KiB) non-rotational device, discard supported
2024-08-19T09:44:54.543+1000 7fa0c06b28c0 -1 bluestore(/var/lib/ceph/osd/ceph-24/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-08-19T09:44:54.543+1000 7fa0c06b28c0 1 bdev(0x560a85e91500 /var/lib/ceph/osd/ceph-24/block) close
2024-08-19T09:44:54.843+1000 7fa0c06b28c0 -1 osd.24 0 OSD:init: unable to mount object store
2024-08-19T09:44:54.843+1000 7fa0c06b28c0 -1 ** ERROR: osd init failed: (2) No such file or directory
root@node1:/home/admin# tail /var/snap/microceph/common/logs/ceph-osd.29.log
2024-08-19T11:13:10.652+1000 7f9957381640 0 bluestore(/var/lib/ceph/osd/ceph-29) probe -9: 0, 0, 0
2024-08-19T11:13:10.652+1000 7f9957381640 0 bluestore(/var/lib/ceph/osd/ceph-29) probe -17: 0, 0, 0
2024-08-19T11:13:10.652+1000 7f9957381640 0 bluestore(/var/lib/ceph/osd/ceph-29) ------------
2024-08-19T11:13:10.652+1000 7f995bbc38c0 4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-08-19T11:13:10.652+1000 7f995bbc38c0 4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-08-19T11:13:10.652+1000 7f995bbc38c0 1 bluefs umount
2024-08-19T11:13:10.652+1000 7f995bbc38c0 1 bdev(0x561135fec700 /var/lib/ceph/osd/ceph-29/block) close
2024-08-19T11:13:10.896+1000 7f995bbc38c0 1 freelist shutdown
2024-08-19T11:13:10.896+1000 7f995bbc38c0 1 bdev(0x561135fec000 /var/lib/ceph/osd/ceph-29/block) close
2024-08-19T11:13:11.040+1000 7f995bbc38c0 0 created object store /var/lib/ceph/osd/ceph-29 for osd.29 fsid 44be5699-79bc-465e-9bd4-07afced519e2
Here's the log for node 2:
root@node2:/var/virtual_disk# tail /var/snap/microceph/common/logs/ceph-osd.27.log
2024-08-19T10:20:15.579+1000 7ff56d2f9640 0 bluestore(/var/lib/ceph/osd/ceph-27) probe -9: 0, 0, 0
2024-08-19T10:20:15.579+1000 7ff56d2f9640 0 bluestore(/var/lib/ceph/osd/ceph-27) probe -17: 0, 0, 0
2024-08-19T10:20:15.579+1000 7ff56d2f9640 0 bluestore(/var/lib/ceph/osd/ceph-27) ------------
2024-08-19T10:20:15.583+1000 7ff571b3b8c0 4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-08-19T10:20:15.583+1000 7ff571b3b8c0 4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-08-19T10:20:15.595+1000 7ff571b3b8c0 1 bluefs umount
2024-08-19T10:20:15.595+1000 7ff571b3b8c0 1 bdev(0x55e482772700 /var/lib/ceph/osd/ceph-27/block) close
2024-08-19T10:20:15.883+1000 7ff571b3b8c0 1 freelist shutdown
2024-08-19T10:20:15.883+1000 7ff571b3b8c0 1 bdev(0x55e482772000 /var/lib/ceph/osd/ceph-27/block) close
2024-08-19T10:20:16.115+1000 7ff571b3b8c0 0 created object store /var/lib/ceph/osd/ceph-27 for osd.27 fsid 44be5699-79bc-465e-9bd4-07afced519e2
root@node2:/var/virtual_disk# tail /var/snap/microceph/common/logs/ceph-osd.28.log
2024-08-19T10:24:11.896+1000 7ff81b6b6640 0 bluestore(/var/lib/ceph/osd/ceph-28) probe -9: 0, 0, 0
2024-08-19T10:24:11.896+1000 7ff81b6b6640 0 bluestore(/var/lib/ceph/osd/ceph-28) probe -17: 0, 0, 0
2024-08-19T10:24:11.896+1000 7ff81b6b6640 0 bluestore(/var/lib/ceph/osd/ceph-28) ------------
2024-08-19T10:24:11.896+1000 7ff81fef88c0 4 rocksdb: [db/db_impl/db_impl.cc:496] Shutdown: canceling all background work
2024-08-19T10:24:11.896+1000 7ff81fef88c0 4 rocksdb: [db/db_impl/db_impl.cc:704] Shutdown complete
2024-08-19T10:24:11.896+1000 7ff81fef88c0 1 bluefs umount
2024-08-19T10:24:11.896+1000 7ff81fef88c0 1 bdev(0x556dc4698700 /var/lib/ceph/osd/ceph-28/block) close
2024-08-19T10:24:12.172+1000 7ff81fef88c0 1 freelist shutdown
2024-08-19T10:24:12.172+1000 7ff81fef88c0 1 bdev(0x556dc4698000 /var/lib/ceph/osd/ceph-28/block) close
2024-08-19T10:24:12.324+1000 7ff81fef88c0 0 created object store /var/lib/ceph/osd/ceph-28 for osd.28 fsid 44be5699-79bc-465e-9bd4-07afced519e2
Here's ceph health detail; it seems like they've been stuck for a while:
HEALTH_WARN Degraded data redundancy: 10492/587073 objects degraded (1.787%), 3 pgs degraded, 3 pgs undersized; 2 pgs not deep-scrubbed in time; 2 pgs not scrubbed in time
[WRN] PG_DEGRADED: Degraded data redundancy: 10492/587073 objects degraded (1.787%), 3 pgs degraded, 3 pgs undersized
pg 1.0 is stuck undersized for 7d, current state active+undersized+degraded, last acting [7,1]
pg 4.7 is stuck undersized for 7d, current state active+undersized+degraded, last acting [16,1]
pg 4.13 is stuck undersized for 7d, current state active+undersized+degraded, last acting [16,1]
[WRN] PG_NOT_DEEP_SCRUBBED: 2 pgs not deep-scrubbed in time
pg 4.13 not deep-scrubbed since 2024-08-12T09:39:35.563752+1000
pg 4.7 not deep-scrubbed since 2024-08-13T16:25:05.685339+1000
[WRN] PG_NOT_SCRUBBED: 2 pgs not scrubbed in time
pg 4.13 not scrubbed since 2024-08-13T16:25:19.422366+1000
pg 4.7 not scrubbed since 2024-08-13T16:25:05.685339+1000
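A couple of standard Ceph commands to see which OSDs those stuck PGs are waiting on (illustrative; PG id taken from the output above):
ceph pg dump_stuck undersized   # list PGs stuck in the undersized state
ceph pg 4.7 query               # shows the up/acting OSD sets for one of the stuck PGs
ceph osd tree                   # confirm which OSDs are up/in and where they live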
For removing OSDs from MicroCeph please use microceph disk remove -- cf. https://canonical-microceph.readthedocs-hosted.com/en/latest/reference/commands/disk/#remove -- otherwise MicroCeph won't know about the removed OSDs. While you can remove OSDs directly via Ceph, this is not recommended with MicroCeph, and managing such OSDs through MicroCeph is not supported.
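For example (the exact OSD identifier is per the linked reference; use a placeholder here):
sudo microceph disk list            # note the OSD to remove
sudo microceph disk remove <osd-id> # remove it through MicroCeph, not plain Ceph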
Regarding the cluster health: as you say, it seems to be unhealthy even though all expected OSDs are up. Is it possible that one OSD got removed accidentally?
One thing that strikes me as odd from your microceph disk list output is that some nodes have /dev/sdi configured -- which is strange, as MicroCeph tries to normalize disk paths to point to a /dev/disk/by-id device. How did those get added?
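A quick way to check which stable by-id names a disk has (illustrative):
ls -l /dev/disk/by-id/   # lists the by-id symlinks and the devices they point to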
@sabaini one of the disks failed and a node had to be removed; later on the node came back with a freshly installed system. The cluster went through some bad node/disk management -- yes, everything was done in a non-professional manner. Hence the cluster being unhealthy.
As for how the sdi disks were added, I followed the instructions listed here: https://microk8s.io/docs/how-to-ceph#add-virtual-disks
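Roughly, those instructions back each OSD with a loop device exposed under an alternate /dev/sdiX name, along these lines (a sketch from memory; details may differ from the linked page):
sudo truncate -s 1G /mnt/osd-a.img                   # create a sparse backing file
loop_dev="$(sudo losetup --show -f /mnt/osd-a.img)"  # attach it as /dev/loopN
minor="${loop_dev##/dev/loop}"
sudo mknod -m 0660 /dev/sdia b 7 "${minor}"          # expose it under a disk-like alias (hence the /dev/sdiX paths)
sudo microceph disk add /dev/sdia --wipe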
Ah, I see, so /dev/sdi refers to loop devices -- they likely won't have a by-id entry. That makes sense.
For the avoidance of doubt, do keep in mind that while loop devices can work fine as OSDs, they of course won't give you the redundancy and safety of a real disk.
Note that by executing microceph disk remove you could bring MicroCeph's record of OSDs in sync with what is actually in the cluster, but you'd have to pass the --bypass-safety-checks flag -- otherwise MicroCeph will check whether the OSD is present in the cluster. Do be careful with that flag: as it says on the tin, it'll remove whatever you specify, no questions asked.
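For example (OSD id is a placeholder -- double-check with microceph disk list first, since this skips the safety checks):
sudo microceph disk remove <osd-id> --bypass-safety-checks   # removes MicroCeph's record without checking whether the OSD still exists in the cluster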
@sabaini thanks for the update. Unfortunately, even after repeating the 'remove' and 'add' actions, the OSD list in Ceph remained the same. Are there any other ways to get the cluster out of this stuck state?
Here's the latest log from the mgr node:
Aug 28 12:44:42 node2.prompcorp.com.au systemd[1]: Started Service for snap application microceph.mgr.
Aug 28 12:44:42 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:42.304+1000 7fdc71925000 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
Aug 28 12:44:42 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:42.424+1000 7fdc71925000 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
Aug 28 12:44:42 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:42.636+1000 7fdc71925000 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
Aug 28 12:44:43 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:43.668+1000 7fdc71925000 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
Aug 28 12:44:43 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:43.772+1000 7fdc71925000 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
Aug 28 12:44:43 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:43.980+1000 7fdc71925000 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Aug 28 12:44:44 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:44.524+1000 7fdc71925000 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
Aug 28 12:44:44 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:44.804+1000 7fdc71925000 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
Aug 28 12:44:44 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:44.992+1000 7fdc71925000 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
Aug 28 12:44:45 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:45.092+1000 7fdc71925000 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
Aug 28 12:44:45 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:45.288+1000 7fdc71925000 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
Aug 28 12:44:45 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:45.392+1000 7fdc71925000 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Aug 28 12:44:45 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:45.820+1000 7fdc71925000 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 12:44:45 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:45.988+1000 7fdc71925000 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
Aug 28 12:44:46 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:46.628+1000 7fdc71925000 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
Aug 28 12:44:46 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:46.744+1000 7fdc71925000 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
Aug 28 12:44:47 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:47.044+1000 7fdc71925000 -1 mgr[py] Module status has missing NOTIFY_TYPES member
Aug 28 12:44:47 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:47.148+1000 7fdc71925000 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
Aug 28 12:44:47 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:47.432+1000 7fdc71925000 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
Aug 28 12:44:47 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:47.716+1000 7fdc71925000 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.060+1000 7fdc71925000 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.164+1000 7fdc71925000 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.220+1000 7fdc6c903640 -1 mgr handle_mgr_map I was active but no longer am
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.424+1000 7f5a5b1ee000 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.540+1000 7f5a5b1ee000 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
Aug 28 12:44:48 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:48.752+1000 7f5a5b1ee000 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
Aug 28 12:44:49 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:49.676+1000 7f5a5b1ee000 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
Aug 28 12:44:49 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:49.784+1000 7f5a5b1ee000 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
Aug 28 12:44:49 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:49.984+1000 7f5a5b1ee000 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Aug 28 12:44:50 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:50.524+1000 7f5a5b1ee000 -1 mgr[py] Module nfs has missing NOTIFY_TYPES member
Aug 28 12:44:50 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:50.812+1000 7f5a5b1ee000 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
Aug 28 12:44:51 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:51.008+1000 7f5a5b1ee000 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
Aug 28 12:44:51 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:51.108+1000 7f5a5b1ee000 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
Aug 28 12:44:51 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:51.312+1000 7f5a5b1ee000 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
Aug 28 12:44:51 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:51.420+1000 7f5a5b1ee000 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Aug 28 12:44:51 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:51.852+1000 7f5a5b1ee000 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 12:44:52 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:52.024+1000 7f5a5b1ee000 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
Aug 28 12:44:52 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:52.676+1000 7f5a5b1ee000 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
Aug 28 12:44:52 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:52.792+1000 7f5a5b1ee000 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
Aug 28 12:44:53 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:53.096+1000 7f5a5b1ee000 -1 mgr[py] Module status has missing NOTIFY_TYPES member
Aug 28 12:44:53 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:53.196+1000 7f5a5b1ee000 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
Aug 28 12:44:53 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:53.484+1000 7f5a5b1ee000 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
Aug 28 12:44:53 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:53.776+1000 7f5a5b1ee000 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.124+1000 7f5a5b1ee000 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.224+1000 7f5a5b1ee000 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a25aea640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a25aea640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a25aea640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a25aea640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a25aea640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a27aee640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a27aee640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a27aee640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a27aee640 -1 client.0 error registering admin socket command: (17) File exists
Aug 28 12:44:54 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-28T12:44:54.384+1000 7f5a27aee640 -1 client.0 error registering admin socket command: (17) File exists
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: 2024-08-29T10:43:56.541+1000 7f5a13ec7640 -1 Remote method threw exception: Traceback (most recent call last):
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: File "/usr/share/ceph/mgr/nfs/module.py", line 189, in cluster_ls
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: return available_clusters(self)
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: File "/usr/share/ceph/mgr/nfs/utils.py", line 70, in available_clusters
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: completion = mgr.describe_service(service_type='nfs')
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1542, in inner
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: completion = self._oremote(method_name, args, kwargs)
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1609, in _oremote
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: raise NoOrchestrator()
Aug 29 10:43:56 node2.prompcorp.com.au microceph.mgr[384859]: orchestrator._interface.NoOrchestrator: No orchestrator configured (try `ceph orch set backend`)
I managed to remove the two nodes from the cluster, and tried to replicate the process:
1. made sure the nodes no longer exist in the cluster
2. snap remove microceph
3. snap refresh microceph --channel reef/stable
4. snap refresh --hold microceph
5. microceph cluster bootstrap (this step was missing from the documentation)
6. microceph cluster add node and microceph cluster join (see the sketch below)
7. microceph disk add /dev/sdb --wipe
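For reference, the add/join part of step 6 works roughly like this (a sketch based on the MicroCeph docs; node name is illustrative):
sudo microceph cluster add node1    # run on an existing member, with the name of the node being re-added; prints a join token
sudo microceph cluster join <token> # run on the node being re-added, using the token from above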
Step 7 (the disk add) failed with the following error:
+----------+---------+
| PATH | STATUS |
+----------+---------+
| /dev/sdb | Failure |
+----------+---------+
Error: failed to generate OSD keyring: Failed to run: ceph auth get-or-create osd.35 mgr allow profile osd mon allow profile osd osd allow * -o /var/snap/microceph/common/data/osd/ceph-35/keyring: exit status 13 (2024-08-29T11:48:29.612+1000 7f4bfe368640 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster))
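Some basic checks from the node where the disk add failed, to see whether it can reach a monitor and authenticate at all (illustrative):
sudo ceph -s            # does this node connect to the cluster and authenticate?
sudo microceph status   # what does MicroCeph itself report for this node?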
After further investigation, I found from microceph cluster status that the mon and osd services are missing on node 1:
MicroCeph deployment summary:
- node1
Services:
Disks: 0
- node3
Services: mds, mgr, mon, osd
Disks: 1
- node4
Services: mds, mgr, mon, osd
Disks: 2
- node5
Services: mds, mgr, mon, osd
Disks: 2
but when I tried to run microceph enable mon on node 1, I got:
Error: failed placing service mon: host failed hospitality check for mon enablement: mon service already active on host
Turns out the cluster is still out of sync:
Sep 03 15:32:31 node3 kernel: libceph: mon1 (1)0.0.0.3:6789 socket error on write
Sep 03 15:32:32 node3 kernel: libceph: mon3 (1)0.0.0.5:6789 socket closed (con state OPEN)
Sep 03 15:32:32 node3 kernel: libceph: mon0 (1)0.0.0.2:6789 socket closed (con state V1_BANNER)  // should not even be in the cluster; microceph has been removed from this node, clearly not updated in libceph
Sep 03 15:32:32 node3 kernel: libceph: mon0 (1)0.0.0.2:6789 socket closed (con state V1_BANNER)
Sep 03 15:32:32 node3 kernel: libceph: mon0 (1)0.0.0.2:6789 socket closed (con state V1_BANNER)
Sep 03 15:32:32 node3 kernel: libceph: mon0 (1)0.0.0.2:6789 socket closed (con state V1_BANNER)
Sep 03 15:32:32 node3 kernel: libceph: mon1 (1)0.0.0.3:6789 socket error on write
Sep 03 15:32:32 node3 kernel: libceph: mon1 (1)0.0.0.3:6789 socket error on write
Sep 03 15:32:32 node3 kernel: libceph: mon2 (1)0.0.0.4:6789 socket closed (con state OPEN)
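To see which monitors are actually in the current monmap (standard Ceph commands, illustrative):
ceph mon dump        # lists the monitors the cluster itself knows about
ceph quorum_status   # shows which of those are currently in quorum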
Issue report
What version of MicroCeph are you using ?
ceph-version: 18.2.0-0ubuntu3~cloud0; microceph-git: 69853e8621 installed: 18.2.0+snap69853e8621 (1100) 87MB held
What are the steps to reproduce this issue ?
What happens (observed behaviour) ?
ceph osd tree reports:
microceph status reports:
microceph disk list reports:
Disks configured in MicroCeph:
+-----+----------+----------------------------------------+
| OSD | LOCATION | PATH                                   |
+-----+----------+----------------------------------------+
| 1   | node5    | /dev/disk/by-id/wwn-0x5000c500e03686b1 |
+-----+----------+----------------------------------------+
| 7   | node4    | /dev/sdi                               |
+-----+----------+----------------------------------------+
| 8   | node5    | /dev/sdi                               |
+-----+----------+----------------------------------------+
| 16  | node4    | /dev/disk/by-id/wwn-0x5000c500d6298276 |
+-----+----------+----------------------------------------+
| 24  | node1    | /dev/sdi                               |
+-----+----------+----------------------------------------+
| 27  | node2    | /dev/sdi                               |
+-----+----------+----------------------------------------+
| 28  | node2    | /dev/disk/by-id/wwn-0x50014ee26b84c31b |
+-----+----------+----------------------------------------+
| 29  | node1    | /dev/disk/by-id/wwn-0x5000c500d5b910c3 |
+-----+----------+----------------------------------------+
What were you expecting to happen ?
Both should report the same number of disks.
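An illustrative way to compare the two views:
sudo ceph osd tree         # OSDs as Ceph sees them
sudo microceph disk list   # OSDs as recorded by MicroCeph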