The error is raised due to an empty return from `disks.deploy`, which is evaluated (unconditionally) in `rebuild.py`.

The fix for this is two-fold:

1. Allow an empty return (but raise an error) in `_check_deploy` (sketched below).
2. Make `osd.remove $id` actually zap the disk (unmount/zap/clean).
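A minimal sketch of what (1) could look like, assuming `_check_deploy` receives the raw return value of `disks.deploy`; the actual signature and error reporting in `rebuild.py` may differ:

```python
# Hypothetical sketch for fix (1); the real _check_deploy in rebuild.py
# may take different arguments and report errors through Salt instead.
def _check_deploy(deploy_ret):
    """Validate the return of disks.deploy before evaluating it further.

    An empty return no longer crashes the runner; it is surfaced as an
    explicit error so the orchestration fails with a clear message.
    """
    if not deploy_ret:
        raise RuntimeError(
            "disks.deploy returned an empty result; no OSDs were deployed "
            "(check ceph-volume output and the state of the drives)")
    return deploy_ret
```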
Point (2) is motivated by an as-yet-unexplained behavior in the `osd.remove` runner, which no longer completely destroys the LVs of the OSDs it removes. After a zap:
```
data1:~ # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 20G 0 disk
└─vda1 254:1 0 20G 0 part /
vdb 254:16 0 20G 0 disk
└─ceph--051e6039--51aa--4216--92b1--15d97b25a1f0-osd--data--141e0a0c--86f8--4082--99eb--6e526010d7f7 253:0 0 19G 0 lvm
vdc 254:32 0 20G 0 disk
└─ceph--e6cd697b--029b--4b78--8838--e1af0aa9e1df-osd--data--c1cd2c10--ee8b--4f90--bd0a--f26b4bcfc078 253:1 0 19G 0 lvm
vdd 254:48 0 20G 0 disk
└─ceph--29305dd3--c305--4f87--8e6d--ab3507d1c70b-osd--data--a634f37e--4a1c--483c--bd8a--312361bd779c 253:2 0 19G 0 lvm
vde 254:64 0 20G 0 disk
└─ceph--8ddf59a4--e7bd--4d9c--b06f--11e24aec4e52-osd--data--0ec7d57e--96dc--4a6f--a9aa--d2acbb68293d 253:3 0 19G 0 lvm
vdf 254:80 0 20G 0 disk
└─ceph--bf65731b--15d4--4c62--b04c--46a796786b29-osd--data--5a37b17c--5f56--4cb3--b6cf--9d3153c7032b 253:4 0 19G 0 lvm
vdg 254:96 0 10G 0 disk
└─ceph--226cf2f2--5f77--41fd--917c--f84601ba9d4d-osd--data--494a7156--7c50--4c59--b750--876cd38438ce 253:5 0 9G 0 lvm
vdh 254:112 0 10G 0 disk
└─ceph--49cb4914--6852--4ada--af04--7f58ff9ac38d-osd--data--7227cdd6--87d9--4ef1--b8e0--787129bef068 253:6 0 9G 0 lvm
```
The disks are expected to be clean and unmounted after this. The command used is `ceph-volume lvm zap --osd-id $id --destroy`.

I suspect that ceph-volume handles drives differently when passed `--osd-id` than when passed a raw device (/dev/sdx); the `--destroy` parameter shouldn't change its behavior based on the input. This still needs to be verified, though.
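Until that is verified, one possible workaround (a rough sketch only; the JSON field names from `ceph-volume lvm list` are assumed from 14.2.x and error handling is omitted) would be to resolve the raw devices backing the OSD and zap those directly instead of passing `--osd-id`:

```python
# Hypothetical workaround sketch: zap the raw devices backing an OSD
# rather than relying on `--osd-id`. The JSON layout of
# `ceph-volume lvm list --format json` is assumed (dict keyed by OSD id,
# each entry carrying a "devices" list) and may differ between releases.
import json
import subprocess


def zap_osd_devices(osd_id):
    """Resolve the devices backing *osd_id* and zap each one directly."""
    out = subprocess.check_output(
        ["ceph-volume", "lvm", "list", "--format", "json"])
    report = json.loads(out)

    devices = set()
    for entry in report.get(str(osd_id), []):
        devices.update(entry.get("devices", []))

    for dev in sorted(devices):
        # Passing the raw device should take the code path that is known
        # to remove the VG/LVs and wipe the drive, sidestepping the
        # suspected `--osd-id` behavior.
        subprocess.check_call(
            ["ceph-volume", "lvm", "zap", "--destroy", dev])
```

Zapping by device path should leave `lsblk` free of the leftover `ceph--*` LVs shown above, but this is only a sketch of the idea, not the runner's actual implementation.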
Note that this only started happening when we switched the ceph package from 14.2.2 to 14.2.3.
Also, upstream has released (or, rather, is in the process of releasing) 14.2.4 to fix this. See https://github.com/ceph/ceph/pull/30429
Confirmed - the failure doesn't happen with 14.2.4, so this is another symptom of the ceph-volume regression that found its way into 14.2.3.
`salt-run --no-color state.orch ceph.functests.3nodes`
fails reproducibly with: