I really like how this is all split up to make the logic easy to understand, but I'd much prefer that we stick with using the files in /etc/ceph/osd/ rather than changing to ceph-volume raw list. In general, the approach I've been taking with fixes for DeepSea is to make the minimum possible change to fix the problem, because the current behaviour is "battle tested" in the wild ;-) If we switch from using the files in /etc/ceph/osd/ to ceph-volume raw list, the latter would need additional testing with various different combinations of OSDs, with and without shared devices, etc.
> It can also happen on SES6 clusters where, for whatever reason, it was decided to create non-LVM disks, but in that case it also failed because DeepSea was trying to read the raw device assignments from /etc/ceph/osd/* and that folder doesn't exist on a SES6 cluster that wasn't upgraded from SES5.
We don't need to worry about that, because SES6 clusters can't create non-LVM OSDs (the ceph-disk tool isn't included in SES6).
For testing purposes, I'm working through my SES5->SES6 upgrade again, with OSDs that have their DB/WAL on shared devices. Check this out:
node1:~ # cat /proc/partitions
major minor #blocks name
254 0 20971520 vda
254 1 2050 vda1
254 2 204802 vda2
254 3 20761536 vda3
254 16 8388608 vdb
254 17 512000 vdb1
254 18 512000 vdb2
254 19 512000 vdb3
254 20 512000 vdb4
254 32 8388608 vdc
254 33 102400 vdc1
254 34 8285167 vdc2
254 48 8388608 vdd
254 49 102400 vdd1
254 50 8285167 vdd2
/dev/vda is my root disk. /dev/vdc and /dev/vdd are OSDs 1 and 7 respectively, and /dev/vdb contains the DB and WAL for OSDs 1 and 7.
Let's look at OSD 1 in more detail:
node1:~ # mount|grep ceph-1
/dev/vdc1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,relatime,attr2,inode64,noquota)
node1:~ # ls -l /var/lib/ceph/osd/ceph-1/
total 64
-rw-r--r-- 1 root root 393 Feb 2 03:55 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Feb 2 03:56 active
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block -> /dev/vdc2
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block.db -> /dev/vdb2
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block_uuid
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block.wal -> /dev/vdb1
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block.wal_uuid
-rw-r--r-- 1 ceph ceph 2 Feb 2 03:55 bluefs
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 fsid
-rw------- 1 ceph ceph 56 Feb 2 03:55 keyring
-rw-r--r-- 1 ceph ceph 8 Feb 2 03:55 kv_backend
-rw-r--r-- 1 ceph ceph 21 Feb 2 03:55 magic
-rw-r--r-- 1 ceph ceph 4 Feb 2 03:55 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Feb 2 03:55 ready
-rw------- 1 ceph ceph 3 Feb 2 04:27 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Feb 2 03:56 systemd
-rw-r--r-- 1 ceph ceph 10 Feb 2 03:55 type
-rw-r--r-- 1 ceph ceph 2 Feb 2 03:55 whoami
So we can see that OSD 1 is using four partitions across two devices: /dev/vdc1 for its little XFS metadata partition, /dev/vdc2 for its block partition (i.e. where the data is stored), /dev/vdb1 for its WAL and /dev/vdb2 for its DB.
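Just to illustrate how that mapping can be recovered programmatically: block, block.db and block.wal are plain symlinks, so resolving them is enough to get the real partitions. This is only a sketch (the helper name is mine, and it assumes the OSD's metadata partition is already mounted as above); it deliberately doesn't cover the data partition, which is the mount itself rather than a symlink:

```python
import os

def osd_block_devices(osd_id, base="/var/lib/ceph/osd"):
    """Resolve the block/DB/WAL partitions of a ceph-disk style OSD.

    Sketch only: assumes /var/lib/ceph/osd/ceph-<id> is mounted, as in
    the listing above. The data partition (/dev/vdc1 here) is the mount
    itself, so it isn't returned by this helper.
    """
    osd_dir = os.path.join(base, "ceph-{}".format(osd_id))
    devices = {}
    for name in ("block", "block.db", "block.wal"):
        link = os.path.join(osd_dir, name)
        if os.path.islink(link):
            # e.g. block -> /dev/vdc2, block.db -> /dev/vdb2, block.wal -> /dev/vdb1
            devices[name] = os.path.realpath(link)
    return devices

print(osd_block_devices(1))
# {'block': '/dev/vdc2', 'block.db': '/dev/vdb2', 'block.wal': '/dev/vdb1'}
```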
The JSON generated during upgrade by ceph-volume simple scan --force is as follows (it also includes those four devices, albeit with some in "/dev/disk/..." form):
node1:~ # cat /etc/ceph/osd/1-729ca663-f36e-4434-ab7d-48534edb30c3.json
{
"active": "ok",
"block": {
"path": "/dev/disk/by-partuuid/4fcd6be5-1534-4500-a1e4-65b098bf5c5a",
"uuid": "4fcd6be5-1534-4500-a1e4-65b098bf5c5a"
},
"block.db": {
"path": "/dev/disk/by-partuuid/982da263-6948-4789-a5e0-d5ffb0815588",
"uuid": "982da263-6948-4789-a5e0-d5ffb0815588"
},
"block.db_uuid": "982da263-6948-4789-a5e0-d5ffb0815588",
"block.wal": {
"path": "/dev/disk/by-partuuid/1b337abb-48d8-4941-80ae-3c029c046b1e",
"uuid": "1b337abb-48d8-4941-80ae-3c029c046b1e"
},
"block.wal_uuid": "1b337abb-48d8-4941-80ae-3c029c046b1e",
"block_uuid": "4fcd6be5-1534-4500-a1e4-65b098bf5c5a",
"bluefs": 1,
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"cluster_name": "ceph",
"data": {
"path": "/dev/vdc1",
"uuid": "729ca663-f36e-4434-ab7d-48534edb30c3"
},
"fsid": "729ca663-f36e-4434-ab7d-48534edb30c3",
"keyring": "AQDMAPphkZGwCRAAMxrIcALlcLkAzXJ95woTSA==",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"ready": "ready",
"require_osd_release": "",
"systemd": "",
"type": "bluestore",
"whoami": 1
}
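The /dev/disk/by-partuuid/* paths in that JSON are themselves symlinks, so all four partitions for the OSD can be derived from the file with something like the following. Again, this is only a sketch to show what information the file carries, not the actual DeepSea code (the hard-coded path is just the example file above):

```python
import json
import os

# Example file from above; DeepSea would iterate over /etc/ceph/osd/*.json
json_path = "/etc/ceph/osd/1-729ca663-f36e-4434-ab7d-48534edb30c3.json"

with open(json_path) as f:
    osd = json.load(f)

partitions = set()
for key in ("data", "block", "block.db", "block.wal"):
    entry = osd.get(key)
    if isinstance(entry, dict) and entry.get("path"):
        # by-partuuid paths are symlinks to the real partitions
        partitions.add(os.path.realpath(entry["path"]))

print(sorted(partitions))
# ['/dev/vdb1', '/dev/vdb2', '/dev/vdc1', '/dev/vdc2']
```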
Then I tried ceph-volume raw list, and discovered that it only shows the block partition for each OSD (/dev/vdc2 for OSD 1). It doesn't list the other three partitions that each OSD uses, so unfortunately there's not enough information in there anyway to be able to zap everything:
node1:~ # ceph-volume raw list
{
"1": {
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"device": "/dev/vdc2",
"osd_id": 1,
"osd_uuid": "729ca663-f36e-4434-ab7d-48534edb30c3",
"type": "bluestore"
},
"7": {
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"device": "/dev/vdd2",
"osd_id": 7,
"osd_uuid": "1de6a4ff-0722-4561-84e9-6b7ae7228ab8",
"type": "bluestore"
}
}
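For comparison, consuming that output programmatically only ever yields one device per OSD, since "device" is the sole device field in the raw list JSON. A sketch, assuming the command is run on the node above:

```python
import json
import subprocess

# `ceph-volume raw list` prints JSON keyed by OSD id, as shown above
output = subprocess.check_output(["ceph-volume", "raw", "list"])
osds = json.loads(output)

for osd_id, info in sorted(osds.items()):
    # Only the block device is reported; data/DB/WAL partitions are absent
    print(osd_id, info["device"])
# 1 /dev/vdc2
# 7 /dev/vdd2
```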
Fixes https://bugzilla.suse.com/show_bug.cgi?id=1194807
There is a problem when zapping partitions that are part of a disk with a GPT partition table: DeepSea zaps the partition but not the whole disk, so the partition table remains and prevents the OSD from being deployed again later in SES6.
This mainly happens on clusters upgraded from SES5 that have non-LVM disks.
It can also happen on SES6 clusters where, for whatever reason, it was decided to create non-LVM disks, but in that case it also failed because DeepSea was trying to read the raw device assignments from /etc/ceph/osd/* and that folder doesn't exist on a SES6 cluster that wasn't upgraded from SES5.
This branch uses ceph-volume raw list to obtain the device to zap, instead of reading /etc/ceph/osd/*.
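To illustrate the kind of whole-disk cleanup this is after: given a partition such as /dev/vdc2, the parent disk can be found via lsblk's PKNAME column and then wiped with sgdisk so that no stale GPT metadata survives. This is only a rough sketch of the idea, not the code in this branch:

```python
import subprocess

def parent_disk(partition):
    """Return the parent disk of a partition, e.g. /dev/vdc2 -> /dev/vdc."""
    # PKNAME is the parent kernel device name reported by lsblk
    name = subprocess.check_output(
        ["lsblk", "-no", "pkname", partition]).decode().strip()
    return "/dev/{}".format(name) if name else partition

def zap_whole_disk(disk):
    """Destroy the GPT (and MBR) data structures on the whole disk."""
    subprocess.check_call(["sgdisk", "--zap-all", disk])

# e.g. zap_whole_disk(parent_disk("/dev/vdc2")) wipes /dev/vdc entirely,
# so a leftover partition table can't block redeployment of the OSD
```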