I really like how this is all split up to make the logic easy to understand, but I'd much prefer that we stick with using the files in /etc/ceph/osd/ rather than changing to ceph-volume raw list. In general, the approach I've been taking with fixes for DeepSea is to make the minimum possible change to fix the problem, because the current behaviour is "battle tested" in the wild ;-) If we switch from using the files in /etc/ceph/osd/ to ceph-volume raw list, the latter would need additional testing with various different combinations of OSDs, with and without shared devices, etc.
> It can also happen on SES6 clusters where, for whatever reason, it was decided to create non-LVM disks, but in that case it also failed because DeepSea was trying to read the raw device assignments from /etc/ceph/osd/* and that folder doesn't exist on a SES6 cluster that wasn't upgraded from SES5.
We don't need to worry about that, because SES6 clusters can't create non-LVM OSDs (the ceph-disk tool isn't included in SES6).
For testing purposes, I'm working through my SES5->SES6 upgrade again, with OSDs that have their DB/WAL on shared devices. Check this out:
node1:~ # cat /proc/partitions
major minor #blocks name
254 0 20971520 vda
254 1 2050 vda1
254 2 204802 vda2
254 3 20761536 vda3
254 16 8388608 vdb
254 17 512000 vdb1
254 18 512000 vdb2
254 19 512000 vdb3
254 20 512000 vdb4
254 32 8388608 vdc
254 33 102400 vdc1
254 34 8285167 vdc2
254 48 8388608 vdd
254 49 102400 vdd1
254 50 8285167 vdd2
/dev/vda is my root disk. /dev/vdc and /dev/vdd are OSDs 1 and 7 respectively, and /dev/vdb contains the DB and WAL for OSDs 1 and 7.
Let's look at OSD 1 in more detail:
node1:~ # mount|grep ceph-1
/dev/vdc1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,relatime,attr2,inode64,noquota)
node1:~ # ls -l /var/lib/ceph/osd/ceph-1/
total 64
-rw-r--r-- 1 root root 393 Feb 2 03:55 activate.monmap
-rw-r--r-- 1 ceph ceph 3 Feb 2 03:56 active
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block -> /dev/vdc2
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block.db -> /dev/vdb2
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block.db_uuid
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block_uuid
lrwxrwxrwx 1 root root 9 Feb 2 04:27 block.wal -> /dev/vdb1
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 block.wal_uuid
-rw-r--r-- 1 ceph ceph 2 Feb 2 03:55 bluefs
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Feb 2 03:55 fsid
-rw------- 1 ceph ceph 56 Feb 2 03:55 keyring
-rw-r--r-- 1 ceph ceph 8 Feb 2 03:55 kv_backend
-rw-r--r-- 1 ceph ceph 21 Feb 2 03:55 magic
-rw-r--r-- 1 ceph ceph 4 Feb 2 03:55 mkfs_done
-rw-r--r-- 1 ceph ceph 6 Feb 2 03:55 ready
-rw------- 1 ceph ceph 3 Feb 2 04:27 require_osd_release
-rw-r--r-- 1 ceph ceph 0 Feb 2 03:56 systemd
-rw-r--r-- 1 ceph ceph 10 Feb 2 03:55 type
-rw-r--r-- 1 ceph ceph 2 Feb 2 03:55 whoami
So we can see that OSD 1 is using four partitions across two devices: /dev/vdc1 for its little XFS metadata partition, /dev/vdc2 for its block partition (i.e. where the data is stored), /dev/vdb1 for its WAL and /dev/vdb2 for its DB.
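Just to illustrate how that mapping can be recovered programmatically: block, block.db and block.wal are plain symlinks, so resolving them is enough to get the real partitions. This is only a sketch (the helper name is mine, and it assumes the OSD's metadata partition is already mounted as above); it deliberately doesn't cover the data partition, which is the mount itself rather than a symlink:

```python
import os

def osd_block_devices(osd_id, base="/var/lib/ceph/osd"):
    """Resolve the block/DB/WAL partitions of a ceph-disk style OSD.

    Sketch only: assumes /var/lib/ceph/osd/ceph-<id> is mounted, as in
    the listing above. The data partition (/dev/vdc1 here) is the mount
    itself, so it isn't returned by this helper.
    """
    osd_dir = os.path.join(base, "ceph-{}".format(osd_id))
    devices = {}
    for name in ("block", "block.db", "block.wal"):
        link = os.path.join(osd_dir, name)
        if os.path.islink(link):
            # e.g. block -> /dev/vdc2, block.db -> /dev/vdb2, block.wal -> /dev/vdb1
            devices[name] = os.path.realpath(link)
    return devices

print(osd_block_devices(1))
# {'block': '/dev/vdc2', 'block.db': '/dev/vdb2', 'block.wal': '/dev/vdb1'}
```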
The JSON generated during upgrade by ceph-volume simple scan --force is as follows (it also includes those four devices, albeit with some in "/dev/disk/..." form):
node1:~ # cat /etc/ceph/osd/1-729ca663-f36e-4434-ab7d-48534edb30c3.json
{
"active": "ok",
"block": {
"path": "/dev/disk/by-partuuid/4fcd6be5-1534-4500-a1e4-65b098bf5c5a",
"uuid": "4fcd6be5-1534-4500-a1e4-65b098bf5c5a"
},
"block.db": {
"path": "/dev/disk/by-partuuid/982da263-6948-4789-a5e0-d5ffb0815588",
"uuid": "982da263-6948-4789-a5e0-d5ffb0815588"
},
"block.db_uuid": "982da263-6948-4789-a5e0-d5ffb0815588",
"block.wal": {
"path": "/dev/disk/by-partuuid/1b337abb-48d8-4941-80ae-3c029c046b1e",
"uuid": "1b337abb-48d8-4941-80ae-3c029c046b1e"
},
"block.wal_uuid": "1b337abb-48d8-4941-80ae-3c029c046b1e",
"block_uuid": "4fcd6be5-1534-4500-a1e4-65b098bf5c5a",
"bluefs": 1,
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"cluster_name": "ceph",
"data": {
"path": "/dev/vdc1",
"uuid": "729ca663-f36e-4434-ab7d-48534edb30c3"
},
"fsid": "729ca663-f36e-4434-ab7d-48534edb30c3",
"keyring": "AQDMAPphkZGwCRAAMxrIcALlcLkAzXJ95woTSA==",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"ready": "ready",
"require_osd_release": "",
"systemd": "",
"type": "bluestore",
"whoami": 1
}
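The /dev/disk/by-partuuid/* paths in that JSON are themselves symlinks, so all four partitions for the OSD can be derived from the file with something like the following. Again, this is only a sketch to show what information the file carries, not the actual DeepSea code (the hard-coded path is just the example file above):

```python
import json
import os

# Example file from above; DeepSea would iterate over /etc/ceph/osd/*.json
json_path = "/etc/ceph/osd/1-729ca663-f36e-4434-ab7d-48534edb30c3.json"

with open(json_path) as f:
    osd = json.load(f)

partitions = set()
for key in ("data", "block", "block.db", "block.wal"):
    entry = osd.get(key)
    if isinstance(entry, dict) and entry.get("path"):
        # by-partuuid paths are symlinks to the real partitions
        partitions.add(os.path.realpath(entry["path"]))

print(sorted(partitions))
# ['/dev/vdb1', '/dev/vdb2', '/dev/vdc1', '/dev/vdc2']
```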
Then I tried ceph-volume raw list, and discovered that it only shows the block partition for each OSD (/dev/vdc2 for OSD 1). It doesn't list the other three partitions that each OSD uses, so unfortunately there's not enough information in there anyway to be able to zap everything:
node1:~ # ceph-volume raw list
{
"1": {
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"device": "/dev/vdc2",
"osd_id": 1,
"osd_uuid": "729ca663-f36e-4434-ab7d-48534edb30c3",
"type": "bluestore"
},
"7": {
"ceph_fsid": "42c7b82b-a713-3adf-9cce-792af117c9c9",
"device": "/dev/vdd2",
"osd_id": 7,
"osd_uuid": "1de6a4ff-0722-4561-84e9-6b7ae7228ab8",
"type": "bluestore"
}
}
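For comparison, consuming that output programmatically only ever yields one device per OSD, since "device" is the sole device field in the raw list JSON. A sketch, assuming the command is run on the node above:

```python
import json
import subprocess

# `ceph-volume raw list` prints JSON keyed by OSD id, as shown above
output = subprocess.check_output(["ceph-volume", "raw", "list"])
osds = json.loads(output)

for osd_id, info in sorted(osds.items()):
    # Only the block device is reported; data/DB/WAL partitions are absent
    print(osd_id, info["device"])
# 1 /dev/vdc2
# 7 /dev/vdd2
```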
Fixes https://bugzilla.suse.com/show_bug.cgi?id=1194807
There is a problem when zapping partitions that are part of a disk with a GPT partition table: DeepSea zaps the partition but not the whole disk, so the partition table remains and prevents the OSD from being deployed again later in SES6.
This mainly happens on clusters upgraded from SES5 that have non-LVM disks.
It can also happen on SES6 clusters where, for whatever reason, it was decided to create non-LVM disks, but in that case it also failed because DeepSea was trying to read the raw device assignments from /etc/ceph/osd/* and that folder doesn't exist on a SES6 cluster that wasn't upgraded from SES5.
This branch uses ceph-volume raw list to obtain the device to zap, instead of reading /etc/ceph/osd/*.
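To illustrate the kind of whole-disk cleanup this is after: given a partition such as /dev/vdc2, the parent disk can be found via lsblk's PKNAME column and then wiped with sgdisk so that no stale GPT metadata survives. This is only a rough sketch of the idea, not the code in this branch:

```python
import subprocess

def parent_disk(partition):
    """Return the parent disk of a partition, e.g. /dev/vdc2 -> /dev/vdc."""
    # PKNAME is the parent kernel device name reported by lsblk
    name = subprocess.check_output(
        ["lsblk", "-no", "pkname", partition]).decode().strip()
    return "/dev/{}".format(name) if name else partition

def zap_whole_disk(disk):
    """Destroy the GPT (and MBR) data structures on the whole disk."""
    subprocess.check_call(["sgdisk", "--zap-all", disk])

# e.g. zap_whole_disk(parent_disk("/dev/vdc2")) wipes /dev/vdc entirely,
# so a leftover partition table can't block redeployment of the OSD
```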