SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.

Take drivegroup targets into account for rebuild.node (bsc#1198929) #1890

Closed tserong closed 2 years ago

tserong commented 2 years ago

Previously, salt-run rebuild.node $NODE would override the targets specified in drive_groups.yml, and attempt to apply all drive groups to the specified node (of course only the first one would succeed and the rest would likely do nothing). This commit makes sure that only the drive groups whose configured targets actually match the specified node are applied (see https://bugzilla.suse.com/show_bug.cgi?id=1198929)
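To make the idea concrete, here is a minimal sketch of the targeting logic (function and variable names are illustrative only, not the actual DeepSea code): each drive group's own target is combined with the node being rebuilt into a Salt compound expression, so any group whose target doesn't cover that node simply matches no minions and is skipped.

def rebuild_targets(drive_groups, node):
    """
    Illustrative sketch only: combine each drive group's configured
    target with the node being rebuilt into a Salt compound match.
    Groups whose target doesn't cover the node end up matching no
    minions, so dg.deploy does nothing for them.
    """
    targets = {}
    for name, spec in drive_groups.items():
        dg_target = spec.get('target', '*')
        # e.g. "( node[34]* ) and ( node3.ses6.test )"
        targets[name] = "( {} ) and ( {} )".format(dg_target, node)
    return targets

# With the drive_groups.yml from the example below:
#   rebuild_targets(groups, 'node3.ses6.test')
#   -> {'all_devs':  '( node[12]* ) and ( node3.ses6.test )',
#       'shared_db': '( node[34]* ) and ( node3.ses6.test )'}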

There's also a second commit in here which makes rebuild.node work if there are no PGs at all, which I hit while testing the above. It's an edge case I have difficulty imagining anyone hitting on a real cluster, because by the time you're rebuilding a node you're probably running a cluster with actual data in it.
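The zero-PG guard is conceptually along these lines (again only a hedged sketch with assumed names, not the actual patch): with no PGs at all there is nothing to wait on, so the safety check should return immediately rather than waiting for active+clean PGs that will never appear.

def pgs_are_safe(pg_states):
    """
    Hedged sketch: pg_states is assumed to be a mapping of PG state
    name -> count, e.g. {'active+clean': 128}.  An empty cluster has
    no PGs, so nothing can be unclean and we must not block forever.
    """
    total = sum(pg_states.values())
    if total == 0:
        return True
    return pg_states.get('active+clean', 0) == total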

tserong commented 2 years ago

Just to demonstrate, let's say we have a four node cluster, each node having 1 SSD and 2 spinners, with two drivegroups defined:

# cat /srv/salt/ceph/configuration/files/drive_groups.yml
all_devs:
  target: 'node[12]*'
  data_devices:
    all: true

shared_db:
  target: 'node[34]*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

When deploying initially (stage 3), we see:

# salt-run --log-level=warning state.orch ceph.stage.3
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target node[12]*
Found DriveGroup <shared_db>
Calling dg.deploy on compound target node[34]*

This correctly deploys the drive groups on the nodes you'd expect (as always happened). In this example, all three disks are used as OSDs on each of the first two nodes, and two OSDs are created on each of the other two nodes (the third disk there is the shared db device, so it isn't listed in the ceph osd tree output):

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
 -1       0.09354 root default                           
 -3       0.02339     host node1                         
  2   hdd 0.00780         osd.2      up  1.00000 1.00000 
  5   hdd 0.00780         osd.5      up  1.00000 1.00000 
  0   ssd 0.00780         osd.0      up  1.00000 1.00000 
 -5       0.02339     host node2                         
  3   hdd 0.00780         osd.3      up  1.00000 1.00000 
  4   hdd 0.00780         osd.4      up  1.00000 1.00000 
  1   ssd 0.00780         osd.1      up  1.00000 1.00000 
-10       0.02338     host node3                         
  7   hdd 0.01169         osd.7      up  1.00000 1.00000 
  9   hdd 0.01169         osd.9      up  1.00000 1.00000 
-13       0.02338     host node4                         
  6   hdd 0.01169         osd.6      up  1.00000 1.00000 
  8   hdd 0.01169         osd.8      up  1.00000 1.00000 

Prior to this fix, if I were to rebuild node3 or node4, it would end up just deploying all drive groups in sequence on that node, which is wrong; in this case it means node3 ends up with three standalone OSDs instead of the two with a shared db that we expected:

# salt-run --log-level=warning rebuild.node node3.ses6.test
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target node3.ses6.test
Found DriveGroup <shared_db>
Calling dg.deploy on compound target node3.ses6.test

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
 -1       0.09355 root default                           
 -3       0.02339     host node1                         
  2   hdd 0.00780         osd.2      up  1.00000 1.00000 
  5   hdd 0.00780         osd.5      up  1.00000 1.00000 
  0   ssd 0.00780         osd.0      up  1.00000 1.00000 
 -5       0.02339     host node2                         
  3   hdd 0.00780         osd.3      up  1.00000 1.00000 
  4   hdd 0.00780         osd.4      up  1.00000 1.00000 
  1   ssd 0.00780         osd.1      up  1.00000 1.00000 
-10       0.02339     host node3                         
  9   hdd 0.00780         osd.9      up  1.00000 1.00000 
 10   hdd 0.00780         osd.10     up  1.00000 1.00000 
  7   ssd 0.00780         osd.7      up  1.00000 1.00000 
-13       0.02338     host node4                         
  6   hdd 0.01169         osd.6      up  1.00000 1.00000 
  8   hdd 0.01169         osd.8      up  1.00000 1.00000 

Now, with this fix, when rebuilding node3 or node4 we see:

# salt-run --log-level=warning rebuild.node node3.ses6.test
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target ( node[12]* ) and ( node3.ses6.test )
No minions matched the target. No command was sent, no jid was assigned.
Found DriveGroup <shared_db>
Calling dg.deploy on compound target ( node[34]* ) and ( node3.ses6.test )

Note how the first drivegroup's compound target, ( node[12]* ) and ( node3.ses6.test ), doesn't match any minions, so it isn't applied, whereas the second drivegroup's target does match and is applied to the specified node, and we're back to what we expected to see:

# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
 -1       0.09354 root default                           
 -3       0.02339     host node1                         
  2   hdd 0.00780         osd.2      up  1.00000 1.00000 
  5   hdd 0.00780         osd.5      up  1.00000 1.00000 
  0   ssd 0.00780         osd.0      up  1.00000 1.00000 
 -5       0.02339     host node2                         
  3   hdd 0.00780         osd.3      up  1.00000 1.00000 
  4   hdd 0.00780         osd.4      up  1.00000 1.00000 
  1   ssd 0.00780         osd.1      up  1.00000 1.00000 
-10       0.02338     host node3                         
  7   hdd 0.01169         osd.7      up  1.00000 1.00000 
  9   hdd 0.01169         osd.9      up  1.00000 1.00000 
-13       0.02338     host node4                         
  6   hdd 0.01169         osd.6      up  1.00000 1.00000 
  8   hdd 0.01169         osd.8      up  1.00000 1.00000
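As an aside, if you want to confirm by hand which minions a given compound expression matches, Salt's Python client can evaluate it directly (this is just a verification aid, not part of the change):

import salt.client

client = salt.client.LocalClient()
# Only node3.ses6.test should answer for the second drive group's target.
matched = client.cmd('( node[34]* ) and ( node3.ses6.test )',
                     'test.ping', tgt_type='compound')
print(sorted(matched))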