Just to demonstrate, let's say we have a four-node cluster, each node having one SSD and two spinners, with two drive groups defined:
# cat /srv/salt/ceph/configuration/files/drive_groups.yml
all_devs:
  target: 'node[12]*'
  data_devices:
    all: true
shared_db:
  target: 'node[34]*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
When deploying initially (stage 3), we see:
# salt-run --log-level=warning state.orch ceph.stage.3
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target node[12]*
Found DriveGroup <shared_db>
Calling dg.deploy on compound target node[34]*
This correctly deploys the drive groups on the nodes you'd expect (as always happened). In this example, three disks are used as OSDs on each of the first two nodes, and two OSDs end up on each of the other two nodes (the third disk is the shared db, so it isn't listed in the ceph osd tree output):
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.09354 root default
-3 0.02339 host node1
2 hdd 0.00780 osd.2 up 1.00000 1.00000
5 hdd 0.00780 osd.5 up 1.00000 1.00000
0 ssd 0.00780 osd.0 up 1.00000 1.00000
-5 0.02339 host node2
3 hdd 0.00780 osd.3 up 1.00000 1.00000
4 hdd 0.00780 osd.4 up 1.00000 1.00000
1 ssd 0.00780 osd.1 up 1.00000 1.00000
-10 0.02338 host node3
7 hdd 0.01169 osd.7 up 1.00000 1.00000
9 hdd 0.01169 osd.9 up 1.00000 1.00000
-13 0.02338 host node4
6 hdd 0.01169 osd.6 up 1.00000 1.00000
8 hdd 0.01169 osd.8 up 1.00000 1.00000
Prior to this fix, if I were to rebuild node3 or node4, it would end up just deploying all drive groups in sequence on that node, which is wrong, and in this case means the node ends up with three standalone OSDs, not two with a shared db as expected:
# salt-run --log-level=warning rebuild.node node3.ses6.test
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target node3.ses6.test
Found DriveGroup <shared_db>
Calling dg.deploy on compound target node3.ses6.test
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.09355 root default
-3 0.02339 host node1
2 hdd 0.00780 osd.2 up 1.00000 1.00000
5 hdd 0.00780 osd.5 up 1.00000 1.00000
0 ssd 0.00780 osd.0 up 1.00000 1.00000
-5 0.02339 host node2
3 hdd 0.00780 osd.3 up 1.00000 1.00000
4 hdd 0.00780 osd.4 up 1.00000 1.00000
1 ssd 0.00780 osd.1 up 1.00000 1.00000
-10 0.02339 host node3
9 hdd 0.00780 osd.9 up 1.00000 1.00000
10 hdd 0.00780 osd.10 up 1.00000 1.00000
7 ssd 0.00780 osd.7 up 1.00000 1.00000
-13 0.02338 host node4
6 hdd 0.01169 osd.6 up 1.00000 1.00000
8 hdd 0.01169 osd.8 up 1.00000 1.00000
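The fix boils down to how the dg.deploy target is constructed. As a minimal Python sketch (hypothetical names throughout; drive_groups and deploy stand in for the parsed drive_groups.yml and the dg.deploy call, this is not the actual runner code):

def deploy(name, target):
    # Stand-in for the real dg.deploy call; echoes the runner's log lines.
    print('Found DriveGroup <{}>'.format(name))
    print('Calling dg.deploy on compound target {}'.format(target))

def rebuild_old(drive_groups, node):
    # Old (buggy) behaviour: every drive group is aimed directly at the
    # rebuilt node, ignoring the targets in drive_groups.yml.
    for name, dg in drive_groups.items():
        deploy(name, target=node)

def rebuild_fixed(drive_groups, node):
    # Fixed behaviour: keep the configured target and intersect it with
    # the rebuilt node, so non-matching drive groups hit zero minions.
    for name, dg in drive_groups.items():
        deploy(name, target='( {} ) and ( {} )'.format(dg['target'], node))

drive_groups = {
    'all_devs':  {'target': 'node[12]*'},
    'shared_db': {'target': 'node[34]*'},
}
rebuild_fixed(drive_groups, 'node3.ses6.test')
# Calling dg.deploy on compound target ( node[12]* ) and ( node3.ses6.test )
# Calling dg.deploy on compound target ( node[34]* ) and ( node3.ses6.test )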
Now, with this fix, when rebuilding node3 or node4, we see:
# salt-run --log-level=warning rebuild.node node3.ses6.test
[...]
Found DriveGroup <all_devs>
Calling dg.deploy on compound target ( node[12]* ) and ( node3.ses6.test )
No minions matched the target. No command was sent, no jid was assigned.
Found DriveGroup <shared_db>
Calling dg.deploy on compound target ( node[34]* ) and ( node3.ses6.test )
Note how the first drive group (compound target ( node[12]* ) and ( node3.ses6.test )) doesn't match any minions, so isn't applied, whereas the second drive group does match and is applied to the specified node, and we're back to what we expected to see:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.09354 root default
-3 0.02339 host node1
2 hdd 0.00780 osd.2 up 1.00000 1.00000
5 hdd 0.00780 osd.5 up 1.00000 1.00000
0 ssd 0.00780 osd.0 up 1.00000 1.00000
-5 0.02339 host node2
3 hdd 0.00780 osd.3 up 1.00000 1.00000
4 hdd 0.00780 osd.4 up 1.00000 1.00000
1 ssd 0.00780 osd.1 up 1.00000 1.00000
-10 0.02338 host node3
7 hdd 0.01169 osd.7 up 1.00000 1.00000
9 hdd 0.01169 osd.9 up 1.00000 1.00000
-13 0.02338 host node4
6 hdd 0.01169 osd.6 up 1.00000 1.00000
8 hdd 0.01169 osd.8 up 1.00000 1.00000
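If you want to check by hand which minions a given drive group would cover during a rebuild, the same compound matching can be driven from Salt's Python API (a sketch, assuming it runs on the master with the usual LocalClient available):

import salt.client

def matching_minions(compound_target):
    # Returns the minion IDs matched by a compound expression, e.g.
    # '( node[34]* ) and ( node3.ses6.test )' -> ['node3.ses6.test'],
    # while '( node[12]* ) and ( node3.ses6.test )' matches nothing.
    client = salt.client.LocalClient()
    return sorted(client.cmd(compound_target, 'test.ping',
                             tgt_type='compound'))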
Previously,
salt-run rebuild.node $NODE
would override the targets specified in drive_groups.yml and attempt to apply all drive groups to the specified node (of course only the first one would succeed; the rest would likely do nothing). This commit makes sure that only the drive groups whose configured targets actually match the specified node are applied (see https://bugzilla.suse.com/show_bug.cgi?id=1198929).

There's also a second commit here which makes rebuild.node work if there are no PGs at all, which I hit while testing the above. This is an edge case I have difficulty imagining anyone hitting on a real cluster, because by the time you get to rebuilding a node, you're probably running a cluster with actual data in it.
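For that second commit, the idea can be pictured as something like the following (a hypothetical sketch, not the actual change: with zero PGs there's nothing to drain, so the rebuild shouldn't sit waiting for PGs to become active+clean):

import json
import subprocess

def cluster_has_pgs():
    # 'ceph status --format=json' includes pgmap/num_pgs; a brand-new
    # cluster with no pools at all reports 0 here.
    status = json.loads(subprocess.check_output(
        ['ceph', 'status', '--format=json']))
    return status['pgmap']['num_pgs'] > 0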