Previously, when restarting OSDs, we checked only that the OSD processes were running (using psutil), not that the OSDs had completely started up. If an OSD took longer than usual to start, the restart sequence could move on to the next node, and a rolling restart could leave multiple OSDs down across multiple nodes at once.
This commit adds a call to `ceph daemon osd.$OSD_ID status` for each OSD to ensure that the OSD has finished starting up properly. In the normal case this adds a couple of seconds per OSD to the restart process. In the worst case, it hits a timeout at 64 seconds and bails out, and you then need to investigate why the OSD didn't come up.
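The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the helper names `osd_status` and `wait_for_osd`, the doubling retry delay, and the choice to treat a `"state"` of `"active"` in the JSON output as "finished starting" are all assumptions made for the example. Only the command itself and the 64 second limit come from the description.

```python
import json
import subprocess
import time


def osd_status(osd_id, run=subprocess.run):
    """Return the parsed output of `ceph daemon osd.<id> status`, or None on failure.

    Hypothetical helper: `run` is injectable so the sketch can be tested
    without a real cluster.
    """
    try:
        proc = run(
            ["ceph", "daemon", f"osd.{osd_id}", "status"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=False,
        )
    except OSError:
        # The ceph binary is missing or could not be executed.
        return None
    if proc.returncode != 0:
        # Typically the admin socket is not available yet.
        return None
    try:
        status = json.loads(proc.stdout)
    except ValueError:
        return None
    return status if isinstance(status, dict) else None


def wait_for_osd(osd_id, timeout=64, run=subprocess.run,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll the OSD until it reports readiness or `timeout` seconds pass.

    Assumption: an OSD counts as started once its status reports
    state == "active". The retry delay doubles after each attempt.
    """
    deadline = clock() + timeout
    delay = 1
    while True:
        status = osd_status(osd_id, run=run)
        if status is not None and status.get("state") == "active":
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))
        delay *= 2
```

A caller restarting OSDs node by node would invoke `wait_for_osd(osd_id)` for each OSD after confirming that the process exists, and stop the rolling restart if it returns `False`.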
Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1185422
Signed-off-by: Tim Serong <tserong@suse.com>
Checklist: