Wait for OSDs to be active after restarting (bsc#1185422)

tserong commented 3 years ago

Previously, when restarting OSDs, we checked that the OSD processes were running (using psutil), but not that the OSDs had completely started up. This means that if an OSD took longer than usual to start, the restart sequence could move along to the next node, and we could end up with multiple OSDs down across multiple nodes while doing a rolling restart.

This commit adds a call to `ceph daemon osd.$OSD_ID status` for each OSD, to ensure that the OSD has finished starting up properly. In the normal case this will add a couple of seconds per OSD to the restart process. Worst case, it'll hit a timeout at 64 seconds and bail out, at which point you need to investigate why the OSD didn't come up.
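For reviewers who want a feel for the check, here is a minimal sketch of the idea, not the code in this commit. It assumes the JSON printed by `ceph daemon osd.$OSD_ID status` has a `state` field that reads `active` once the OSD is up. The function names `osd_state` and `wait_for_osd_active` and the doubling retry delay are hypothetical; only the 64 second limit comes from the description above.

```python
import json
import subprocess
import time


def osd_state(osd_id, run=subprocess.run):
    """Return the state reported by the OSD admin socket, or None if it
    is not answering yet. `run` is injectable so this can be tested
    without a Ceph cluster (hypothetical helper, not from the commit)."""
    try:
        result = run(
            ["ceph", "daemon", "osd.{}".format(osd_id), "status"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=False,
        )
    except OSError:
        return None  # ceph binary missing or not executable
    if result.returncode != 0:
        return None  # admin socket not available yet
    try:
        status = json.loads(result.stdout)
    except ValueError:
        return None  # output was not valid JSON
    # Assumption: the status output is a JSON object with a "state" key.
    return status.get("state") if isinstance(status, dict) else None


def wait_for_osd_active(osd_id, timeout=64, get_state=osd_state,
                        sleep=time.sleep):
    """Poll until the OSD reports "active" or `timeout` seconds of
    sleeping have passed. Returns True on success, False on timeout.
    Only sleep time is counted, not the time the command itself takes."""
    delay = 1
    waited = 0
    while True:
        if get_state(osd_id) == "active":
            return True
        if waited >= timeout:
            return False
        step = min(delay, timeout - waited)
        sleep(step)
        waited += step
        delay *= 2  # assumed backoff: 1, 2, 4, ... seconds
```

A rolling restart would call `wait_for_osd_active(osd_id)` for each OSD on a node after restarting it, and stop before moving to the next node if any call returns `False`.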

Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1185422
Signed-off-by: Tim Serong <tserong@suse.com>

Checklist:

[ ] Added unit tests and/or functional tests
[ ] Adapted documentation
[ ] Referenced issues or internal bugtracker
[ ] Ran integration tests successfully (trigger with "@susebot run teuthology" in a GitHub comment; see the wiki for more information)

tserong commented 3 years ago

@susebot run teuthology

susebot commented 3 years ago

Commit 21514d60d444e0b99a77bfb1bc0e99a982147f6f is OK for suite deepsea:tier2. Check tests results in the Jenkins job: https://storage-ci.suse.de/job/pr-deepsea/494/

SUSE / DeepSea

Wait for OSDs to be active after restarting (bsc#1185422) #1875