SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

Wait for OSDs to be active after restarting (bsc#1185422) #1875

Status: Closed (tserong closed this 3 years ago)

tserong commented 3 years ago

Previously, when restarting OSDs, we checked that the OSD processes were running (using psutil), but not that the OSDs had finished starting up. As a result, if an OSD took longer than usual to start, the restart sequence could move on to the next node, and a rolling restart could leave multiple OSDs down across multiple nodes at the same time.

This commit adds a call to ceph daemon osd.$OSD_ID status for each OSD, to ensure that the OSD has finished starting up properly. In the normal case this adds a couple of seconds per OSD to the restart process. In the worst case, it hits a timeout at 64 seconds and bails out, at which point you need to investigate why the OSD didn't come up.
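
For readers who want a concrete picture of the wait described above, here is a minimal Python sketch. It is not the code from this commit. The function names (osd_state, wait_for_osd_active), the doubling back-off between polls, and the check for a "state" value of "active" in the JSON printed by ceph daemon osd.$OSD_ID status are all assumptions made for illustration; only the command itself and the 64-second limit come from the description.

```python
import json
import subprocess
import time


def osd_state(osd_id, run=subprocess.run):
    """Return the "state" field reported by `ceph daemon osd.<id> status`.

    Returns None if the command cannot be run, exits non-zero, or does not
    print a JSON object (for example because the admin socket is not up yet).
    """
    cmd = ["ceph", "daemon", "osd.{}".format(osd_id), "status"]
    try:
        proc = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        return None
    if proc.returncode != 0:
        return None
    try:
        return json.loads(proc.stdout).get("state")
    except (ValueError, AttributeError):
        return None


def wait_for_osd_active(osd_id, timeout=64, get_state=osd_state,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll one OSD until it reports "active" or `timeout` seconds pass.

    The delay between polls doubles each time (1, 2, 4, ... seconds); this
    back-off schedule is an assumption, not taken from the commit.
    """
    delay = 1
    deadline = clock() + timeout
    while True:
        if get_state(osd_id) == "active":
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay *= 2
```

A rolling restart built on this sketch would call wait_for_osd_active() for every OSD on a node and stop before moving to the next node if any call returns False.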

Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1185422
Signed-off-by: Tim Serong <tserong@suse.com>


tserong commented 3 years ago

@susebot run teuthology

susebot commented 3 years ago

Commit 21514d60d444e0b99a77bfb1bc0e99a982147f6f is OK for suite deepsea:tier2. Check tests results in the Jenkins job: https://storage-ci.suse.de/job/pr-deepsea/494/