SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0
161 stars 75 forks source link

Bump OSD available timeout to 120, make it configurable (bsc#1192520) #1882

Closed tserong closed 2 years ago

tserong commented 2 years ago

When restarting OSDs, we wait for them to become available before proceeding to the next OSD. Previosly, this wait was in a loop with a timeout of 60 seconds. This commit increases the default timeout to 120 seconds, and also makes it configurable in case that's still too short in some circumstances. If you need a different timeout, specify it in /srv/pillar/ceph/stack/global.yml, e.g.:

osd_available_timeout: 240

Note that the way the loop is implemented, the sleep in the loop doubles on each iteration. The runtime of the loop will thus be pretty close to a power of 2, e.g.: 1, 2, 4, 8, 16, 32, 64, ... seconds. It won't stop when the exact timeout is reached, rather it will stop at approximately the next power of 2 seconds greater than the specified timeout.

Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1192520 Signed-off-by: Tim Serong tserong@suse.com

tserong commented 2 years ago

Please ignore the lint error E: 1, 0: Module file has the wrong file permissions(expected 0o644): 0o664 (file-perms) -- it's been like that forever, along with all the other modules files, and I'm not changing it now.

tserong commented 2 years ago

Thanks, I'll try to rework that slightly so it's less confusing. With the current implementation, if you set 240, it will actually run for slightly over 255 seconds, not 127.5, but I had to instrument it with a couple of calls to time.monotonic() at the start and end, then run it myself to verify that, which definitely means it's too confusingly implemented ;-)

tserong commented 2 years ago

I've reworked it somewhat, but TBH I'm not sure if it's less confusing or not :-/

The diff is a bit ugly - it's easier to view side-by-side/split than unified.

Note that I've changed the behaviour so the sleep interval only doubles until you get to a minute, then it just sleeps for a minute on each iteration, rather than doubling indefinitely.

If the rework looks sane, I'll squash the commits and update the commit messages properly.

tserong commented 2 years ago

(lint failure is file perms again and should be ignored for this PR)

tserong commented 2 years ago

Thanks @trociny! I've used delay = min(delay * 2, 60) because I think that's neater, and I've also squashed the commits and fixed up the commit message.

tserong commented 2 years ago

(That last force push was just a rebase)