chaos / powerman

cluster power control
GNU General Public License v2.0

redfishpower: adapt status polling interval #167

Closed: chu11 closed this 7 months ago

chu11 commented 7 months ago

Problem: The status polling interval is hard-coded to 1 second. This can result in an excessive number of polling messages being sent when some hardware is known to take 20-50 seconds to complete a power operation.

Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long power operations typically take to complete on hardware. The status polling interval begins at 1 second and is capped at 4 seconds.

chu11 commented 7 months ago

> It seems like a fixed polling interval is unlikely to be anybody's choice and we could skip that?

It was added mostly as an "emergency" fallback, in case some random board we come across doesn't quite fit the adapted algorithm I came up with. That commit could be removed; it isn't really necessary.

> would an exponential backoff with a slower start allow you to avoid the need for the more logic based approach you've got? E.g.

Hmmmm, I guess that could work. On the boards I've seen that take 47-50 seconds, I'm just afraid of the exponential backoff growing too large by that point, i.e. the polling interval has grown to 10+ seconds. In your example the interval would have grown to ~13 seconds and the overall wait would be 56 seconds. With my current adapted one we'd wait 48 or 52 seconds.

I think it's sort of a draw? The adapted one was calculated specifically for the cases I've seen.
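The arithmetic above can be checked with a small simulation. This is only a sketch: the example being discussed is not shown in this thread, so the starting interval of 1 second and the multiplier of 1.3 are assumptions (1.3X is the multiplier mentioned later in the conversation), and `first_poll_at_or_after` is a hypothetical helper, not a powerman function.

```python
def first_poll_at_or_after(done_at, start=1.0, factor=1.3):
    """Simulate uncapped exponential backoff polling.

    Returns (poll_time, interval): the time of the first poll that
    happens at or after `done_at` seconds, and the polling interval
    that led up to that poll. `start` and `factor` are assumed values.
    """
    interval = start
    elapsed = start  # the first poll fires `start` seconds in
    while elapsed < done_at:
        interval *= factor   # grow the interval before the next poll
        elapsed += interval  # time at which the next poll fires
    return elapsed, interval


# A board that finishes its power operation after 50 seconds:
poll_time, interval = first_poll_at_or_after(50.0)
print(round(poll_time, 1), round(interval, 1))  # 56.4 13.8
```

Under these assumptions a board that finishes anywhere between 47 and 50 seconds is only noticed at about 56 seconds, because the poll before that fires at about 42.6 seconds.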

garlick commented 7 months ago

Oh I was figuring you would cap the delay at 10s or so (or whatever makes sense). If we need to make it tunable, then I would add a command that lets you set the start, multiplicative factor, and cap, instead of allowing only a non-adaptive delay to be configured. But I'm not sure it's needed.
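A capped backoff with the three tunables named here (start, multiplicative factor, cap) could look roughly like the sketch below. The generator name `backoff_delays` and the default values are assumptions for illustration only; this is not the code in the pull request.

```python
import itertools


def backoff_delays(start=1.0, factor=2.0, cap=10.0):
    """Yield polling delays that grow by `factor` and never exceed `cap`.

    All three parameters are hypothetical tunables; the defaults are
    illustrative, not values taken from redfishpower.
    """
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)  # clamp so the delay stops growing


# First six delays with start=1s, factor=2, cap=10s:
print(list(itertools.islice(backoff_delays(), 6)))
# [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Once the cap is reached, the worst-case extra wait after an operation completes is bounded by the cap rather than by an ever-growing interval.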

chu11 commented 7 months ago

> Oh I was figuring you would cap the delay at 10s or so (or whatever makes sense)

Ahhh ok. Then perhaps capping it at ... ehhh 5 seconds or so would be a good balance. Maybe the multiplicative factor would be something like 1.1 or 1.2 then. I'll play around with it.

chu11 commented 7 months ago

re-pushed, removing the setstatuspollinginterval configuration, which we deemed unnecessary.

I ended up just keeping the "logic based" exponential backoff, b/c using a 1.3X (or similar) multiplier ended up not being as logically simple as we would have thought (i.e. rules like "if this is the first poll, it is 1 second" and "if this poll is > X seconds, cap at X seconds" added just as much logic).

chu11 commented 7 months ago

re-pushed

garlick commented 7 months ago

Perfect - thanks!