Open gebn opened 5 years ago
No need to do fancy stuff with checking if values have changed (that's unreliable anyway).
One or more commands failing quickly (with an error that we don't currently log) is the reliable indicator of issues.
On receipt of an error that is not a timeout, establish a new session and re-try the collection (up to the ctx expiry, with back-off).
Need to create an internal collector that we can back-off. We cannot send the same metric twice to Collect()'s channel, so we need to build up a struct of the scrape result, then call .Send(ch) on it when we're happy with it.
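One way the buffered scrape result could look, as a dependency-free sketch. The `metric` type is a stand-in for `prometheus.Metric`, and `scrapeResult`, `add`, `collectOnce` and `errIncomplete` are hypothetical names. In the real exporter the channel would be the one passed to Collect().

```go
package main

import "errors"

// errIncomplete is a hypothetical error for an attempt that failed part-way.
var errIncomplete = errors.New("collection attempt did not complete")

// metric is a stand-in for prometheus.Metric so the sketch needs no dependencies.
type metric struct {
	name  string
	value float64
}

// scrapeResult buffers the metrics gathered by one collection attempt.
type scrapeResult struct {
	metrics []metric
}

// add records a metric without sending it anywhere yet.
func (r *scrapeResult) add(name string, value float64) {
	r.metrics = append(r.metrics, metric{name: name, value: value})
}

// Send forwards the buffered metrics to ch. It is called at most once per
// scrape, on the attempt we are happy with.
func (r *scrapeResult) Send(ch chan<- metric) {
	for _, m := range r.metrics {
		ch <- m
	}
}

// collectOnce runs attempt up to maxAttempts times, each with an empty result,
// and sends only the first result whose attempt succeeded. Partial results
// from failed attempts are dropped, so no metric reaches ch twice.
func collectOnce(ch chan<- metric, maxAttempts int, attempt func(*scrapeResult) error) bool {
	for i := 0; i < maxAttempts; i++ {
		res := &scrapeResult{}
		if err := attempt(res); err != nil {
			continue // discard whatever this attempt managed to gather
		}
		res.Send(ch)
		return true
	}
	return false
}
```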
A badly behaved BMC will look like:
We never treat a BMC "differently", e.g. by putting it into a special mode - we just handle what we see, which is a better approach.
Disable command retries, increase timeout.
Increasing the per-command timeout to 1m fixed this, so it's caused by sending retries. We could extend the timeout, however the packet will likely be sent along the same path. Probably better to let it fail and re-establish the session (which may also fail, but that's better than leaving it in a bad state).
Having the option to disable retries in a session (setting the per-command timeout to 0) could be a solution here. Or just disable them full-stop - it shouldn't be possible to hold the library wrong (much). Or is something like a 5s timeout enough to mitigate the behaviour? Shame to remove this feature when most BMCs handle it correctly. The 60s timeout hasn't led to a reduction in scrape success rate.
Retry if: no response, malformed response, or completion code is CompletionCodeNodeBusy.
Close (and maybe re-establish session) if: no response, malformed/truncated response.
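The retry and close conditions above can be written as two predicates over the outcome of a command. This is only a sketch: the `outcome` type and its constants are hypothetical, and the code takes both lists literally, so no response and a malformed response appear under both.

```go
package main

// outcome classifies what came back for one command (hypothetical type).
type outcome int

const (
	outcomeOK         outcome = iota
	outcomeNoResponse         // nothing arrived before the timeout
	outcomeMalformed          // malformed or truncated response
	outcomeNodeBusy           // completion code CompletionCodeNodeBusy
)

// shouldRetry reports whether the command is a candidate for a retry.
func shouldRetry(o outcome) bool {
	switch o {
	case outcomeNoResponse, outcomeMalformed, outcomeNodeBusy:
		return true
	}
	return false
}

// shouldClose reports whether the session should be closed (and maybe
// re-established). A busy node answered correctly, so its session is kept.
func shouldClose(o outcome) bool {
	return o == outcomeNoResponse || o == outcomeMalformed
}
```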
Given the suspected buffer implementation, it would likely affect session-less commands as well as those sent within a session, so the timeout would have to be applied to both. Need to try spamming an affected BMC with session-less commands to verify behaviour.
Sometimes it's the whole field, sometimes it's just a few bytes. Definitely a bug, as if it were corruption, the checksum validation would fail (is the checksum definitely there for these packets?).
If a BMC shows strangeness, it's fine to treat it with care (e.g. new connection each scrape) until the process is killed, even between Close()s on the collector. Could be an old socket full stop - not just the connection on it.
bmc_up and/or chassis_cooling_fault and/or chassis_powered_on flaps. You only need the first one to identify this. If bmc_up == 0, close the session before finishing collecting, so it is re-established next scrape. The debug mode in #23 would help here. Particularly the error returned by the command exec attempt.