gebn / bmc_exporter

Exposes Baseboard Management Controller data in Prometheus format.
GNU Lesser General Public License v3.0

Some BMCs start sending garbage values after a while #29

Open gebn opened 5 years ago

gebn commented 5 years ago

Sometimes it's the whole field, sometimes it's just a few bytes. This is definitely a bug: if it were wire corruption, the checksum validation would fail (is the checksum definitely present for these packets?).

If a BMC shows strangeness, it's fine to treat it with care (e.g. a new connection for each scrape) for the rest of the process's lifetime, even across Close()s on the collector.

It could be the socket going stale full stop, not just the connection on it. bmc_up and/or chassis_cooling_fault and/or chassis_powered_on flap; bmc_up alone is enough to identify this. If bmc_up == 0, close the session before finishing the collection, so it is re-established on the next scrape. The debug mode in #23 would help here, particularly the error returned by the command execution attempt.
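
A minimal sketch of that tear-down, assuming a hypothetical target interface with Ping() and Close() methods rather than the exporter's real types; the only point is that a failed liveness check zeroes bmc_up and closes the session before Collect() returns:

```go
package bmcexporter

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// target is a stand-in for whatever wraps the BMC session; Ping and Close
// are hypothetical methods, not the exporter's real API.
type target interface {
	Ping(ctx context.Context) error
	Close(ctx context.Context) error
}

type collector struct {
	t      target
	upDesc *prometheus.Desc // describes the bmc_up gauge
}

func (c *collector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.upDesc
}

func (c *collector) Collect(ch chan<- prometheus.Metric) {
	ctx := context.Background() // the real exporter would use the scrape context
	up := 1.0
	if err := c.t.Ping(ctx); err != nil {
		up = 0
		// bmc_up == 0: close the session before finishing the collection, so
		// it is re-established on the next scrape rather than reused.
		_ = c.t.Close(ctx)
	}
	ch <- prometheus.MustNewConstMetric(c.upDesc, prometheus.GaugeValue, up)
}
```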

gebn commented 5 years ago

No need to do anything fancy like checking whether values have changed (that's unreliable anyway). One or more commands failing quickly, with an error that we don't currently log, is the reliable indicator of trouble. On receipt of an error that is not a timeout, establish a new session and retry the collection (up to the ctx expiry, with back-off). We need to create an internal collector that we can back off. We cannot send the same metric twice to Collect()'s channel, so we need to build up a struct with the scrape result, then call .Send(ch) on it when we're happy with it; see the sketch below.
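
A rough sketch of that accumulate-then-send shape, with placeholder fields and descriptors for the metrics mentioned above (bmc_up, chassis_powered_on, chassis_cooling_fault); none of this is the exporter's actual layout:

```go
package bmcexporter

import "github.com/prometheus/client_golang/prometheus"

// scrapeResult accumulates everything gathered during a single scrape, so
// nothing is written to Collect()'s channel until the collection (including
// any retried attempt) has settled.
type scrapeResult struct {
	up           float64
	poweredOn    float64
	coolingFault float64
}

// Illustrative descriptors; the real exporter defines its own.
var (
	upDesc           = prometheus.NewDesc("bmc_up", "1 if the BMC responded to all commands.", nil, nil)
	poweredOnDesc    = prometheus.NewDesc("chassis_powered_on", "1 if the chassis reports being powered on.", nil, nil)
	coolingFaultDesc = prometheus.NewDesc("chassis_cooling_fault", "1 if the chassis reports a cooling fault.", nil, nil)
)

// Send flushes the accumulated result to the channel exactly once.
func (r *scrapeResult) Send(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(upDesc, prometheus.GaugeValue, r.up)
	ch <- prometheus.MustNewConstMetric(poweredOnDesc, prometheus.GaugeValue, r.poweredOn)
	ch <- prometheus.MustNewConstMetric(coolingFaultDesc, prometheus.GaugeValue, r.coolingFault)
}
```

Because nothing touches the channel until Send(), a failed first attempt can simply be discarded and the collection re-run against a fresh session without risking duplicate metrics.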

A badly behaved BMC will look like:

  1. Establish session, collect values.
  2. Session exists, collect values, error, re-establish, collect values.
  3. Session exists, collect values, error, re-establish, collect values. ...

We never treat a BMC "differently", e.g. by putting it into a special mode - we just handle what we see, which is a better approach.

gebn commented 5 years ago

Disable command retries, increase timeout.

gebn commented 5 years ago

Increasing the per-command timeout to 1m fixed this, so the problem is caused by sending retries. We could extend the timeout, but the retried packet will likely be sent along the same path. It's probably better to let the command fail and re-establish the session (which may also fail, but that's better than leaving it in a bad state).

gebn commented 5 years ago

Having the option to disable retries within a session (by setting the per-command timeout to 0) could be a solution here. Or we could just disable them full stop - it shouldn't be possible to hold the library wrong (much). Or would something like a 5s timeout be enough to mitigate the behaviour? It would be a shame to remove this feature when most BMCs handle it correctly. The 60s timeout hasn't led to a reduction in scrape success rate.

gebn commented 5 years ago

  1. Scrape arrives
    • The scrape timeout starts in the /bmc handler
  2. Eventually we reach the Collect() method
    • The collect timeout starts at the beginning of Collect()
  3. Assuming this is the first scrape, we establish a session; session-less commands are re-sent after the default 1s timeout
  4. Once the session is established, session-based commands begin
  5. Each command is sent with a 2s context, which bounds the attempt end-to-end, including retries (each send times out after 1s, and retries use exponential back-off). If the command still fails, we retry it once with a new session.

Retry if: no response, malformed response, or the completion code is CompletionCodeNodeBusy. Close the session (and possibly re-establish it) if: no response, or a malformed/truncated response.
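
A sketch of that decision logic, using placeholder errors and a hypothetical session interface rather than whatever the bmc library actually returns; only the classification (retry on the same session vs. close and re-establish) is taken from the description above:

```go
package bmcexporter

import (
	"context"
	"errors"
	"time"
)

// Placeholder errors: stand-ins for however the library surfaces a
// truncated/unparseable response and a CompletionCodeNodeBusy completion
// code. They are not real exported values.
var (
	errMalformed = errors.New("malformed response")
	errNodeBusy  = errors.New("node busy")
)

// session is a placeholder for the exporter's per-target session handling.
type session interface {
	Exec(ctx context.Context) error
	Close(ctx context.Context) error
	Reestablish(ctx context.Context) error
}

// execWithRetry classifies a failed command and retries it at most once,
// re-establishing the session first where the failure suggests it has gone bad.
func execWithRetry(ctx context.Context, s session) error {
	err := exec(ctx, s)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errNodeBusy):
		// Node busy: retry the command on the existing session.
		return exec(ctx, s)
	case errors.Is(err, context.DeadlineExceeded), errors.Is(err, errMalformed):
		// No response or a malformed/truncated response: close the session,
		// re-establish it, then retry the command once.
		_ = s.Close(ctx)
		if err := s.Reestablish(ctx); err != nil {
			return err
		}
		return exec(ctx, s)
	default:
		return err
	}
}

// exec runs a single command attempt under a 2s end-to-end budget.
func exec(ctx context.Context, s session) error {
	cmdCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return s.Exec(cmdCtx)
}
```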

gebn commented 5 years ago

Given the suspected buffer implementation, it would likely affect session-less commands as well as those sent within a session, so the timeout would have to be applied to both. We need to try spamming an affected BMC with session-less commands to verify the behaviour; a rough sketch of that follows.
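
A rough sketch of such a test; send is a caller-supplied stand-in for issuing one session-less command (e.g. Get Channel Authentication Capabilities), as this sketch does not assume any particular bmc library API:

```go
package bmcexporter

import (
	"context"
	"log"
	"time"
)

// spamSessionless repeatedly issues a session-less command against addr and
// logs each outcome, so any degradation over time is easy to spot. send wraps
// whatever session-less transport the library provides.
func spamSessionless(ctx context.Context, addr string,
	send func(ctx context.Context, addr string) ([]byte, error)) {
	for i := 0; ctx.Err() == nil; i++ {
		cmdCtx, cancel := context.WithTimeout(ctx, time.Second)
		rsp, err := send(cmdCtx, addr)
		cancel()
		if err != nil {
			// A quick failure here mirrors the errors seen within sessions.
			log.Printf("request %d failed: %v", i, err)
		} else {
			// Log the raw bytes so garbage values are easy to spot by eye.
			log.Printf("request %d: % x", i, rsp)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```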