Designing a monitoring and reporting system for ESI nodes

hakasapl commented 3 months ago

In parallel with NERC, we want to monitor and alert on metrics for all ESI/OCT nodes, using data from PDUs as well as IPMI.

The purpose of this issue is to outline requirements and design a plan.

Requirements:

Per-node level power reporting (we can only get this through IPMI, since the PDUs only give per-chassis readings)
Per-chassis level power reporting (PDU SNMP + IPMI)
Per-switch power reporting (PDU SNMP)
Per-rack power reporting (MGHPCC, PDU SNMP aggregation)
Nothing running in-band for performance reasons
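
To make the reporting levels above concrete, here is a minimal sketch of how out-of-band power readings could be rolled up by chassis and by rack. The `Reading` fields, the label values, and the `sum_by` helper are all hypothetical names chosen for illustration; nothing here reflects an existing NERC or ESI schema.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, NamedTuple


class Reading(NamedTuple):
    """One out-of-band power reading (field names are assumptions)."""
    rack: str      # rack label, e.g. "rack1"
    chassis: str   # chassis label within the rack
    device: str    # node or switch name
    watts: float   # instantaneous power draw in watts


def sum_by(readings: Iterable[Reading],
           key: Callable[[Reading], Hashable]) -> Dict[Hashable, float]:
    """Sum watts over all readings that share the same grouping key."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for r in readings:
        totals[key(r)] += r.watts
    return dict(totals)


if __name__ == "__main__":
    readings = [
        Reading("rack1", "chassis-a", "node1", 200.0),
        Reading("rack1", "chassis-a", "node2", 250.0),
        Reading("rack1", "chassis-b", "switch1", 100.0),
    ]
    # Per-chassis totals, keyed by (rack, chassis)
    print(sum_by(readings, lambda r: (r.rack, r.chassis)))
    # Per-rack totals, keyed by rack
    print(sum_by(readings, lambda r: r.rack))
```

In practice this kind of aggregation would more likely be done with label-based queries in Prometheus; the sketch only shows the grouping that the labels would need to support.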

Unknown Requirements:

Sampling rate requirements?
Accuracy/Precision requirements?

Proposed Solution:

Make use of the NERC Prometheus for logging all metrics from IPMI on each node and chassis, and from the PDUs in each rack.
Make use of the NERC front-end grafana/dashboard for ESI nodes
Replace any non-POPS PDUs with POPS PDUs (approved by @waygil already)
- MOC racks currently don't all have POPS PDUs. BU research computing already does PDU monitoring, but we don't yet, even though we are in BU pods.
Create a different OCT front-end that uses the same back-end DB, since OCT doesn't use the same auth and project scheme that NERC does

Initial slack discussion:

HeidiPD
  1 hour ago
Wait!  John Goodhue said that he had this data for MGHPCC racks already.   Are you telling us we don't have the right kind of PDU in any of our MOC racks!??

HeidiPD
  [1 hour ago](https://massopencloud.slack.com/archives/CG1C4SZCM/p1718381751671999?thread_ts=1717774989.472689&cid=CG1C4SZCM)
there is already lots of evidence that ipmitool affects the servers too much

HeidiPD
  1 hour ago
We can add any ESI node (or PDU) we want into the prometheus we use for NERC.   Just need to label it appropriately and pull in the data.

HeidiPD
  [1 hour ago](https://massopencloud.slack.com/archives/CG1C4SZCM/p1718381782026589?thread_ts=1717774989.472689&cid=CG1C4SZCM)
Jonathan's experiments don't have to run on ESI nodes, if we have nodes in the clusters that have the right PDUs.   Also, how soon are we going to fix our PDUs??!!

HeidiPD
  [1 hour ago](https://massopencloud.slack.com/archives/CG1C4SZCM/p1718381855301589?thread_ts=1717774989.472689&cid=CG1C4SZCM)
(John told me he had per-server, not per-rack data available.)

HeidiPD
  1 hour ago
I don't think you can get per-node though

HeidiPD
  1 hour ago
(without running an extra tool)

Hakan Saplakoglu
:speech_balloon:  35 minutes ago
@HeidiPD
 MGHPCC has per-rack data for every MOC rack from the power consumption data off the bus bars. the PDUs themselves are managed by the tenants so I’m not sure what John meant but I don’t see any way he could have per-server data (unless he was talking about a specific MGHPCC test rack or something). Per-node data is possible through PDUs, but the facility doesn’t control the PDUs or connect anything other than the power bus bars to it. That’s left to the tenants.
ipmitool is a problem because you can do basically anything unauthenticated. But I don’t think doing purely get commands to read ipmi sensor data will cause any issues - is there something I’m missing about that?
We currently have 1 ESI rack full of the usual config of servers with the correct PDUs for power reporting. I wanted to get this rack running in prometheus/grafana before I started work on replacing the PDUs for the remaining racks. Does that match with your timeline?

HeidiPD
  8 minutes ago
So let's not design this in slack.  Also, I don't see why you are adding ESI racks to a separate prometheus/grafana.  In fact, once an ESI rack is leased to someone else, we should not be running any monitoring on the rack without the owner's knowledge and agreement.   Lots of people who get ESI servers will be doing performance measurement, and this is a factor if they are doing detailed measurements.

HeidiPD
  [8 minutes ago](https://massopencloud.slack.com/archives/CG1C4SZCM/p1718385784205689?thread_ts=1717774989.472689&cid=CG1C4SZCM)
I think this came up already in an MOC meeting, so let's write down a plan this time.

HeidiPD
  7 minutes ago
I reviewed what John Goodhue told me, which was that for rack-level they have a redis cache with this info:  "We have for every rack:
  average power at 8-hour intervals since the beginning of time.
  peak power at 8 hour intervals from ~2015 forward
  A redis cache with measurements at 15 second intervals for the past 24 hours
    (I may have the retention time wrong, but that’s the approximate size)"

HeidiPD
  6 minutes ago
They say we can mirror this cache if we want to do that.

HeidiPD
  4 minutes ago
For server-level measurements, he suggested collecting the data via the ethernet port on the PDU, or the iDRAC port on the server.   He said that the BU research computing group already monitors all BU PDUs.   I didn't get a chance to track down who at BU was doing that monitoring already, but we should find out if any of it is happening on our racks.

HeidiPD
  3 minutes ago
@Hakan Saplakoglu
 please start an issue for this, if we don't have one already (which I thought we did).

HeidiPD
  2 minutes ago
Some of the sustainability experiments monitor both energy use and performance simultaneously, so that is the concern about monitoring having an undesired impact.

Hakan Saplakoglu
:speech_balloon:  2 minutes ago
I am adding the PDUs and idrac data to prometheus, nothing is running in-band. Let me figure out the issue

Hakan Saplakoglu
:speech_balloon:  [2 minutes ago](https://massopencloud.slack.com/archives/CG1C4SZCM/p1718386132470209?thread_ts=1717774989.472689&cid=CG1C4SZCM)
I think there is a lot of miscommunication going on here, hopefully we can iron this out in the issue

hpdempsey commented 3 months ago

Sampling rate and accuracy requirements are going to vary depending on the use case. I think Jonathan Appavoo and Han Dong will have the strictest requirements, so let's collect those. We may not want to collect at that level for all machines. Ideally, the sampling interval should be configurable.
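
As a rough illustration of a configurable sampling interval, the sketch below keeps one default interval and lets individual nodes override it. The class name, the 60-second default, and the node names are assumptions made for the example, not agreed values.

```python
from dataclasses import dataclass, field
from typing import Dict

SECONDS_PER_DAY = 86_400


@dataclass
class SamplingConfig:
    """Default sampling interval plus per-node overrides (all values assumed)."""
    default_interval_s: int = 60                               # hypothetical default
    overrides: Dict[str, int] = field(default_factory=dict)    # node name -> seconds

    def interval_for(self, node: str) -> int:
        """Return the interval for a node, falling back to the default."""
        interval = self.overrides.get(node, self.default_interval_s)
        if interval <= 0:
            raise ValueError(f"sampling interval for {node} must be positive")
        return interval


def samples_per_day(interval_s: int) -> int:
    """Number of samples collected per device per day at a given interval."""
    return SECONDS_PER_DAY // interval_s


if __name__ == "__main__":
    # "experiment-node" is a made-up name for a node with stricter requirements.
    cfg = SamplingConfig(overrides={"experiment-node": 1})
    for node in ("experiment-node", "ordinary-node"):
        interval = cfg.interval_for(node)
        print(node, interval, samples_per_day(interval))
```

The per-day count is useful for sizing storage: a 15-second interval, for example, works out to 5,760 samples per device per day.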

naved001 commented 3 months ago

As an ESI hardware administrator, it would be great to get alerts about:

Drives that have failed or are about to fail
Memory failures
PSU failures
Fan failures
High Temperatures (system, CPU)

We can get all of this and more from iDRAC, all out-of-band.
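
A minimal sketch of how the alert conditions listed above could be evaluated once the health data has been collected out-of-band. The `HealthReading` fields, the status strings, and the 85 C limit are hypothetical; real iDRAC data would need to be mapped into this shape, and the thresholds would need to come from the hardware specs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

TEMP_LIMIT_C = 85.0  # hypothetical threshold, not a vendor-recommended value


@dataclass(frozen=True)
class HealthReading:
    """One component health reading (field names and values are assumptions)."""
    node: str
    component: str                 # e.g. "drive", "memory", "psu", "fan", "cpu_temp"
    status: str                    # "ok", "warning" (predicted failure), or "critical"
    value: Optional[float] = None  # temperature in Celsius for *_temp components


def alerts(readings: Iterable[HealthReading]) -> List[str]:
    """Return one message per reading that is unhealthy or over temperature."""
    messages = []
    for r in readings:
        if r.status != "ok":
            messages.append(f"{r.node}: {r.component} is {r.status}")
        elif (r.component.endswith("_temp")
              and r.value is not None
              and r.value > TEMP_LIMIT_C):
            messages.append(
                f"{r.node}: {r.component} at {r.value:.0f} C exceeds {TEMP_LIMIT_C:.0f} C"
            )
    return messages


if __name__ == "__main__":
    sample = [
        HealthReading("node1", "drive", "warning"),
        HealthReading("node1", "psu", "ok"),
        HealthReading("node2", "cpu_temp", "ok", 90.0),
    ]
    for message in alerts(sample):
        print(message)
```

If the data lands in Prometheus, the same conditions would more naturally be expressed as alerting rules there; the sketch only spells out the logic.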

hakasapl commented 2 months ago

Closing this for now, I think we have a good plan

CCI-MOC / ops-issues

Designing a monitoring and reporting system for ESI nodes #1326