elastic / ecs

Elastic Common Schema
https://www.elastic.co/what-is/ecs
Apache License 2.0

Bare Metal Chassis & Hardware Data #1029

Open pfk5000 opened 3 years ago

pfk5000 commented 3 years ago

Summary

Looking for a way to collect chassis and hardware metrics on bare metal servers. Ideally, this solution would integrate with the Elastic Stack by using an existing Beat, so the data is as homogeneous as possible.

Motivation

My team is currently experimenting with Metricbeat to monitor bare metal servers in our private cloud. The "add_host_metadata" processor adds some useful information about the OS, kernel, IP address, and MAC address. What we're looking for next are metrics about the underlying chassis and hardware, specifically manufacturer, model, HBA WWN, power supply health, fan speed, temperature, etc. We would really like a solution that uses Beats, so the data meshes easily with our other host metadata.

Detailed Design

Our goal is to collect the following:

system.manufacturer: "Dell Inc."
system.model: "PowerEdge R740"
system.serialnumber: "AB12345"
bios.version: "2.2.11"

The above metrics can be retrieved from within the OS by decoding the DMI table (SMBIOS) using dmidecode, or something similar. When a problem is detected, identifying the hardware type and serial number is critical to responding quickly.
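As a rough illustration of the dmidecode approach, the sketch below maps the four fields listed above to `dmidecode -s` string keywords. The function names and the injectable `run` parameter are hypothetical conveniences, and the script assumes dmidecode is installed and runs with enough privileges to read the DMI table.

```python
import subprocess

# Field names from the issue, mapped to dmidecode string keywords.
DMI_KEYWORDS = {
    "system.manufacturer": "system-manufacturer",
    "system.model": "system-product-name",
    "system.serialnumber": "system-serial-number",
    "bios.version": "bios-version",
}


def run_dmidecode(keyword):
    """Run `dmidecode -s <keyword>` and return its raw output (usually needs root)."""
    result = subprocess.run(
        ["dmidecode", "-s", keyword],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def collect_dmi_fields(run=run_dmidecode):
    """Return a flat dict of the fields above; `run` is injectable for testing."""
    fields = {}
    for field, keyword in DMI_KEYWORDS.items():
        # Keep the first non-empty line that is not a '#' comment line.
        lines = [
            line.strip()
            for line in run(keyword).splitlines()
            if line.strip() and not line.startswith("#")
        ]
        if lines:
            fields[field] = lines[0]
    return fields
```

Calling `collect_dmi_fields()` on a real host should return a dictionary shaped like the example fields above, which could then be shipped as custom fields.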

sensor.ps1.status: "OK"
sensor.ps1.inputpower: "116"
sensor.ps1.temperature: "21"
sensor.fan1.status: "OK"
sensor.fan1.rpm: "4116"
sensor.hdd1.status: "OK"
sensor.chassis.event: "0x0"
sensor.chassis.airflow: "40.0"
sensor.BB.temp: "26"

The above metrics can also be retrieved from within the OS by calling the system management interface (iDRAC/IPMI/iLO). This data can help our organization proactively detect potential anomalies before they become system failures.
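To make the sensor idea concrete, here is a minimal parsing sketch. It assumes the management tool prints pipe-delimited rows of the form `name | reading | status`; the sample text is illustrative only, and real output varies by vendor and tool, so the column layout is an assumption rather than a documented format.

```python
# Illustrative sample only; not captured from a real system.
SAMPLE = """\
Fan1 RPM         | 4116 RPM          | ok
Inlet Temp       | 21 degrees C      | ok
PS1 Status       | 0x01              | ok
"""


def parse_sensor_table(text):
    """Parse 'name | reading | status' rows into a dict keyed by sensor name."""
    readings = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) < 3 or not parts[0]:
            continue  # skip blank or malformed rows
        name, raw, status = parts[0], parts[1], parts[2]
        tokens = raw.split()
        try:
            value = float(tokens[0]) if tokens else None
        except ValueError:
            value = None  # non-numeric reading such as '0x01' or 'na'
        unit = " ".join(tokens[1:]) if value is not None else ""
        readings[name] = {"value": value, "unit": unit, "raw": raw, "status": status}
    return readings
```

Non-numeric readings are kept in `raw` so that discrete status codes are not lost when the numeric conversion fails.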

Use Case

With sufficient data, we can leverage the Elastic Stack to detect when fan speed or power supply voltage is out of specification. Although the system controller may not yet flag the component as "failed", Elastic's ML capabilities may predict an imminent failure.
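The simplest form of the out-of-specification check can be sketched as a static range comparison. This is a stand-in for illustration, not Elastic's ML anomaly detection, and the limits below are made-up placeholder values rather than vendor specifications.

```python
# Hypothetical (low, high) limits per field; real limits come from the vendor.
SPEC_LIMITS = {
    "sensor.fan1.rpm": (2000.0, 12000.0),
    "sensor.ps1.inputpower": (50.0, 750.0),
    "sensor.ps1.temperature": (5.0, 45.0),
}


def out_of_spec(readings, limits=SPEC_LIMITS):
    """Return the fields whose numeric value falls outside its allowed range."""
    violations = {}
    for field, (low, high) in limits.items():
        if field not in readings:
            continue
        value = float(readings[field])  # the issue's examples are strings
        if not low <= value <= high:
            violations[field] = value
    return violations
```

With the example values from this issue (`"4116"`, `"116"`, `"21"`) and these placeholder limits, nothing is flagged; a fan reading of `"900"` would be.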

webmat commented 3 years ago

Hello! There are a few aspects to your question.

Since ECS is the schema behind a lot of our solutions, my take on your request would be to make sure you capture this information in custom fields that are named so they do not conflict with future versions of ECS. Here's our documentation about this: https://www.elastic.co/guide/en/ecs/current/ecs-custom-fields-in-ecs.html
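One way to keep custom fields clear of future ECS names is to nest them under a dedicated top-level namespace. The sketch below does this for flat dotted field names; the `acme_hw` namespace and the helper name are hypothetical, and the linked documentation remains the authority on naming.

```python
def nest_under_namespace(flat_fields, namespace="acme_hw"):
    """Turn {'system.model': 'X'} into {'acme_hw': {'system': {'model': 'X'}}}."""
    doc = {}
    for dotted, value in flat_fields.items():
        node = doc.setdefault(namespace, {})
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # create intermediate objects as needed
        node[leaf] = value
    return doc
```

For example, `nest_under_namespace({"bios.version": "2.2.11"})` yields a document whose only top-level key is the custom namespace, leaving standard fields such as `host.*` untouched.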

I suspect, however, that what you're asking for is a pre-built agent that could help you capture this without doing all of the work yourself. For this you'd have to open an issue on the Beats repository, not the ECS repository.

However, here's another thing you could experiment with until you get an answer from the Metricbeat folks:

Filebeat has an osquery module, which simplifies collecting all of the information osquery can collect from across your fleet into the Elastic Stack. So you could look at the osquery module here to get started with this :-)
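For a quick local experiment before wiring up the module, something like the following could pull the hardware identity out of osquery. It is a sketch that assumes `osqueryi` is on the PATH and that its `system_info` table exposes the `hardware_vendor`, `hardware_model`, and `hardware_serial` columns; the target field names are the ones proposed earlier in this issue, not ECS fields.

```python
import json
import subprocess

QUERY = "SELECT hardware_vendor, hardware_model, hardware_serial FROM system_info;"


def run_osqueryi(query):
    """Run a one-off query with `osqueryi --json` and return its JSON output."""
    result = subprocess.run(
        ["osqueryi", "--json", query],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def hardware_identity(run=run_osqueryi):
    """Map the first result row to the field names used in this issue."""
    rows = json.loads(run(QUERY))
    if not rows:
        return {}
    row = rows[0]
    return {
        "system.manufacturer": row.get("hardware_vendor"),
        "system.model": row.get("hardware_model"),
        "system.serialnumber": row.get("hardware_serial"),
    }
```

The `run` parameter exists only so the mapping can be exercised without osquery installed.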

Getting back to the schema perspective, when you do have something going with either Filebeat/osquery or Metricbeat, I do recommend you keep using the add_host_metadata processor, as it automatically collects a baseline of information in the standard fields. This way you'll easily be able to pivot between your detailed custom host information and other security or observability resources via standard fields such as host.name and host.ip, filter by host.os.*, and so on.