bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

check_redfish.py

This is a monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.

NetBox import support

You can also import the inventory files into NetBox using netbox-sync.

Requirements

Installation

RedHat based OS

Icinga2 and Grafana

Command definitions and a service config example for Icinga2 can be found in contrib. There is also an InfluxDB dashboard for some metrics included.

HELP

usage: check_redfish.py [-H HOST] [-u USERNAME] [-p PASSWORD] [-f AUTHFILE]
                        [--sessionfile SESSIONFILE]
                        [--sessionfiledir SESSIONFILEDIR] [--sessionlock]
                        [--nosession] [-h] [-w WARNING] [-c CRITICAL] [-v]
                        [-d] [-m MAX] [-r RETRIES] [-t TIMEOUT]
                        [--log_exclude LOG_EXCLUDE] [--ignore_missing_ps]
                        [--enable_bmc_security_warning] [--storage] [--proc]
                        [--memory] [--power] [--temp] [--fan] [--nic] [--bmc]
                        [--info] [--firmware] [--sel] [--mel] [--all] [-i]
                        [--inventory_id INVENTORY_ID]
                        [--inventory_name INVENTORY_NAME]
                        [--inventory_file INVENTORY_FILE]

This is a monitoring/inventory plugin to check components and
health status of systems which support Redfish.
It will also create a inventory of all components of a system.

R.I.P. IPMI

Version: 1.8.1 (2024-10-22)

mandatory arguments:
  -H HOST, --host HOST  define the host to request. To change the port just
                        add ':portnumber' to this parameter

authentication arguments:
  -u USERNAME, --username USERNAME
                        the login user name
  -p PASSWORD, --password PASSWORD
                        the login password
  -f AUTHFILE, --authfile AUTHFILE
                        authentication file with user name and password
  --sessionfile SESSIONFILE
                        define name of session file
  --sessionfiledir SESSIONFILEDIR
                        define directory where the plugin saves session files
  --sessionlock         prevents multiple sessions and locks the session file
                        when connecting
  --nosession           Don't establish a persistent session and log out after
                        check is finished

optional arguments:
  -h, --help            show this help message and exit
  -w WARNING, --warning WARNING
                        set warning value
  -c CRITICAL, --critical CRITICAL
                        set critical value
  -v, --verbose         this will add all https requests and responses to
                        output, also adds inventory source data to all
                        inventory objects
  -d, --detailed        always print detailed result
  -m MAX, --max MAX     set maximum of returned items for --sel or --mel
  -r RETRIES, --retries RETRIES
                        set number of maximum retries (default: 3)
  -t TIMEOUT, --timeout TIMEOUT
                        set number of request timeout per try/retry (default:
                        7)
  --log_exclude LOG_EXCLUDE
                        a comma separated list of log lines (regex) to exclude
                        from log status checks (--sel, --mel)
  --ignore_missing_ps   ignore the fact that no power supplies are present and
                        report the status of the power subsystem
  --enable_bmc_security_warning
                        return status WARNING if BMC security issues are
                        detected (HPE iLO only)

query status/health information (at least one is required):
  --storage             request storage health
  --proc                request processor health
  --memory              request memory health
  --power               request power supply health
  --temp                request temperature sensors status
  --fan                 request fan status
  --nic                 request network interface status
  --bmc                 request bmc info and status
  --info                request system information
  --firmware            request firmware information
  --sel                 request System Log status
  --mel                 request Management Processor Log status
  --all                 request all of the above information at once

query inventory information (no health check):
  -i, --inventory       return inventory in json format instead of regular
                        plugin output
  --inventory_id INVENTORY_ID
                        set an ID which can be used to identify this host in
                        the destination inventory
  --inventory_name INVENTORY_NAME
                        set a name which can be used to identify this host in
                        the destination inventory
  --inventory_file INVENTORY_FILE
                        set file to write the inventory output to. Otherwise
                        stdout will be used.

General usage

Multiple request commands can be combined, or use --all to query all system information at once.

Let's start with an example

/usr/lib64/nagios/plugins/check_redfish/check_redfish.py -H 10.0.0.23 -f /etc/icinga2/ilo_credentials --storage --power
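If the plugin is called from a script rather than from a monitoring system, the command line shown above could be assembled programmatically. This is a minimal sketch: `build_command` and its parameters are hypothetical helper names and not part of the plugin. Only flags listed in the help output are used.

```python
from typing import List, Sequence

# Query flags copied from the plugin's help output.
QUERY_FLAGS = {"storage", "proc", "memory", "power", "temp", "fan",
               "nic", "bmc", "info", "firmware", "sel", "mel", "all"}

# Install path taken from the example above; adjust for your system.
PLUGIN = "/usr/lib64/nagios/plugins/check_redfish/check_redfish.py"


def build_command(host: str, authfile: str, queries: Sequence[str],
                  plugin: str = PLUGIN) -> List[str]:
    """Return an argv list for check_redfish.py (hypothetical helper)."""
    if not queries:
        # The help output states that at least one query option is required.
        raise ValueError("at least one query option is required")
    unknown = [q for q in queries if q not in QUERY_FLAGS]
    if unknown:
        raise ValueError(f"unknown query option(s): {unknown}")
    command = [plugin, "-H", host, "-f", authfile]
    command.extend(f"--{q}" for q in queries)
    return command


command = build_command("10.0.0.23", "/etc/icinga2/ilo_credentials",
                        ["storage", "power"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues.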

Alternative HTTPS port

If you want to use a port other than 443, add the port to the host parameter.
Example for port 8443:

-H 127.0.0.1:8443

Authentication

Credentials can be provided in 3 ways and will be checked in the following order:

Authentication file

An authentication credential file can be provided. The structure looks like this:

username=icinga
password=readonlysecret
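To illustrate the file structure, the following sketch reads such `key=value` lines into a dictionary. It is not the plugin's own parser: `parse_authfile` is a hypothetical name, and skipping blank lines and `#` comments is an assumption made for this example.

```python
from typing import Dict


def parse_authfile(text: str) -> Dict[str, str]:
    """Parse 'key=value' lines (illustrative, not the plugin's parser)."""
    credentials: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so a password may contain '='.
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line
        credentials[key.strip()] = value.strip()
    return credentials


sample = "username=icinga\npassword=readonlysecret\n"
credentials = parse_authfile(sample)
print(credentials["username"])
```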

Environment variables

These two environment variables will be checked

Sessions and session resumption

To avoid a login delay on every request and to avoid flooding the event log with login/logout messages, session resumption was implemented. If the session in the BMC has expired, a new session and session file will be created.

IMPORTANT
To actually benefit from this feature you need to set the user session timeout in the BMC to a higher value than your default check interval!

If your default check interval is 5 minutes then the session timeout in the BMC should be at least 6 minutes!

No Session

If no persistent session is required (e.g. for testing or inventory collection), add --nosession to properly close the session on the BMC after the check.

Session file name and location

By default a session file will be created in the system/user default temp path. These defaults can be changed with the following options:

Use --sessionfiledir to define where the session files should be stored. Use --sessionfile to specify the name of the session file for this particular system.

Session lock file

To prevent the race condition of one monitoring instance creating multiple sessions, use --sessionlock.

Example

Options like this:

--sessionfiledir /var/plugin/tmp --sessionfile my-hostname.session

result in the following session file:

/var/plugin/tmp/my-hostname.session
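The way the two options combine into a path can be sketched as follows. `session_file_path` is a hypothetical helper, and using `tempfile.gettempdir()` as the fallback directory is an assumption about what "default temp path" means; the plugin's default file name is not covered here.

```python
import tempfile
from pathlib import PurePosixPath
from typing import Optional


def session_file_path(name: str, directory: Optional[str] = None) -> str:
    """Join session directory and file name (illustrative only)."""
    # Assumption: without --sessionfiledir the system temp directory is used.
    base = directory if directory is not None else tempfile.gettempdir()
    return str(PurePosixPath(base) / name)


print(session_file_path("my-hostname.session", "/var/plugin/tmp"))
```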

WARNING and CRITICAL (health checks only)

You can use warning and critical with the following commands:

--mel and --sel (values are passed as "days")
Define after how many days event log entries with a severity other than OK should no longer trigger an alert. On most systems it is not possible to mark management event log entries as cleared, so entries with a severity of warning would alarm forever. This way they change state as they age.

These settings do NOT apply to HPE iLO "Integrated Management Logs" as these support a "repaired" option to be set.

Example: --mel --critical 1 --warning 3
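One possible reading of the example above is sketched below. The exact rules the plugin applies are not spelled out here, so the logic is an assumption: a critical entry stays CRITICAL while it is younger than the critical threshold, any non-OK entry is reported as WARNING while it is younger than the warning threshold, and after that it counts as OK. `aged_status` is a hypothetical name.

```python
from datetime import datetime, timedelta


def aged_status(severity: str, entry_time: datetime, now: datetime,
                critical_days: int, warning_days: int) -> str:
    """Downgrade a log entry's status by age (assumed interpretation)."""
    severity = severity.upper()
    if severity == "OK":
        return "OK"
    age = now - entry_time
    if severity == "CRITICAL" and age < timedelta(days=critical_days):
        return "CRITICAL"
    if age < timedelta(days=warning_days):
        return "WARNING"
    return "OK"  # old enough to stop alerting


now = datetime(2022, 3, 7, 10, 0)
# Corresponds to: --mel --critical 1 --warning 3
for when in (datetime(2022, 3, 7, 0, 0),    # 10 hours old
             datetime(2022, 3, 5, 10, 0),   # 2 days old
             datetime(2022, 3, 2, 10, 0)):  # 5 days old
    print(when.isoformat(), aged_status("CRITICAL", when, now, 1, 3))
```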

Detailed (health checks only)

Health status by default will be reported as a summary:

[OK]: All power supplies (2) are in good condition|'ps_1'=122 'ps_2'=109

If multiline output is preferred, add the option --detailed:

[OK]: Power supply 1 (865408-B21) status is: Ok
[OK]: Power supply 2 (865408-B21) status is: Ok|'ps_1'=121 'ps_2'=109
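Both output forms follow the usual monitoring plugin layout: status text, then a `|`, then performance data. A script that consumes the output could separate the two parts as in this sketch. `split_plugin_output` is a hypothetical helper, and the regular expression only covers quoted labels with plain numeric values as seen in the examples above.

```python
import re
from typing import Dict, Tuple

# Matches items such as 'ps_1'=122 (quoted label, numeric value).
PERF_ITEM = re.compile(r"'([^']+)'=([-+]?\d+(?:\.\d+)?)")


def split_plugin_output(output: str) -> Tuple[str, Dict[str, float]]:
    """Split plugin output into status text and performance metrics."""
    text, _, perfdata = output.partition("|")
    metrics = {label: float(value)
               for label, value in PERF_ITEM.findall(perfdata)}
    return text.strip(), metrics


line = ("[OK]: All power supplies (2) are in good condition"
        "|'ps_1'=122 'ps_2'=109")
text, metrics = split_plugin_output(line)
print(text)
print(metrics)
```

The same function works for the --detailed form, because the performance data follows the last status line.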

Debugging

Use the --verbose option to check for connection problems. All Redfish HTTPS requests and responses will be printed.

Max option (health checks only)

This option can be used to limit the number of event log entries returned by --mel and --sel.

Log filter option

With --log_exclude it is possible to define log messages which will be excluded from monitoring. This filter uses regex to match log messages. Multiple filters can be given as a comma-separated list. Use quotes to "escape" messages which include a comma.

Example Usage:

--log_exclude '"log message, with a comma", another log message, user .* logged in'

Example result:

# ./check_redfish.py '--mel' ...
[CRITICAL]: 2022-03-04T09:48:35-06:00: The iDRAC Service Module communication with iDRAC has ended.
[CRITICAL]: 2022-03-04T09:36:13-06:00: The iDRAC Service Module communication with iDRAC has ended.
[WARNING]: 2022-03-03T09:13:19-06:00: The iDRAC Service Module communication with iDRAC has ended.
[WARNING]: 2022-03-02T15:40:15-06:00: The Integrated NIC 1 Port 1 network link is down.
[WARNING]: 2022-03-02T15:40:15-06:00: The Integrated NIC 1 Port 2 network link is down.
[WARNING]: 2022-03-02T15:40:12-06:00: The iDRAC Service Module communication with iDRAC has ended.
[WARNING]: 2022-03-02T08:16:53-06:00: The iDRAC Service Module communication with iDRAC has ended.

# ./check_redfish.py '--mel' ... --log_exclude "The iDRAC Service Module communication with iDRAC has ended"
[WARNING]: 2022-03-02T15:40:15-06:00: The Integrated NIC 1 Port 1 network link is down.
[WARNING]: 2022-03-02T15:40:15-06:00: The Integrated NIC 1 Port 2 network link is down.

# ./check_redfish.py '--mel' ... --log_exclude 'The iDRAC Service Module communication with iDRAC has ended, network link is down'
[OK]: Manager Event Log contains 2437 OK entries. Most recent notable: [OK]: 2022-03-07T10:00:13-06:00: Successfully logged in using icinga, from 10.1.2.3.
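The filtering shown above can be approximated with the standard library. This is a sketch, not the plugin's implementation: `parse_exclude_list` and `filter_messages` are hypothetical names, and matching with `re.search` (case sensitive, anywhere in the message) is an assumption.

```python
import csv
import re
from typing import Iterable, List


def parse_exclude_list(value: str) -> List[re.Pattern]:
    """Split a comma-separated filter string, honouring double quotes."""
    fields = next(csv.reader([value], skipinitialspace=True), [])
    return [re.compile(field.strip()) for field in fields if field.strip()]


def filter_messages(messages: Iterable[str],
                    patterns: List[re.Pattern]) -> List[str]:
    """Keep only messages that match none of the exclude patterns."""
    return [message for message in messages
            if not any(pattern.search(message) for pattern in patterns)]


messages = [
    "The iDRAC Service Module communication with iDRAC has ended.",
    "The Integrated NIC 1 Port 1 network link is down.",
]
patterns = parse_exclude_list(
    "The iDRAC Service Module communication with iDRAC has ended")
print(filter_messages(messages, patterns))
```

Because each filter is a regular expression, characters such as `.` or `(` in a literal message need escaping if they must match exactly.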

Timeout and Retries

Sometimes an iLO4 BMC can be very slow in answering Redfish requests. To avoid getting "retries exhausted" alarms you can increase the number of retries and/or the timeout. The timeout defines the number of seconds after which each try/retry times out. If you increase these values, make sure to also adjust the check_timeout setting in your Icinga2 service definition. The total runtime of this plugin (if all retries fail) can be calculated like this: (1 initial try + number of retries) * timeout

The default number of retries is 3 and the default timeout is 7 seconds. If all retries fail, the plugin finishes after 28 seconds.

(1 + 3) * 7 = 28
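The same calculation as a small helper, useful for picking a matching check_timeout. `worst_case_runtime` is a hypothetical name; the defaults are the ones stated in the help output.

```python
def worst_case_runtime(retries: int = 3, timeout: int = 7) -> int:
    """Seconds until the plugin gives up if every attempt times out."""
    return (1 + retries) * timeout


print(worst_case_runtime())                       # defaults: 28
print(worst_case_runtime(retries=5, timeout=10))  # 60
```

The check_timeout in the service definition should be larger than this value, otherwise the monitoring system may abort the check before the plugin reports its own result.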

Inventory data

This plugin is able to return an (almost) complete inventory of the queried system. Just add the command option --inventory or -i to get the inventory in JSON format.

IMPORTANT
This is the first official version and might still change later on. If you encounter problems or have suggestions for changes/improvements then please create a GitHub issue.

Example of power supply inventory (--power --inventory)

{
    "inventory": {
        "chassi": [],
        "fan": [],
        "firmware": [],
        "logical_drive": [],
        "manager": [],
        "memory": [],
        "network_adapter": [],
        "network_port": [],
        "physical_drive": [],
        "power_supply": [
            {
                "bay": 1,
                "capacity_in_watt": 500,
                "chassi_ids": [
                    1
                ],
                "firmware": "1.03",
                "health_status": "OK",
                "id": "0",
                "input_voltage": 224,
                "last_power_output": 110,
                "model": "XXXXXX-B21",
                "name": "HpeServerPowerSupply",
                "operation_status": "Enabled",
                "part_number": "XXXXXX-001",
                "serial": "XXXXXXX",
                "type": "AC",
                "vendor": "CHCNY"
            },
            {
                "bay": 2,
                "capacity_in_watt": 500,
                "chassi_ids": [
                    1
                ],
                "firmware": "1.03",
                "health_status": "OK",
                "id": "1",
                "input_voltage": 228,
                "last_power_output": 110,
                "model": "XXXXXX-B21",
                "name": "HpeServerPowerSupply",
                "operation_status": "Enabled",
                "part_number": "XXXXXX-001",
                "serial": "XXXXXXX",
                "type": "AC",
                "vendor": "CHCNY"
            }
        ],
        "processor": [],
        "storage_controller": [],
        "storage_enclosure": [],
        "system": [],
        "temperature": []
    },
    "meta": {
        "data_retrieval_issues": {},
        "duration_of_data_collection_in_seconds": 1.002901,
        "host_that_collected_inventory": "inventory-collector.example.com",
        "inventory_id": null,
        "inventory_name": null,
        "inventory_layout_version": "1.7.0",
        "script_version": "1.7.1",
        "start_of_data_collection": "2024-02-13T19:09:07+02:00"
    }
}
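A consumer of this JSON could summarize the power supplies as in the following sketch. The key names come from the example above; `summarize_power_supplies` is a hypothetical helper, and the embedded sample is a shortened copy of that example.

```python
import json
from typing import List

# Shortened copy of the example inventory above.
SAMPLE = """
{
  "inventory": {
    "power_supply": [
      {"bay": 1, "capacity_in_watt": 500, "health_status": "OK",
       "last_power_output": 110},
      {"bay": 2, "capacity_in_watt": 500, "health_status": "OK",
       "last_power_output": 110}
    ]
  },
  "meta": {"inventory_layout_version": "1.7.0"}
}
"""


def summarize_power_supplies(document: dict) -> List[str]:
    """Return one summary line per power supply in the inventory."""
    supplies = document.get("inventory", {}).get("power_supply", [])
    return [
        f"bay {ps.get('bay')}: {ps.get('health_status')}, "
        f"output {ps.get('last_power_output')}, "
        f"capacity {ps.get('capacity_in_watt')} W"
        for ps in supplies
    ]


lines = summarize_power_supplies(json.loads(SAMPLE))
for line in lines:
    print(line)
```

Using `.get` keeps the sketch tolerant of missing attributes, which matters because the inventory layout might still change between versions.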

Verbose output

In case you need more information or want to debug the data you can add the verbose option. This will also add the source_data attribute for each inventory item.

Inventory attributes

You can find a list of attributes for each item here

Inventory file

It is also possible to use the CLI option --inventory_file to write the inventory data to a file. This way it can be forwarded or used in an inventory import tool. Here you might also want to use --inventory_id to get a fixed reference to an existing object.

Known limitations

Supported Systems

This plugin is currently tested with the following systems:

Hewlett Packard Enterprise

Almost all HPE servers with iLO4 (>=2.50), iLO5 (>=1.40) or iLO6 should work

IMPORTANT:

Models:

Lenovo

Dell

Huawei

Fujitsu

IMPORTANT:

Cisco

Inspur (limited support)

SuperMicro (limited support)

GIGABYTE (limited support)

License

You can check out the full license here

This project is licensed under the terms of the MIT license.