OpenNMS / opennms-pagerduty-plugin

OpenNMS <-> PagerDuty
6 stars 4 forks source link

OpenNMS PagerDuty Plugin CircleCI

arch

Overview

This plugin allows you to create PagerDuty incidents that correspond to alarms in OpenNMS by integrating with the Events API v2.

OpenNMS alarms map naturally to incidents in PagerDuty:

This plugin is compatible with OpenNMS Horizon 26.1.3 or higher.

Usage

Download the plugin's .kar file into your OpenNMS deploy directory i.e.:

sudo wget https://github.com/OpenNMS/opennms-pagerduty-plugin/releases/download/v0.1.3/opennms-pagerduty-plugin.kar -P /opt/opennms/deploy/

Configure the plugin to be installed when OpenNMS starts:

echo 'opennms-plugins-pagerduty wait-for-kar=opennms-pagerduty-plugin' | sudo tee /opt/opennms/etc/featuresBoot.d/pagerduty.boot

Access the Karaf shell and install the feature manually to avoid having to restart:

feature:install opennms-plugins-pagerduty

Configure global options (affects all services for this instance):

config:edit org.opennms.plugins.pagerduty
property-set client OpenNMS
property-set alarmDetailsUrlPattern 'http://"YOUR-OPENNMS-IP-ADDRESS"/opennms/alarm/detail.htm?id=%d'
config:update

Use the IP address of your OpenNMS server (e.g., 127.0.0.1:8980).

Configure services:

config:edit --alias core --factory org.opennms.plugins.pagerduty.services
property-set routingKey "YOUR-INTEGRATION-KEY-HERE"
config:update

Use the value of the "Integration Key" as the routingKey in the service integrations. By default, you will receive notifications for all alarms. Use a JEXL expression to filter the types of notifcations you receive. For example,property-set jexlFilter 'alarm.reductionKey =~ ".*trigger.*"' will forward only alarms with the label "trigger" to PagerDuty.

The plugin supports handling multiple services simultaneously - use a different alias for each of these when configuring.

Alarm filtering

We currently only support JEXL expressions for controlling which alarms get forwarded to a given service.

You can use the opennms-pagerduty:eval-jexl command to help test expressions before committing them to configuration i.e.:

admin@opennms> opennms-pagerduty:eval-jexl 'alarm.reductionKey =~ ".*trigger.*"'
No alarms matched (out of 1 alarms.)
admin@opennms> opennms-pagerduty:eval-jexl 'alarm.reductionKey =~ ".*"'
MATCHED: ImmutableAlarm{reductionKey='uei.opennms.org/devjam/2020/minecraft/playerEnteredZone:mousebar:x_intercept', id=109, node=null, managedObjectInstance='null', managedObjectType='zone', type=null, severity=WARNING, attributes={}, relatedAlarms=[], logMessage='x_intercept has entered mousebar.', description='x_intercept has entered mousebar.', lastEventTime=2020-07-30 16:31:26.904, firstEventTime=2020-07-30 16:31:26.904, lastEvent=ImmutableDatabaseEvent{uei='uei.opennms.org/devjam/2020/minecraft/playerEnteredZone', id=195, parameters=[ImmutableEventParameter{name='zone', value='mousebar'}, ImmutableEventParameter{name='player', value='x_intercept'}]}, acknowledged=false}

JEXL Expression Examples

The OpenNMS PagerDuty integration leverages Apache Commons JEXL Syntax to allow filtering the alarms that get passed to PagerDuty.

Each expression will have a single alarm variable set, which is an Alarm object with details about the alarm. If the expression evaluates to true, then a PagerDuty event is created for this alarm.

All alarms for a nodes with "Test" and "Servers" categories assigned
admin@opennms> property-set jexlFilter '"Servers" =~ alarm.node.categories and "Test" =~ alarm.node.categories'

This leverages the =~ operator to mean "'Servers' is in the alarm's node's categories" and "'Test' is in the alarm's node's categories".

Excluding some alarms from the above

admin@opennms> property-set jexlFilter '"Servers" =~ alarm.node.categories and "Test" =~ alarm.node.categories and alarm.reductionKey !~ "^uei\.opennms\.org/generic/traps/SNMP_Authen_Failure:.*"'

This leverages the !~ operator to mean "the alarm reduction key does not match the given regex", in addition to the "in" sense of =~ as shown above.

Only Alarms That Can Auto-Resolve

admin@opennms> property-set jexlFilter '"Servers" =~ alarm.node.categories and "Test" =~ alarm.node.categories and alarm.type.name == "PROBLEM"'

This limits to only alarms for certain categories of nodes that have a resolution. Some alarms have no "clearing" event, so they would stay present in PagerDuty forever unless manual action is taken, or certain special configuration is used within PagerDuty to expire the events.

Hold-Down Timer (Delayed Notifications)

Some alarms may quick resolve themselves, especially occasional brief outages from the OpenNMS service pollers.

To be able to get the full benefits of the OpenNMS Downtime Model, you can specify a hold-down timer of at least a few minutes, to trade off instant notification of issues for reduced false positives (issues that resolved themselves before you were able to look at them).

Similar to configuring the jexlFilter above, you can edit a specific service's configuration. To find the specific configuration to edit, use config:list as shown below, then use config:edit to edit the Pid of that specific service's configuration:

admin@opennms> config:list '(service.factoryPid=org.opennms.plugins.pagerduty.services)'
----------------------------------------------------------------
Pid:            org.opennms.plugins.pagerduty.services.bbc99bb4-bc56-4d56-b35c-14066b6e2dcf
FactoryPid:     org.opennms.plugins.pagerduty.services
BundleLocation: ?
Properties:
   felix.fileinstall.filename = file:/opt/opennms/etc/org.opennms.plugins.pagerduty.services-test-servers.cfg
   jexlFilter = "Servers" =~ alarm.node.categories and "Test" =~ alarm.node.categories and alarm.type.name == "PROBLEM"
   routingKey = YOUR-INTEGRATION-KEY-HERE
   service.factoryPid = org.opennms.plugins.pagerduty.services
   service.pid = org.opennms.plugins.pagerduty.services.bbc99bb4-bc56-4d56-b35c-14066b6e2dcf
admin@opennms> config:edit org.opennms.plugins.pagerduty.services.bbc99bb4-bc56-4d56-b35c-14066b6e2dcf
admin@opennms> property-set holdDownDelay "PT5M"
admin@opennms> config:update

The holdDownDelay property should be a string that follows ISO-8601 duration format, as supported by java.time.Duration.parse().

Handling notification failures

In cases where forwarding an alarm to PagerDuty fails, the plugin will generate a uei.opennms.org/pagerduty/sendEventFailed locally that will trigger an alarm. The event contains details on the error that occurred. You could use it to trigger alternate notification mechanisms.

Developing

Build and install the plugin into your local Maven repository using:

mvn clean install

OpenNMS normally runs as root, so make sure the artifacts are installed in /root/.m2 or try making /root/.m2 symlink to your user's repository

From the OpenNMS Karaf shell:

feature:repo-add mvn:org.opennms.plugins/pagerduty-karaf-features/1.0.0-SNAPSHOT/xml
feature:install opennms-plugins-pagerduty

Update automatically:

bundle:watch *

Debugging

Add the following lines to $OPENNMS_HOME/etc/org.ops4j.pax.logging.cfg:

# PagerDuty plugin
log4j2.logger.pd-plugin.name = org.opennms.integrations.pagerduty
log4j2.logger.pd-plugin.level = DEBUG
log4j2.logger.pd-client.name = org.opennms.pagerduty
log4j2.logger.pd-client.level = DEBUG