Uninett / nav

Network Administration Visualized
GNU General Public License v3.0

Add support for Juniper CHASSIS and SYSTEM alerts #2358

Status: Closed (lunkwill42 closed this 1 year ago)

lunkwill42 commented 2 years ago

Juniper devices have a concept of alerts, classified into chassis alerts and system alerts.

Their SNMP MIBs support fetching the number of current alerts, but give no details of what the alerts actually are.

The CNaaS team wants NAV to be able to report that alerts have been flagged, but some design work is needed to figure out how this should work in NAV.

Also needed:

hmpf commented 2 years ago

@lunkwill42 I cannot find the word "alert" in the Juniper MIBs, do you mean notification, trap or alarm? If not, where is this described/documented?

lunkwill42 commented 2 years ago

@hmpf I believe you want the JUNIPER-ALARM-MIB. It simply enumerates the number of present "red" alarms and "yellow" alarms in a device. There's a copy of it here: https://github.com/pgmillon/observium/blob/master/mibs/juniper/JUNIPER-ALARM-MIB

You may want to talk to Håvard E for some guidance, as Zino actually employs this MIB (and potentially others)...

hmpf commented 2 years ago

Seems like step one is adding that mib to NAV's own library of mibs then :) Is there a howto for that?

hmpf commented 2 years ago

According to that MIB there are yellow alarms and red alarms, a count for each, whether the status is on, off or other, and a timestamp for when the status last changed. It might be relevant to mark whether these alarms are enabled or not as well (via jnxAlarmRelayMode: other, passOn, cutOff).
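
The values described above could be modelled roughly like this. It is only a sketch: the class and field names are made up for illustration, and the numeric enum values are an assumption that should be checked against the MIB itself before being relied on.

```python
from dataclasses import dataclass
from enum import IntEnum


class AlarmState(IntEnum):
    # Assumed numbering; verify against JUNIPER-ALARM-MIB.
    OTHER = 1
    OFF = 2
    ON = 3


class AlarmRelayMode(IntEnum):
    # Assumed numbering; verify against JUNIPER-ALARM-MIB.
    OTHER = 1
    PASS_ON = 2
    CUT_OFF = 3


@dataclass(frozen=True)
class AlarmReading:
    """State, count and last-change timestamp for one alarm color."""
    state: AlarmState
    count: int
    last_change: int  # timestamp of the last state change, as reported by the device


@dataclass(frozen=True)
class JuniperAlarms:
    """Hypothetical container for everything the MIB exposes."""
    relay_mode: AlarmRelayMode
    yellow: AlarmReading
    red: AlarmReading

    @property
    def has_alarms(self) -> bool:
        # True if either color reports a nonzero count.
        return self.yellow.count > 0 or self.red.count > 0
```

Keeping the relay mode next to the counts makes it possible to decide later whether a nonzero count should be reported when the alarms are not enabled.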

lunkwill42 commented 2 years ago

> Seems like step one is adding that mib to NAV's own library of mibs then :) Is there a howto for that?

Closest thing we have is this: https://nav.readthedocs.io/en/latest/hacking/adding-environment-probe-support.html?highlight=smidump#dumping-the-mib

hmpf commented 2 years ago

I've asked Håvard E. about what Zino does about this.

The JUNIPER-ALARM-MIB only exists on equipment that has a physical craft interface installed, supplying one actual LED to show the colors, and buttons to turn monitoring on or off for various subsystems. Older equipment generally has this interface, but at least the MX204 and MX10003 do not. There is still a counter for these alarms though; it is just not officially accessible via SNMP.

A workaround is to have a script on the equipment that periodically reads the values (show system alarms) and makes them available via a different MIB, the Utility MIB (mib-jnx-util).

gw> show system alarms | display xml 
<rpc-reply xmlns:junos="http://xml.juniper.net/junos/19.4R0/junos">
    <alarm-information xmlns="http://xml.juniper.net/junos/19.4R0/junos-alarm">
        <alarm-summary>
            <no-active-alarms/>
        </alarm-summary>
    </alarm-information>
    <cli>
        <banner>{master}</banner>
    </cli>
</rpc-reply>

{master}
gw>
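
A script that reads this output could detect the "no alarms" case roughly as follows. This is a sketch under assumptions: the function names are made up, only the `<no-active-alarms/>` reply shown above is taken from real output, and anything else inside `<alarm-summary>` is simply treated as "alarms present" without looking at its structure.

```python
import xml.etree.ElementTree as ET

# The XML part of the reply shown above (CLI prompt lines removed).
SAMPLE = """
<rpc-reply xmlns:junos="http://xml.juniper.net/junos/19.4R0/junos">
    <alarm-information xmlns="http://xml.juniper.net/junos/19.4R0/junos-alarm">
        <alarm-summary>
            <no-active-alarms/>
        </alarm-summary>
    </alarm-information>
    <cli>
        <banner>{master}</banner>
    </cli>
</rpc-reply>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix, since it varies with the Junos version."""
    return tag.rsplit("}", 1)[-1]


def has_active_alarms(xml_text: str) -> bool:
    """Return False if the summary holds <no-active-alarms/>, True otherwise."""
    root = ET.fromstring(xml_text.strip())
    summary = next(
        (el for el in root.iter() if local_name(el.tag) == "alarm-summary"), None
    )
    if summary is None:
        raise ValueError("no alarm-summary element in reply")
    return not any(local_name(child.tag) == "no-active-alarms" for child in summary)
```

Matching on the local element name avoids hard-coding the version-specific namespace URL (19.4R0 in the sample).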
hmpf commented 2 years ago

When using mib-jnx-util, Zino uses the OIDs jnxUtilUintValue.82.101.100.65.108.97.114.109 for the red alarm counter and jnxUtilUintValue.89.101.108.108.111.119.65.108.97.114.109 for the yellow alarm counter.
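
The numeric suffixes in those OIDs are the ASCII codes of the names "RedAlarm" and "YellowAlarm". A small sketch for converting between the two forms (the helper names are made up for illustration):

```python
def name_to_oid_suffix(name: str) -> str:
    """Encode a name as dotted ASCII codes, one sub-identifier per character."""
    return ".".join(str(byte) for byte in name.encode("ascii"))


def oid_suffix_to_name(suffix: str) -> str:
    """Decode dotted ASCII codes back into the name they spell."""
    return bytes(int(part) for part in suffix.strip(".").split(".")).decode("ascii")


print(oid_suffix_to_name("82.101.100.65.108.97.114.109"))  # RedAlarm
print(name_to_oid_suffix("YellowAlarm"))  # 89.101.108.108.111.119.65.108.97.114.109
```

This makes it easier to check that the on-device script and the poller agree on the names being used.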

hmpf commented 2 years ago

The steps so far seem to be:

lunkwill42 commented 2 years ago

#2368 adds the necessary documentation for some of the command line utilities that are useful for testing SNMP OID compatibility with NAV...

hmpf commented 2 years ago

I've decided on making a new ipdevpoll plugin just for these weirdos. It does not store the values; it just dumps them into eventengine if they are not zero.

hmpf commented 2 years ago

When converting a MIB to Python with smidump using the -k flag, also raise the error level above 3 with the -l flag.

smidump -k -l 5 -f python ./A.mib > A.py

If it complains "failed to locate MIB module foo", get the missing MIBs and preload them with the -p flag:

smidump -k -l 5 -f python -p ./foo.mib ./A.mib > A.py

lunkwill42 commented 2 years ago

After a design discussion with @knutvi, we have more or less settled on a way forward.

For the CNaaS team, it's important to know at any given time what the current number of yellow or red alerts in a Juniper chassis is.

I offered up some interpretations of this, and after some discussion, we decided that the cleanest implementation in a NAV context would be this design:

  1. When ipdevpoll detects that Juniper device D has >0 yellow alerts, it should post a start-state event to this effect, and include the alert count as a variable in the event varmap (the list of arbitrary attributes that can be attached to an event). Likewise, when it sees that Juniper device D has =0 yellow alerts, it should post a corresponding end-state event. This is close to what #2388 is currently homing in on.
  2. How to actually respond to these events is up to the eventengine, and a plugin P to handle these events needs to be written.
  3. When P receives a start event that is not deemed to be a duplicate, it should post a corresponding alert, and copy the alert count variable into the generated AlertHistory record.
  4. When P receives a start event for D that is deemed a duplicate, it should do some further verification: If the seemingly duplicate event has a different alert count than the existing AlertHistory entry, it should:
    1. Close the existing AlertHistory entry, but suppress a regular end-alert from being sent (or the end-user may end up receiving two notifications about the transition, rather than just one).
    2. Post a new pair of Alert and AlertHistory records with the new alert count.
  5. When P receives an end-event that is deemed to match an open AlertHistory record, that record should be closed.

This means that every change in the alert count that does not transition to a count of 0 should cause an entirely new AlertHistory state for D to be created, while any old ones are resolved. Any change in the alert count to 0 should resolve any existing corresponding AlertHistory states.
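
The design above can be sketched as a small in-memory simulation. This is not NAV's eventengine or ipdevpoll API: `Event`, `AlertHistory`, `event_for_count` and `AlarmCountHandler` are hypothetical names, and notifications are just strings collected in a list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    device: str
    color: str  # "yellow" or "red"
    state: str  # "start" or "end"
    varmap: Dict[str, str] = field(default_factory=dict)


def event_for_count(device: str, color: str, count: int) -> Event:
    """Step 1: the poller posts a start event for count > 0, an end event for 0."""
    if count > 0:
        return Event(device, color, "start", {"count": str(count)})
    return Event(device, color, "end")


@dataclass
class AlertHistory:
    device: str
    color: str
    count: int
    open: bool = True


class AlarmCountHandler:
    """Stand-in for plugin P (steps 2 to 5)."""

    def __init__(self) -> None:
        self.history: List[AlertHistory] = []
        self.notifications: List[str] = []

    def _open_entry(self, event: Event) -> Optional[AlertHistory]:
        for entry in self.history:
            if entry.open and (entry.device, entry.color) == (event.device, event.color):
                return entry
        return None

    def handle(self, event: Event) -> None:
        existing = self._open_entry(event)
        if event.state == "start":
            count = int(event.varmap["count"])
            if existing is not None:
                if existing.count == count:
                    return  # true duplicate: nothing to do
                existing.open = False  # step 4.1: close without an end notification
            # Steps 3 and 4.2: new record carrying the alert count.
            self.history.append(AlertHistory(event.device, event.color, count))
            self.notifications.append(f"start {event.device} {event.color} count={count}")
        elif existing is not None:
            existing.open = False  # step 5: close the matching open record
            self.notifications.append(f"end {event.device} {event.color}")
```

Feeding the handler the yellow counts 0, 2, 2, 3, 0 for one device yields three notifications (start with count 2, start with count 3, end) and two AlertHistory records, both closed at the end.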