OpenNMS / opennms-pagerduty-plugin

OpenNMS <-> PagerDuty
6 stars 4 forks source link

Add Hold-Down Timer for PagerDuty Events #4

Closed pioto closed 4 years ago

pioto commented 4 years ago

There's still more work to do on this, but the big picture plan is here:

Fixes #3

j-white commented 4 years ago

👍 Good to merge once your happy with it.

pioto commented 4 years ago

Hrm, testing, and I see 2 threads both waiting to take a task off the queue, both named for the same service config PID, which feels "weird", but may be part of how the DelayQueue works or something?

And no notification went out after waiting for my desired time, so I'll have to slap some more debug logs in here and trace it out more...

pioto commented 4 years ago

Not sure if it's a regression from this change, or if something else has changed recently, but seems I can't use and alarm.type.name == "PROBLEM" in my JEXL filter; seems the type is null for all the alarms when I try testing some expressions.

pioto commented 4 years ago

Another issue I see is that, right now, when an alarm is cleared, we end up scheduling a task for its event, because that comes in via handleNewOrUpdatedAlarm(). So we need to rework this to have the same "unschedule" logic here when the alarm severity is CLEARED.

pioto commented 4 years ago

Main issue seems to be that tasks get added to the queue, but the .take() call is never returning, even after waiting for the configured delay.

My task consumer thread seems to just be stuck waiting like this:

"PagerDuty-Forwarder-org.opennms.plugins.pagerduty.services.ede23f8d-fad2-4c77-bc3d-1c40d0a5995c-0" Id=2743 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fd8cedd
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.DelayQueue.take(DelayQueue.java:223)
    at org.opennms.integrations.pagerduty.PagerDutyForwarder$TaskConsumer.run(PagerDutyForwarder.java:331)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)    

.... ahha!

I failed to implement the contract for Delayed, as described on https://stackoverflow.com/questions/24730988/delayqueue-in-java-not-working-as-expected ... so I'll rework the PagerDutyForwarderTask and see where that gets me...

pioto commented 4 years ago

OK, I've now gotten this tested and it works.

Main outstanding questions are about handling the delay around alarm updates. For example, if someone acks an alarm in OpenNMS which is still in the DelayQueue, the current behavior is to remove it from the queue, so no PagerDuty notification happens.

I think it's also possible for some sort of alarms to have updates happen that'll end up getting delayed as a result of this, so it may be better to rework this to basically update the enqueued task with the new alarm state, but send the notification based off of the initial delay timer?

But, it does work, so possible this is "good enough"? And I think that the behavior is effectively unchanged for existing service configs w/o a holdDownDelay configured, so this is probably ready to merge?

j-white commented 4 years ago

Thanks for putting this together!