Closed pioto closed 4 years ago
👍 Good to merge once your happy with it.
Hrm, testing, and I see 2 threads both waiting to take a task off the queue, both named for the same service config PID, which feels "weird", but may be part of how the DelayQueue works or something?
And no notification went out after waiting for my desired time, so I'll have to slap some more debug logs in here and trace it out more...
Not sure if it's a regression from this change, or if something else has changed recently, but seems I can't use and alarm.type.name == "PROBLEM"
in my JEXL filter; seems the type is null for all the alarms when I try testing some expressions.
Another issue I see is that, right now, when an alarm is cleared, we end up scheduling a task for its event, because that comes in via handleNewOrUpdatedAlarm()
. So we need to rework this to have the same "unschedule" logic here when the alarm severity is CLEARED
.
Main issue seems to be that tasks get added to the queue, but the .take()
call is never returning, even after waiting for the configured delay.
My task consumer thread seems to just be stuck waiting like this:
"PagerDuty-Forwarder-org.opennms.plugins.pagerduty.services.ede23f8d-fad2-4c77-bc3d-1c40d0a5995c-0" Id=2743 in TIMED_WAITING on lock=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fd8cedd
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:223)
at org.opennms.integrations.pagerduty.PagerDutyForwarder$TaskConsumer.run(PagerDutyForwarder.java:331)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
.... ahha!
I failed to implement the contract for Delayed
, as described on https://stackoverflow.com/questions/24730988/delayqueue-in-java-not-working-as-expected ... so I'll rework the PagerDutyForwarderTask
and see where that gets me...
OK, I've now gotten this tested and it works.
Main outstanding questions are about handling the delay around alarm updates. For example, if someone acks an alarm in OpenNMS which is still in the DelayQueue
, the current behavior is to remove it from the queue, so no PagerDuty notification happens.
I think it's also possible for some sort of alarms to have updates happen that'll end up getting delayed as a result of this, so it may be better to rework this to basically update the enqueued task with the new alarm state, but send the notification based off of the initial delay timer?
But, it does work, so possible this is "good enough"? And I think that the behavior is effectively unchanged for existing service configs w/o a holdDownDelay
configured, so this is probably ready to merge?
Thanks for putting this together!
There's still more work to do on this, but the big picture plan is here:
holdDownDelay
configuration property, which takes a string injava.time.Duration
format (e.g.PT5M
for "5 minutes"). If not specified, defaults to a delay of 0 (java.time.Duration.ZERO
)DelayQueue
to queue up these notification tasks, which only releases them from the queue after the delay has expired.Fixes #3