Open ronindesign opened 8 years ago
I'm having the same issue.
I added a cron job as well to trigger pagerduty_icinga.pl flush
but I still have issues where incidents are not created due to the same locking.
In the syslog I can see the following errors: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable
No edits to this repo in a couple of years, so I doubt this will get a fix. I personally never resolved this.
Thanks. I also opened a ticket with PD, let's wait for their thoughts. Did the cron job workaround work for you, or did you still get these locks?
I find it strange that these two products don't have a better integration.
So I worked on this over a year ago, but if I remember correctly:
Using the cron job doesn't fix the lock issue (or the resulting errors); they still happen regardless. For me, a lock conflict simply meant the rest of the queue was delayed by 1 minute (until the cron job ran again). Otherwise, all entries were sent and nothing was missed; some were just delayed.
I hope that makes sense? I can elaborate further if needed.
PagerDuty are working on a new integration using the Nagios agent. It should be out shortly.
So has the new integration using the nagios agent been released?
Looks like they have a Nagios integration now. I did a quick search and came up with: https://www.pagerduty.com/docs/guides/nagios-perl-integration-guide/ https://www.pagerduty.com/docs/guides/nagios-integration-guide/
I found that first one as well, but it references https://github.com/PagerDuty/pagerduty-nagios-pl which says:
Latest commit 6fecda3 on Jul 28, 2014
That's even older than this repo.
Looks like it might be https://github.com/PagerDuty/pdagent-integrations ... wonder how much of a pain it'll be to make that work with Icinga2.
Looks like there is already some icinga2 support built in: https://github.com/PagerDuty/pdagent-integrations/commit/5f675d9299541aafd79cd5bc4f7cb7e648f4e574 https://github.com/PagerDuty/pdagent-integrations/pull/23 https://www.pagerduty.com/docs/guides/icinga2-integration-guide/
EDIT: added some links
I am also having this issue. Just like @ronindesign mentioned. Was anybody able to find a solution for this?
UPDATE: Wrong permissions on folder /tmp/pagerduty_icinga.
chown nagios:nagios -R /tmp/pagerduty_icinga
fixed it.
When Icinga triggers multiple issues, the NotificationCommand "notify-service-by-pagerduty" fires multiple times. One of the calls makes it, locking / blocking on file: /tmp/pagerduty/lockfile. All of the other instances of notify-service-by-pagerduty fail, with their shell script exiting on the following error:
/var/log/icinga/icinga.log:
/var/log/syslog:
This happens because each Icinga event triggers an enqueue on pagerduty_icinga.pl, which internally calls (or tries to call) the method 'lock_and_flush_queue'. Only one instance acquires the lock; the others are blocked.
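To illustrate the failure mode, here is a minimal Python sketch (not the actual Perl code from pagerduty_icinga.pl) of two callers taking a non-blocking exclusive flock on the same file. The helper name `try_lock` and the temporary lockfile path are made up for this example. The second attempt fails with EAGAIN, which the OS reports as "Resource temporarily unavailable", the same message seen in syslog.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Try a non-blocking exclusive flock on path.

    Returns (fd, None) if the lock was acquired, or (None, errno) if not.
    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        return None, exc.errno
    return fd, None


if __name__ == "__main__":
    # Stand-in for the real lockfile; the path is an assumption.
    lockfile = os.path.join(tempfile.mkdtemp(), "lockfile")

    first_fd, first_err = try_lock(lockfile)    # first caller gets the lock
    second_fd, second_err = try_lock(lockfile)  # second caller is refused

    print("first caller locked:", first_fd is not None)
    if second_err in (errno.EAGAIN, errno.EWOULDBLOCK):
        print("second caller failed:", os.strerror(second_err))
```

Each call opens the file separately, so the two locks conflict the same way they would across separate notification processes.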
This is not a fatal issue. If I have my cron job set up correctly, the other entries will be sent 1 minute later, when 'pagerduty_icinga.pl flush' is called. However, this is still not ideal. The pagerduty_icinga.pl enqueue process should either only enqueue (without attempting a flush, and thus blocking itself) or it should implement some passive check timeout / keepalive option in the Perl script for the 'lock_and_flush_queue' section.
These processes finish almost immediately, so a keepalive would only need to last a few seconds, after which the calls could still be allowed to fail out. There would just be a small buffer / threshold in which multiple calls could be made successively.
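The keepalive idea could look roughly like the following Python sketch. This is not a patch for pagerduty_icinga.pl, just an illustration of retrying a non-blocking flock for a few seconds before giving up. The function name `lock_with_timeout` and the `timeout` / `interval` defaults are assumptions chosen for this example.

```python
import errno
import fcntl
import os
import time


def lock_with_timeout(path, timeout=5.0, interval=0.1):
    """Retry a non-blocking exclusive flock until it succeeds or timeout expires.

    Returns the locked file descriptor, or None if the lock could not be
    acquired in time. Hypothetical helper; the defaults are illustrative.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # Anything other than "lock is held by someone else" is a real error.
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                os.close(fd)
                raise
            if time.monotonic() >= deadline:
                # Give up; the queued entry is left for the next cron flush.
                os.close(fd)
                return None
            time.sleep(interval)
```

A caller would flush the queue only when a descriptor is returned, and otherwise exit quietly so the entry is picked up by the next 'pagerduty_icinga.pl flush' run from cron.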