PagerDuty / pagerduty-icinga-pl

Icinga Integration for PagerDuty via Perl Wrapper

Perl script blocking itself on multiple icinga events #8

Open ronindesign opened 8 years ago

ronindesign commented 8 years ago

When Icinga triggers multiple issues, the NotificationCommand "notify-service-by-pagerduty" fires multiple times. One of the calls makes it through and holds the lock on the file /tmp/pagerduty_icinga/lockfile. All of the other instances of notify-service-by-pagerduty fail, with their shell script exiting on the following error:

/var/log/icinga/icinga.log:

[2016-02-18 13:22:38 -0800] warning/PluginNotificationTask: Notification command for object 'celli.sports-it.com!apt' (PID: 15295, arguments: 'sh' '-c' '/usr/local/bin/pagerduty_icinga.pl enqueue -f pd_nagios_object=service') terminated with exit code 11, output: pagerduty_icinga[15297]: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable Resource temporarily unavailable at /usr/local/bin/pagerduty_icinga.pl line 221.

/var/log/syslog:

pagerduty_icinga[15297]: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable

This happens because each Icinga event triggers an enqueue on pagerduty_icinga.pl, which internally calls (or tries to call) the method 'lock_and_flush_queue'. Only one instance acquires the lock; the others fail to get it and exit with the error above.
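To illustrate the failure mode, here is a minimal Python sketch (not the Perl script's actual code). It assumes the script takes an exclusive, non-blocking flock on the lockfile, which is what the "Resource temporarily unavailable" message suggests. The names try_lock and demo_contention are hypothetical, and a temporary file stands in for /tmp/pagerduty_icinga/lockfile.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempt an exclusive, non-blocking flock.

    Returns None on success, or the errno if the lock could not be taken.
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return None
    except OSError as exc:
        return exc.errno


def demo_contention():
    """Open the same lockfile twice and try to lock both descriptors."""
    # Hypothetical stand-in for /tmp/pagerduty_icinga/lockfile.
    path = os.path.join(tempfile.mkdtemp(), "lockfile")
    first = os.open(path, os.O_CREAT | os.O_RDWR)
    second = os.open(path, os.O_RDWR)
    try:
        # flock locks belong to the open file description, so the second
        # open() contends with the first just like a second process would.
        return try_lock(first), try_lock(second)
    finally:
        os.close(first)
        os.close(second)
```

The first attempt succeeds and the second fails with EAGAIN instead of waiting. On Linux EAGAIN is 11, which would be consistent with the exit code 11 in the Icinga log.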

This is not a fatal issue. If I have my cron job set up correctly, the other entries will be sent 1 minute later, when 'pagerduty_icinga.pl flush' is called. However, this is still not ideal. The pagerduty_icinga.pl enqueue process should either only enqueue (without attempting a flush, and thus blocking itself) or implement some passive check timeout / keepalive option in the Perl script for the 'lock_and_flush_queue' section.

These processes finish almost immediately, so a keepalive would only need to last a few seconds, after which the calls could still be allowed to fail out. There would just be a small buffer / threshold where multiple calls could be made successively.
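The keepalive idea could look roughly like the sketch below. It is written in Python for illustration only, not as a patch for the Perl script. It assumes the lock is a non-blocking flock; lock_with_timeout and its timeout and interval parameters are hypothetical names.

```python
import fcntl
import time


def lock_with_timeout(fd, timeout=5.0, interval=0.1):
    """Retry a non-blocking exclusive flock until it succeeds or time runs out.

    Returns True if the lock was acquired, False if the timeout elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            # Another instance holds the lock; give up once the deadline
            # has passed, otherwise wait briefly and try again.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller that gets False back could still exit the way the script does today and leave the entry for the next cron-driven flush, so the worst case would be no different from the current behavior.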

oryagel commented 7 years ago

I'm having the same issue. I added a cron job as well to trigger pagerduty_icinga.pl flush, but I still have cases where incidents are not created due to the same locking. In the syslog I can see the following errors: flock /tmp/pagerduty_icinga/lockfile failed: Resource temporarily unavailable

ronindesign commented 7 years ago

No edits to this repo in a couple of years, so I doubt this will get a fix. I personally never resolved this.

oryagel commented 7 years ago

Thanks. I also opened a ticket with PD, let's wait for their thoughts. Did the cron job workaround work for you or did you still get these locks?

I find it strange that these two products don't have a better integration.

ronindesign commented 7 years ago

So I worked on this over a year ago, but if I remember correctly:

Using the cron job doesn't fix the lock issue (or the resulting errors); they still happen regardless. For me, when there was a lock conflict, it simply meant the rest of the queue was delayed 1 minute (until the cron job ran again). Otherwise, all entries were sent and nothing was missed; some were just delayed.

I hope that makes sense? I can elaborate further if needed.

oryagel commented 7 years ago

PagerDuty are working on a new integration using the Nagios agent. It should be out shortly.

ChrisHeerschap commented 6 years ago

So has the new integration using the nagios agent been released?

ronindesign commented 6 years ago

Looks like they have Nagios integration now, did a quick search and came up with: https://www.pagerduty.com/docs/guides/nagios-perl-integration-guide/ https://www.pagerduty.com/docs/guides/nagios-integration-guide/

ChrisHeerschap commented 6 years ago

I found that first one as well, but it references https://github.com/PagerDuty/pagerduty-nagios-pl which says:

Latest commit 6fecda3 on Jul 28, 2014

That's even older than this repo.

Looks like it might be https://github.com/PagerDuty/pdagent-integrations ... wonder how much of a pain it'll be to make that work with Icinga2.

ronindesign commented 6 years ago

Looks like there is already some icinga2 support built in: https://github.com/PagerDuty/pdagent-integrations/commit/5f675d9299541aafd79cd5bc4f7cb7e648f4e574 https://github.com/PagerDuty/pdagent-integrations/pull/23 https://www.pagerduty.com/docs/guides/icinga2-integration-guide/

EDIT: added some links

lpossamai commented 5 years ago

I am also having this issue. Just like @ronindesign mentioned. Was anybody able to find a solution for this?

UPDATE: Wrong permissions on folder /tmp/pagerduty_icinga.

chown nagios:nagios -R /tmp/pagerduty_icinga fixed it.