canonical / charm-prometheus-libvirt-exporter

A charm that provides per-domain metrics related to CPU, memory, disk and network usage using libvirt exporter.

Nrpe check alert for socket timeout #10

Closed jneo8 closed 9 months ago

jneo8 commented 9 months ago

In an overloaded environment,

curl http://127.0.0.1:9177/metrics takes more than 30 seconds,

while the check times out after 10 seconds (the default value), and this parameter is not configurable.

A config option to increase the timeout should be added to avoid false negatives.


Imported from Launchpad using lp2gh.

jneo8 commented 9 months ago

(by peter-sabaini) Some notes on this: on an affected system I can see the prometheus-libvirt-exporter regularly timing out on a lock held by a (migration?) process inside libvirtd

Error message:

... Failed to scrape metrics: virError(Code=68, Domain=10, Message='Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMigratePrepare3Params)')

There isn't too much the p-l-e can do about this. As peppepetra suggests, we could make the check_http nrpe check configurable with a timeout to reduce false negatives here.

However, independently of the nrpe check timeout, we will lose libvirt-exporter data precision: whenever we can't get a libvirtd connection (or it becomes unresponsive), we won't scrape any data either.
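As a rough illustration of the configurable check_http timeout suggested above, here is a minimal sketch that builds the plugin's argument list from a timeout value. The function name `build_check_http_args` and the idea that the value would come from a charm config option are assumptions for illustration; how the charm actually assembles its NRPE check is not shown in this thread. The `-I`, `-p`, `-u` and `-t` flags are standard check_http options.

```python
from typing import List


def build_check_http_args(
    timeout_seconds: int = 10,
    host: str = "127.0.0.1",
    port: int = 9177,
    path: str = "/metrics",
) -> List[str]:
    """Return a check_http argument list with a configurable timeout.

    Hypothetical helper: the default of 10 seconds mirrors the default
    mentioned in the report; everything else is illustrative.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return [
        "check_http",
        "-I", host,                   # IP address to connect to
        "-p", str(port),              # exporter port
        "-u", path,                   # URL path to request
        "-t", str(timeout_seconds),   # seconds before the plugin gives up
    ]


if __name__ == "__main__":
    # Example: allow slow scrapes during migrations up to 45 seconds.
    print(" ".join(build_check_http_args(timeout_seconds=45)))
```

Raising the value only helps if the Nagios server side also waits long enough, so a larger plugin timeout on its own may not remove the alert.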

jneo8 commented 9 months ago

(by zzehring) Hit this again in an environment. For some additional context, it looks like the timeouts began right when many migrations were happening. It appears that migrations cause the exporter endpoint to become unresponsive for some time (as stated, looks like ~30 seconds). This definitely looks to be a problem with the exporter app and not the charm.

jneo8 commented 9 months ago

(by addyess) The check could be moved to /etc/cron.d/ to run on a 5m interval, saving its results to /var/lib/nagios/prometheus-libvirt-exporter.out
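One way to picture this cron-based approach: a script run from cron writes its result to a file, and the NRPE check only reads that file, so it returns quickly however slow the exporter is. The sketch below is an assumption-laden illustration, not what the charm does; the JSON file format, the function names `write_result` and `read_result`, and the staleness threshold are all hypothetical.

```python
import json
import time
from pathlib import Path

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def write_result(path, status, message, now=None):
    """Called by the cron job: store the latest check result atomically."""
    payload = {
        "status": status,
        "message": message,
        "timestamp": time.time() if now is None else now,
    }
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(payload))
    # Rename so a reader never sees a half-written file.
    tmp.replace(path)


def read_result(path, max_age_seconds=600, now=None):
    """Called by the NRPE check: return (exit_code, message) from the file."""
    now = time.time() if now is None else now
    try:
        payload = json.loads(Path(path).read_text())
        age = now - payload["timestamp"]
        status, message = payload["status"], payload["message"]
    except (OSError, ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN: no readable result file"
    if age > max_age_seconds:
        # A stale file means the cron job itself has stopped reporting.
        return UNKNOWN, f"UNKNOWN: result is {int(age)}s old"
    return status, message
```

With a 5m cron interval, a staleness threshold of about two intervals (600 seconds here) tolerates one missed run before the check reports UNKNOWN.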

jneo8 commented 9 months ago

(by thogarre) This is similar to the bug in prometheus-ceph-exporter (lp#1895558), and could potentially be fixed with the same MR - https://git.launchpad.net/charm-prometheus-ceph-exporter/commit/?id=be9aa5e3fdbfdddc6fafcc5705ee96fe33ea728e

jneo8 commented 9 months ago

(by stephanpampel) I created a merge proposal that is similar to the changes in charm-prometheus-ceph-exporter.

However, the timeout also depends on the nagios server settings check_timeout (https://charmhub.io/nagios/configure#check_timeout) and service_check_timeout (https://charmhub.io/nagios/configure#service_check_timeout); these might also need to be increased to see an improvement.

jneo8 commented 9 months ago

(by stephanpampel) After discussions about this bug we decided that it would be best to just check whether prometheus-libvirt-exporter is reachable, and not whether it can get metrics from libvirt, as libvirt can be blocked by migrations.

Therefore I changed the check to query '/' instead of '/metrics' and made sure we get a valid response (status code 200). https://code.launchpad.net/~stephanpampel/charm-prometheus-libvirt-exporter/+git/charm-prometheus-libvirt-exporter/+merge/414344
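For readers who want to see the shape of such a reachability check, here is a minimal Python sketch that requests '/' and treats only status code 200 as healthy. It is an illustration under assumptions, not the code from the merge proposal: the function name `check_exporter_reachable`, the default URL and the message wording are hypothetical.

```python
import urllib.error
import urllib.request

# Nagios plugin exit codes used by this sketch.
OK, CRITICAL = 0, 2


def check_exporter_reachable(base_url="http://127.0.0.1:9177", timeout_seconds=10.0):
    """Return (exit_code, message) based on an HTTP GET of '/'.

    Only the exporter's own HTTP server is probed; '/metrics' is not
    requested, so a blocked libvirtd does not affect the result.
    """
    url = base_url.rstrip("/") + "/"
    # An empty ProxyHandler bypasses any proxy set in the environment,
    # since this is meant to be a local probe.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return CRITICAL, f"CRITICAL: {url} returned HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, connection refused and socket timeouts.
        return CRITICAL, f"CRITICAL: {url} unreachable: {exc}"
    if status == 200:
        return OK, f"OK: {url} returned HTTP 200"
    return CRITICAL, f"CRITICAL: {url} returned HTTP {status}"


if __name__ == "__main__":
    code, message = check_exporter_reachable()
    print(message)
    raise SystemExit(code)
```

The trade-off is the one described above: this check stays green while migrations block libvirt, so gaps in the scraped metrics have to be detected some other way.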