CastawayLabs / cachet-monitor

Distributed monitoring plugin for CachetHQ
https://castawaylabs.github.io/cachet-monitor/
MIT License

frequent timeout with many sites to monitor #100

Open osallou opened 5 years ago

osallou commented 5 years ago

Hi, I have a setup with ~100 sites to monitor in config.yml. The software works, but I get many incidents/outages on sites with the error "net/http: request canceled (Client.Timeout exceeded while awaiting headers)".

I tested one of the sites alone in the config (with a different config.yml, but on the same server) and it always shows the site as up and running, with no timeouts or errors.

So the problem seems to appear when the monitor gets "too many" sites to monitor.

osallou commented 5 years ago

After some testing:

It appears that at the client.Do call, requests sit pending (they are handled one after the other) while the elapsed time keeps increasing. Printing the lag shows that it grows with each monitor handled.

It seems that the HTTP timeout is set for all monitors and starts at each tick, even though the request has not been sent yet (the requests are processed one after the other rather than concurrently). As a result, every request handled after the timeout value has elapsed times out, and the lag value is also wrong, since it accumulates the response time of all the requests handled before it. So either you have 1 CPU (GOMAXPROCS) per monitor and everything is fine, or you get wrong data (and checks) with too many monitors.