SuperQ / smokeping_prober

Prometheus style smokeping
Apache License 2.0

Latency going up as more hosts are added #156

Open modena01 opened 2 months ago

modena01 commented 2 months ago

Thanks for smokeping! I am a Prometheus newb, so please bear with me. Smokeping was working fine for me at first with a single host. Then I tried adding about 100 additional hosts to ping, and the reported ICMP latency went up significantly. I dropped back down to 21 hosts, and latency dropped, but not back to the same level as with 1 target host. Is it a correct config to have:

targets:
- hosts:
  - my.one.host
  - my.two.host
- hosts:
  - my.three.host

Is the purpose of multiple "hosts" sections merely to allow different settings, such as interval and size, for different hosts? If smokeping is creating, tracking, and reporting buckets to Prometheus, is there a valid reason to scrape smokeping from Prometheus any more often than, say, every 1 min?

My prometheus config is as yet very simple:

- job_name: 'smokeping_prober'
  scrape_interval: 60s
  static_configs:
  - targets: ['localhost:9374']

From the prometheus log, I see a message like this when I have a single ICMP target:

"Waiting 1s between starting pingers" 

but with 21 targets I get:

"Waiting 47.619047ms between starting pingers"

so it is clearly dividing the number of targets into 1000ms, but I cannot find this in the smokeping code, so I guess it is prometheus doing this? I was looking at this trying to figure out why reported latency is going up higher and higher the more ICMP target hosts I add.
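The arithmetic behind those two log lines can be checked with a short sketch. It assumes, based only on the observed messages, that the older prober spread pinger start times evenly across a 1s window; the function names here are hypothetical and are not taken from the smokeping_prober source.

```python
# Hypothetical reconstruction of the spacing seen in the log messages.
# Assumption: pinger starts are spread evenly over a 1 second window.

ONE_SECOND_NS = 1_000_000_000


def pinger_start_spacing_ns(num_targets: int, window_ns: int = ONE_SECOND_NS) -> int:
    """Return the delay between pinger starts, in whole nanoseconds."""
    if num_targets < 1:
        raise ValueError("num_targets must be at least 1")
    # Integer division truncates to whole nanoseconds.
    return window_ns // num_targets


def format_ms(duration_ns: int) -> str:
    """Format a nanosecond duration as milliseconds with 6 decimals."""
    return f"{duration_ns / 1_000_000:.6f}ms"


if __name__ == "__main__":
    spacing = pinger_start_spacing_ns(21)
    print(format_ms(spacing))  # 47.619047ms
```

With 21 targets this reproduces the 47.619047ms figure, and with a single target the spacing is the full 1s window.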

Thanks for your help.

SuperQ commented 2 months ago

No, that message is from an older version of the smokeping_prober. The message was removed when we added dynamic reload support.

Reported latency may be going up because the prober is being starved for CPU and unable to process response packets fast enough.

modena01 commented 2 months ago

Thanks SuperQ, I have now updated to the latest version. Here is an example of what happened when I went from 21 hosts to around 100.

image

Do I need to run multiple smokeping instances and split the hosts out per instance? Increasing the interval period does not seem to help.

modena01 commented 2 months ago

I'm looking at needing hundreds (probably 500+) hosts to monitor...

Nachtfalkeaw commented 1 month ago

How often do you ping per second, and how many hosts? What packet size do you use for the ICMP packets? How many CPU cores do you have?

I am pinging a few hundred hosts (200-300), but with different intervals: some I ping every 200ms and others every 5s. I noticed that the CPU load is higher at the beginning than later on; maybe the load gets distributed over time. Running "top", I sometimes see smokeping_prober consume 1100% CPU and at other times only 300-500%.
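To reason about how much work the prober has to do, it can help to estimate the total number of pings sent per second across all target groups. The sketch below is only an illustration: the split of hosts between the 200ms and 5s intervals is a made-up example, not my real configuration, and the function name is hypothetical.

```python
# Rough estimate of the aggregate ping rate for a set of target groups.
# Each group is (number_of_hosts, ping_interval_in_milliseconds).

def total_pings_per_second(groups: list[tuple[int, int]]) -> float:
    """Sum the per-group ping rates (hosts * pings per second per host)."""
    total = 0.0
    for host_count, interval_ms in groups:
        if interval_ms <= 0:
            raise ValueError("interval_ms must be positive")
        total += host_count * 1000 / interval_ms
    return total


if __name__ == "__main__":
    # Hypothetical split: 100 hosts every 200ms, 150 hosts every 5s.
    example_groups = [(100, 200), (150, 5000)]
    print(total_pings_per_second(example_groups))  # 530.0
```

In this made-up split, the 100 fast targets account for 500 of the 530 pings per second, so the short-interval groups dominate the load.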

The Prometheus scrape interval defines the bucket length, which means each bucket contains all ping results from one scrape interval. If you ping a host every 1s and scrape every 60s, you have 60 results in that bucket. This may be "ok" for you, but if some pings have high latency, you do not know whether they occurred at the beginning, at the end, or spread throughout the bucket.

So it depends on the use case. I scrape every 15s, so each scrape contains at least 3 pings for the targets pinged every 5s.
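The relationship between ping interval and scrape interval is simple division, which the following sketch works through for the numbers mentioned above. The function name is hypothetical, and intervals are given in whole milliseconds to avoid floating point rounding.

```python
# How many ping results fall into one scrape interval.
# Intervals are integers in milliseconds so the division is exact.

def pings_per_scrape(scrape_interval_ms: int, ping_interval_ms: int) -> int:
    """Return the number of complete ping intervals within one scrape."""
    if scrape_interval_ms <= 0 or ping_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return scrape_interval_ms // ping_interval_ms


if __name__ == "__main__":
    print(pings_per_scrape(60_000, 1_000))  # 60 (ping every 1s, scrape every 60s)
    print(pings_per_scrape(15_000, 5_000))  # 3  (ping every 5s, scrape every 15s)
    print(pings_per_scrape(15_000, 200))    # 75 (ping every 200ms, scrape every 15s)
```

A shorter scrape interval narrows the time window in which a slow ping could have happened, at the cost of more samples stored in Prometheus.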

So back to your question: I would check your CPU consumption. If possible, just add a few more CPU cores and check how the behaviour changes.