czerwonk / junos_exporter

Exporter for devices running JunOS to use with https://prometheus.io/
MIT License

poor overall exporter performance when targets down #133

Open matejv opened 3 years ago

matejv commented 3 years ago

We're trying to use junos_exporter to scrape around 2500 Junos devices. After a lot of testing we figured out that overall exporter performance degrades with every target that is unreachable. That is: the Prometheus scrape_duration_seconds for a given target increases even though junos_collector_duration_seconds stays the same, so something in the exporter itself is causing the scrape to take longer.

In our environment we constantly have about 5% or more devices unreachable, so this is an issue for us.

We traced the issue to the mutex lock in the connect function. It makes the function wait whenever another goroutine is trying to establish an SSH connection, even if the waiting goroutine is handling a target that already has an established SSH connection.
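For context, here is a minimal sketch of the pattern described above (the names and structure are illustrative, not the exporter's actual code): a single mutex guards all connection setup, so a scrape of a healthy, already-connected target still queues behind a goroutine that is blocked dialing an unreachable one.

```go
package connector

import (
	"sync"

	"golang.org/x/crypto/ssh"
)

// connectionManager is a simplified stand-in for the exporter's connection
// handling; names and fields here are hypothetical.
type connectionManager struct {
	mu          sync.Mutex             // one lock shared by ALL targets
	connections map[string]*ssh.Client // cached sessions keyed by host, assumed initialized with make()
}

// connect returns a cached connection or dials a new one. Because mu is held
// for the whole call, a dial that waits out its timeout on a dead host blocks
// every other caller, even callers whose connection is already cached.
func (cm *connectionManager) connect(host string, cfg *ssh.ClientConfig) (*ssh.Client, error) {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	if c, ok := cm.connections[host]; ok {
		return c, nil // healthy target, but it still had to queue for the lock
	}

	// For an unreachable host this blocks until cfg.Timeout expires,
	// while all other scrapes wait on cm.mu.
	c, err := ssh.Dial("tcp", host+":22", cfg)
	if err != nil {
		return nil, err
	}
	cm.connections[host] = c
	return c, nil
}
```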

We tested this with 100 targets and a 60-second scrape interval. At first all targets were reachable and the exporter performed fine. Then we started turning off a certain number of devices at a time and measured the difference between scrape_duration_seconds and junos_collector_duration_seconds for targets that were still reachable (let's call this value scrape lag). This is the PromQL we measured:

avg(scrape_duration_seconds{job="junos"} and ignoring(target) junos_up == 1) by (exporter) -
avg(junos_collector_duration_seconds{job="junos"} and ignoring(target) junos_up{job="junos"} == 1) by (exporter)

These are the results:

targets down    scrape lag [s]
0               0.01
5               1.1
10              10.2
11              12.0
12              13.2
13              > 60 (see below)
At 13 targets down the exporter took over a minute to respond, and this kept increasing over time. In connection_manager.go there is a constant const timeoutInSeconds = 5, so 13 targets down kept the connect function locked for about 65 seconds, leaving no time at all for scraping the healthy targets.
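To make the arithmetic concrete, here is a hedged sketch of how a per-attempt dial timeout like that constant turns into lock hold time when connects are serialized (the clientConfig helper and its parameters are made up for illustration; only the constant name comes from the exporter):

```go
package connector

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// timeoutInSeconds mirrors the constant mentioned above; everything else in
// this snippet is illustrative, not the exporter's actual code.
const timeoutInSeconds = 5

func clientConfig(user string, auth []ssh.AuthMethod) *ssh.ClientConfig {
	return &ssh.ClientConfig{
		User:            user,
		Auth:            auth,
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for a sketch only
		// Each failed dial to a dead host blocks for this long. With connects
		// serialized behind a single mutex, 13 down targets hold the lock for
		// roughly 13 * 5 s = 65 s per cycle, more than the 60 s scrape interval.
		Timeout: timeoutInSeconds * time.Second,
	}
}
```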

I decreased timeoutInSeconds to 2 seconds and repeated the same test:

targets down    scrape lag [s]
0               0.01
5               0.35
10              0.49
11              0.51
12              0.55
13              0.65

That is much better, but it will still bottleneck with a large enough number of targets down.

I looked at this cisco exporter, which manages connections in a similar manner but does more fine-grained locking. Based on that code, I managed to modify junos_exporter to handle targets that are down without impacting other targets. But I must say, I don't fully understand the code.
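For illustration, a minimal sketch of per-host locking in that spirit (all names are hypothetical; this is not the actual junos_exporter or cisco exporter code): the global lock only guards the map of per-host entries, and the SSH dial happens under the per-host lock, so a timeout on one dead device no longer delays healthy ones.

```go
package connector

import (
	"sync"

	"golang.org/x/crypto/ssh"
)

// hostLock bundles the lock and cached client for a single target.
type hostLock struct {
	mu     sync.Mutex
	client *ssh.Client
}

type connectionManager struct {
	mu    sync.Mutex // protects the hosts map only, never held during a dial
	hosts map[string]*hostLock
}

func newConnectionManager() *connectionManager {
	return &connectionManager{hosts: make(map[string]*hostLock)}
}

// lockFor returns (creating if needed) the per-host entry. The global lock is
// only held for the map lookup, so it is released almost immediately.
func (cm *connectionManager) lockFor(host string) *hostLock {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	hl, ok := cm.hosts[host]
	if !ok {
		hl = &hostLock{}
		cm.hosts[host] = hl
	}
	return hl
}

// connect dials (or reuses) a connection for one host. A dial that times out
// only blocks other scrapes of the same host, not of healthy targets.
func (cm *connectionManager) connect(host string, cfg *ssh.ClientConfig) (*ssh.Client, error) {
	hl := cm.lockFor(host)
	hl.mu.Lock()
	defer hl.mu.Unlock()

	if hl.client != nil {
		return hl.client, nil
	}
	c, err := ssh.Dial("tcp", host+":22", cfg)
	if err != nil {
		return nil, err
	}
	hl.client = c
	return c, nil
}
```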

When tested with 2250 up targets and 320 down targets the exporter now performs without issues (scrape lag is 10ms for healthy targets).

I can provide a PR if you'd like, but I'm not at home in Go, so it will be a bit messy :)

AKYD commented 3 years ago

I use this for a few devices (fewer than 30), so I did not run into any big issues, but I can see how, right now, it's not fit for large environments. I think you should provide a PR and the code will be refined as more people look at it.

momorientes commented 3 years ago

I too can confirm this behavior and would appreciate a fix :)

czerwonk commented 3 years ago

I will have a look into the issue. Thanks for pointing it out.