matejv opened this issue 3 years ago
I use this for a few devices (fewer than 30), so I have not run into any big issues, but I can see how, right now, it's not fit for large environments. I think you can provide a PR and the code will be refined as more people look at it.
I too can confirm this behavior and would appreciate a fix :)
I will have a look into the issue. Thanks for pointing it out.
We're trying to use junos_exporter to scrape around 2500 Junos devices. After a lot of testing we figured out that overall exporter performance depends on every target that is unreachable. That is: the Prometheus `scrape_duration_seconds` for a given target increases, even if `junos_collector_duration_seconds` stays the same. So obviously something in the exporter causes the scrape to take longer. In our environment we constantly have about 5% or more of our devices unreachable, so this is an issue for us.
We identified the issue as the mutex lock in the `connect` function. It causes the function to wait if another goroutine is trying to establish an SSH connection, even if the waiting goroutine is handling a target that already has an established SSH connection.
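To make the problem concrete, here is a minimal, self-contained sketch of the coarse-grained locking pattern described above. The type and function names are invented for illustration (and plain TCP dialing stands in for SSH); the real `connection_manager.go` differs in detail:

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

const timeoutInSeconds = 5

// connectionManager caches one connection per target behind a single mutex.
// This mirrors the coarse-grained locking described above, not the actual code.
type connectionManager struct {
	mu          sync.Mutex
	connections map[string]net.Conn
}

// connect returns a cached connection or dials a new one. Because the single
// mutex is held across the dial, one unreachable host can block lookups for
// hosts that already have an established connection.
func (m *connectionManager) connect(host string) (net.Conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if c, ok := m.connections[host]; ok {
		return c, nil
	}

	c, err := net.DialTimeout("tcp", host, timeoutInSeconds*time.Second)
	if err != nil {
		return nil, err // the lock was held for the full timeout
	}
	m.connections[host] = c
	return c, nil
}

func main() {
	m := &connectionManager{connections: map[string]net.Conn{}}
	if _, err := m.connect("192.0.2.1:22"); err != nil {
		fmt.Println("dial failed:", err)
	}
}
```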
We tested this with 100 targets and a 60 second scrape interval. At first all targets were reachable and the exporter performed fine. Then we started turning off a certain number of devices at a time and measured the difference between `scrape_duration_seconds` and `junos_collector_duration_seconds` for targets that were still reachable (let's call this value scrape lag). This is the PromQL that we measured:

These are the results:
At 13 targets down the exporter took over a minute to respond, and this was increasing over time. In `connection_manager.go` there is a constant `const timeoutInSeconds = 5`. 13 targets down kept the `connect` function locked for 65 seconds, leaving no more time for scraping the healthy targets.

I decreased `timeoutInSeconds` to 2 seconds and repeated the same test:

That is much better, but it will still bottleneck with a large enough number of targets down.
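As a back-of-the-envelope illustration of that bottleneck (assuming dials to down targets are fully serialized behind the shared lock):

```go
package main

import "fmt"

func main() {
	// Worst case: each dial to a down target holds the shared lock for the
	// full timeout, so the lock can be held for downTargets * timeout seconds.
	downTargets := 13
	for _, timeout := range []int{5, 2} {
		fmt.Printf("timeout=%ds, %d targets down -> lock held up to %ds (scrape interval is 60s)\n",
			timeout, downTargets, downTargets*timeout)
	}
}
```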
I looked at this cisco exporter, which manages connections in a similar manner but does more fine-grained locking. Looking at that code, I managed to modify junos_exporter to handle targets that are down without impacting other targets. But I must say, I don't fully understand the code.
When tested with 2250 up targets and 320 down targets, the exporter now performs without issues (scrape lag is 10ms for healthy targets).
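For illustration, here is a minimal sketch of the per-host locking approach described above. This is not the actual patch and all names are invented; the idea is that the shared mutex only guards the map lookup, while the dial (which can block for the full timeout) is protected by a lock scoped to a single host:

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

const timeoutInSeconds = 2

// hostConn holds the connection and a per-host lock, so dialing one host
// never blocks lookups for other hosts.
type hostConn struct {
	mu   sync.Mutex
	conn net.Conn
}

type connectionManager struct {
	mu    sync.Mutex // guards the map only, never held across a dial
	hosts map[string]*hostConn
}

func newConnectionManager() *connectionManager {
	return &connectionManager{hosts: map[string]*hostConn{}}
}

// connect locks only the entry for the requested host while dialing.
func (m *connectionManager) connect(host string) (net.Conn, error) {
	m.mu.Lock()
	h, ok := m.hosts[host]
	if !ok {
		h = &hostConn{}
		m.hosts[host] = h
	}
	m.mu.Unlock()

	h.mu.Lock()
	defer h.mu.Unlock()

	if h.conn != nil {
		return h.conn, nil
	}
	c, err := net.DialTimeout("tcp", host, timeoutInSeconds*time.Second)
	if err != nil {
		return nil, err // only this host's lock was held during the timeout
	}
	h.conn = c
	return c, nil
}

func main() {
	m := newConnectionManager()
	if _, err := m.connect("192.0.2.1:22"); err != nil {
		fmt.Println("dial failed:", err)
	}
}
```

With this split, an unreachable device only delays scrapes for itself; healthy targets find their cached connection after a brief map lookup.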
I can provide a PR if you'd like, but I'm not at home in Go, so it will be a bit messy :)