TwiN / gatus

⛑ Automated developer-oriented status page
https://gatus.io
Apache License 2.0

Support for DNS cache #464

Open antoinekh opened 1 year ago

antoinekh commented 1 year ago

Describe the feature request

Add a DNS cache for better performance with a very large number of hosts

Why do you personally want this feature to be implemented?

We are trying to use Gatus in the datacenter to monitor several hundred or even a thousand hosts with ICMP. It works well with IPs, but it takes a long time when DNS resolution is involved.

Would it be possible to implement a DNS cache to avoid redoing the query at each test?
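
For illustration, a minimal sketch of what such an opt-in cache could look like, assuming it simply wraps host lookups with a fixed TTL; the package, type, and function names are hypothetical and not part of Gatus:

```go
package dnscache

import (
	"context"
	"net"
	"sync"
	"time"
)

// entry holds resolved addresses and the time at which they expire.
type entry struct {
	addrs   []string
	expires time.Time
}

// Cache is a tiny TTL-based lookup cache; none of these names come
// from the Gatus codebase, they are purely illustrative.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: make(map[string]entry)}
}

// LookupHost returns the cached addresses for host if they have not
// expired, and otherwise resolves the host and stores the result.
func (c *Cache) LookupHost(ctx context.Context, host string) ([]string, error) {
	c.mu.Lock()
	if e, ok := c.entries[host]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.addrs, nil
	}
	c.mu.Unlock()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.entries[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}
```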

How long have you been using this project?

1 month

Additional information

No response

TwiN commented 1 year ago

Why not spin up a dns proxy that supports caching?

I understand the issue you're having, but since Gatus is a monitoring tool, caching DNS may cause it to report an inaccurate status for an endpoint.
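
For reference, if a local caching proxy were running on, say, 127.0.0.1:53, a Go process could also be pointed at it explicitly. The address and hostname below are assumptions for the example, not Gatus settings:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Route this process's DNS lookups through a local caching proxy
	// (e.g. dnsmasq) instead of the default system configuration.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "127.0.0.1:53")
		},
	}

	addrs, err := resolver.LookupHost(context.Background(), "example.org")
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```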

antoinekh commented 1 year ago

Hi,

Because we already use internal DNS servers that do caching. We use Gatus on demand, but we are testing how far we can go.

And when there are more than 10k ICMP tests per minute using hostnames rather than IPs, the impact of a DNS cache is, in my opinion, not negligible. For ping-only checks, it would roughly halve the number of requests.

I was thinking of something like 2 options:

Maybe we are also going too far with Gatus and this is an edge use case. We did some tests with 2000 hosts, pinging the hostname every 5 seconds, and the time for a full load (pinging all the hosts) is close to 30 minutes. That's more than 20k tests per minute.

skhokhlov commented 1 year ago

It can cause inaccurate results for endpoints behind DNS load balancers because the load balancer's response will be cached.

antoinekh commented 1 year ago

Yes it can, which is why I'd like it as an option that is not enabled by default, only for specific cases.

TwiN commented 1 year ago

> We did some tests with 2000 hosts, pinging the hostname every 5 seconds, and the time for a full load (pinging all the hosts) is close to 30 minutes. That's more than 20k tests per minute.

@antoinekh Just so you know, on start, Gatus waits 777ms between each worker it starts, and there's one worker (goroutine) per endpoint, so if you have 2000 endpoints, technically, it'd take ~26 minutes (777ms * 2000) for all endpoints to have been evaluated at least once. After that, it's once every interval (likely slower if the monitoring lock is enabled, which it is by default).

Regardless, Gatus is not currently made for monitoring such a high number of endpoints. The backend can handle it, but the UI is probably unusable as I have yet to implement pagination (#150). While Gatus can be used for multiple use cases, even load testing, this seems like too much of an edge case.

Still, I'm okay with leaving the feature request open, and if people are also interested in a feature like that, they can 👍 the feature request & if there's enough interest, then we can look into implementing it.

antoinekh commented 1 year ago

@TwiN OK, I understand the start-up time. Is there any particular reason for this number?

We have the monitoring lock disabled. We don't use the GUI; we use /metrics, and the data is sent to Grafana, where we have dashboards with filters on group, name, hosts down, etc.

I will probably run the test again with a local DNS cache in Docker and divide the 777ms by 10 to see what that gives.

In any case, Gatus is a great tool in its normal use, thanks for that 👍🏻👍🏻

TwiN commented 1 year ago

I have to say that 777ms is just a random number I picked that seemed to make sense, but the logic behind it is that even if you have the monitoring lock disabled, you don't want all the goroutines to start at the same time and evaluate their endpoints at the same time, as that would both cause a spike in CPU and likely increase latency, which would lead to inaccurate request durations (and possibly even timeouts if there's a large number of endpoints).

Let's say it's midnight, you have 100 endpoints with a 30s interval, and you just started Gatus (with the monitoring lock disabled). Without the 777ms of delay between the start of each goroutine, they would all start at roughly 00:00:00. On the other hand, with the 777ms of delay, the first one would start at 00:00:00.000, the second one at 00:00:00.777, the third one at 00:00:01.554, etc. Of course this doesn't mean they won't ever overlap, since the interval gets mixed into this, but it reduces the odds.

Now that being said, perhaps I should decrease that 777ms to something like 111ms or 222ms, which would make starting your very last endpoint, assuming you have 2000 endpoints configured, take 222s (0.111s*2000) or 444s (0.222s*2000) respectively.
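
For illustration, here is a simplified sketch of the staggered start-up described above. This is not the actual Gatus code; the endpoint names and the `monitor` function are placeholders:

```go
package main

import (
	"fmt"
	"time"
)

// monitor stands in for the per-endpoint worker (goroutine) described above.
func monitor(endpoint string, interval time.Duration) {
	for {
		fmt.Println("checking", endpoint)
		time.Sleep(interval)
	}
}

func main() {
	endpoints := []string{"a.example.org", "b.example.org", "c.example.org"}
	startupDelay := 777 * time.Millisecond // the delay discussed above; lowering it speeds up the initial sweep

	for _, endpoint := range endpoints {
		go monitor(endpoint, 30*time.Second)
		// Staggering worker start-up avoids evaluating every endpoint at
		// the same instant; with N endpoints, the last worker starts
		// roughly N * startupDelay after launch.
		time.Sleep(startupDelay)
	}
	select {} // keep the sketch running
}
```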

heitorPB commented 1 year ago

Does Gatus resolve DNS records internally? If not, then one can configure a local DNS resolver (e.g. dnsmasq or systemd-resolved) to cache things at the server level.
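
As a rough way to check whether such a server-level cache is taking effect (assuming lookups go through the system resolver configuration), one could time consecutive lookups of the same name; the hostname here is just an example:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Time two consecutive lookups of the same name. With a caching
	// resolver (e.g. dnsmasq or systemd-resolved) configured in
	// /etc/resolv.conf, the second lookup should be answered from the
	// local cache and come back noticeably faster.
	for i := 1; i <= 2; i++ {
		start := time.Now()
		if _, err := net.LookupHost("example.org"); err != nil {
			panic(err)
		}
		fmt.Printf("lookup %d took %v\n", i, time.Since(start))
	}
}
```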

antoinekh commented 1 year ago

> I have to say that 777ms is just a random number I picked that seemed to make sense, but the logic behind it is that even if you have the monitoring lock disabled, you don't want all the goroutines to start at the same time and evaluate their endpoints at the same time, as that would both cause a spike in CPU and likely increase latency, which would lead to inaccurate request durations (and possibly even timeouts if there's a large number of endpoints).

> Let's say it's midnight, you have 100 endpoints with a 30s interval, and you just started Gatus (with the monitoring lock disabled). Without the 777ms of delay between the start of each goroutine, they would all start at roughly 00:00:00. On the other hand, with the 777ms of delay, the first one would start at 00:00:00.000, the second one at 00:00:00.777, the third one at 00:00:01.554, etc. Of course this doesn't mean they won't ever overlap, since the interval gets mixed into this, but it reduces the odds.

> Now that being said, perhaps I should decrease that 777ms to something like 111ms or 222ms, which would make starting your very last endpoint, assuming you have 2000 endpoints configured, take 222s (0.111s*2000) or 444s (0.222s*2000) respectively.

Thanks for the info. I tested by changing this parameter from 777ms to 57ms (roughly 1k endpoints per minute) and it works well.

For the DNS cache, I tried go-dnsmasq in Docker; the load taken off Gatus simply moved to go-dnsmasq. But go-dnsmasq remains a solution if I want to avoid unnecessary load on the production DNS servers.

The only way to really save load would be to integrate a DNS cache into Gatus, but I am not sure it's worth the work.