DataDog / dogstatsd-ruby

A Ruby client for DogStatsd
https://www.datadoghq.com/
MIT License
179 stars 137 forks source link

Support IP changes of statsd hostname #284

Open Goltergaul opened 1 year ago

Goltergaul commented 1 year ago

we have the issue that the ip address of our statsd host is changing regularly on our kubernetes cluster due to some automated updates. In this case the current implementation continues to send UDP packets to the old IP address.

This pr addresses this issue by resolving the hostname to its ip all 60 seconds. in case the ip changed, it simply reconnects the UDP socket.

This is a bit of a draft to get your input. Please let me know if this is something you would be willing to merge

FYI this currently failes the allocation specs as I did not adapt those yet

Goltergaul commented 1 year ago

Todo: Disable this by default and make interval configurable

TheWudu commented 1 year ago

Looks good to me :-)

erikpaperik commented 1 year ago

Looks good :+1:

remeh commented 1 year ago

Hey @Goltergaul

Thanks for you contribution.

A problem with this implementation is that the Resolv.getaddress call can be slow (even if returning the same IP). It will block the thread during the DNS resolve, and thus the application calling the library. Even if fast, the DNS resolve could start to fail for some reasons and nothing would indicate an error with the current implementation + the IP would not be known anymore by the client instance (while still being possibly reachable).

A mitigation to your problem which can be considered with the dogstatsd-ruby client and which would mimick your implementation would be that every 60s, your application calls the flush method and then re-creates the client instance. WDYT?

andreaseger commented 1 year ago

my 2 cents, IMHO this should be handled (if enabled) in the statsd library. statsd is often used indirectly by other libraries to report metrics and for example not directly in our applications. But I can see your concern about slow dns resolution. Wondering if the DNS resolve could maybe be moved into a fiber(more lightweight than a thread and should be ideal here as we're waiting for IO) so it won't block the send_message operation.

Also the other concerns you mention like DNS errors while the old statsd ip is still reachable can be handled with some added error handling.

Goltergaul commented 1 year ago

@remeh Thanks for your comment. While your proposed mitigation works I would expect from such a library that it can reconnect itself automatically in such a case, as library user I don't want to take care of this myself.

As for the problems: For when the dns resolution fails, it could be ignored and the socket left intact. Then retry the resolution only again at the next interval. So it would not change in terms of behavior compared to what it is doing now.

As for speed considerations when the dns resolution is slow. Maybe add a low timeout, e.g.

Resolv::DNS.open do |dns|
  dns.timeouts = 0.1
  ip = dns.getaddress(hostname)
end

This would mean the thread is blocked at max 100ms every minute (not sure if that is enough though). Also this behavior could be made configurable with custom timeouts or even disable it completely. Additionally i assume the dns lookup via resolv should also profit from some low level dns caching? Someone here knows more about this? At least it is pretty fast if I try to resolve 10000 times the same domain

Or as suggested above, we could make it async with a Fiber or so (sounds like the most robust one) I'm not an expert on Fibers, but I think this will work only with ruby 3.