beam-telemetry / telemetry_metrics_statsd

Telemetry.Metrics reporter for StatsD-compatible metric servers
https://hexdocs.pm/telemetry_metrics_statsd
MIT License

:telemetry.execute blocks the calling process until UDP is sent #45

Closed by haljin 3 years ago

haljin commented 3 years ago

Hi,

I have noticed that when using telemetry_metrics_statsd, invoking :telemetry.execute blocks the calling process until the metric is sent over UDP. That was a bit surprising, as I thought the library would forward all the events to a TelemetryMetricsStatsd process that would send things over UDP itself?

Is that an intentional design decision? While UDP sending is usually pretty fast, it can sometimes block (especially when resolving DNS names, and especially in a K8s/Docker environment), and that means the caller is delayed by the metrics for quite a significant time. I'd like to change this, and the easiest option would be to modify this library so that a background process sends metrics over UDP rather than the caller. Please let me know whether you agree with this approach, though. :)
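
For illustration, the asynchronous model I have in mind would look roughly like the sketch below. This is not how the library works today, and the module name and host handling are simplified placeholders:

defmodule AsyncStatsdSender do
  # Sketch only: a single process owns the UDP socket, and telemetry handlers
  # merely cast formatted packets to it, so the caller is never blocked on
  # DNS resolution or the UDP send itself.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the telemetry handler, i.e. in the process that ran
  # :telemetry.execute/3 - returns immediately.
  def send_metric(packet), do: GenServer.cast(__MODULE__, {:send, packet})

  @impl true
  def init(opts) do
    {:ok, socket} = :gen_udp.open(0, [:binary, active: false])
    # opts[:host] should be an IP tuple or charlist, e.g. {127, 0, 0, 1}
    {:ok, %{socket: socket, host: opts[:host], port: opts[:port]}}
  end

  @impl true
  def handle_cast({:send, packet}, state) do
    # Any DNS resolution and the UDP send happen here, off the hot path.
    :gen_udp.send(state.socket, state.host, state.port, packet)
    {:noreply, state}
  end
end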

arkgil commented 3 years ago

Forwarding all the events for publishing to a single process (or even a pool of processes) in an asynchronous manner would easily flood this process in a busy system. Having the process executing the event send the metric over UDP applies natural backpressure and adds an overhead to each process sending a metric proportional to the number of metrics it publishes.

I understand that it might not be ideal in certain circumstances, but it's hard to decide on the best approach without more information. Would you be able to capture a profile of your system to see where most of the time is spent when sending a metric?
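
A quick way to get a first number (before a proper fprof profile) is to time a single execute call from the caller's side; the event name and measurements below are just placeholders:

# Crude timing of one metric publication as seen by the calling process.
# [:my_app, :request, :stop] and the measurement map are placeholder values.
{micros, :ok} =
  :timer.tc(fn ->
    :telemetry.execute([:my_app, :request, :stop], %{duration: 123}, %{})
  end)

IO.puts("telemetry.execute took #{micros} µs")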

haljin commented 3 years ago

Right, that makes sense.

I think the problem stems from the fact that our system is extremely latency-driven and we are doing everything we can to minimize latencies in our services.

We have previously seen weird behavior when running inside a Docker container: a lookup in the OS DNS cache would take several milliseconds (much slower than on the host machine, where it's microseconds), which meant that telemetry was delaying us a lot. After switching the BEAM's DNS lookups to its own local cache things got a lot faster - but now it seems we are reaching a limit again as we send more and more metrics. I am also a little worried that the spikes we are seeing in the 99th percentile and above could be related to this.

I'll do a bit more investigation, but maybe the solution is the one we already applied in one of our services: cache metrics locally and then use telemetry_metrics_poller to actually send them on an interval, tuning the intervals carefully so as not to overload the system. I suppose that's a bit like what folsom used to do.
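
The sketch below is roughly what I mean (the table name, keys and event name are made up): bump ETS counters on the hot path and let a poller flush them on an interval, so only the flush ever touches UDP.

defmodule MyApp.LocalMetrics do
  # Sketch of local aggregation: the hot path only bumps an ETS counter,
  # and flush/0 - run periodically by :telemetry_poller - emits one
  # telemetry event per counter for the statsd reporter to publish.
  @table :my_app_metric_counters

  def init do
    :ets.new(@table, [:named_table, :public, write_concurrency: true])
  end

  def bump(key, value \\ 1) do
    :ets.update_counter(@table, key, value, {key, 0})
  end

  def flush do
    for {key, count} <- :ets.tab2list(@table), count > 0 do
      :ets.update_counter(@table, key, -count, {key, 0})
      :telemetry.execute([:my_app, :aggregated], %{count: count}, %{key: key})
    end

    :ok
  end
end

# In the supervision tree, flushed every 5 seconds:
# {:telemetry_poller, measurements: [{MyApp.LocalMetrics, :flush, []}], period: 5_000}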

haljin commented 3 years ago

[screenshot: trace showing the DNS lookup dominating the metric send]

Here's a sample trace where I sort of reproduced it - it seems that it is, in fact, the DNS lookup that is the worst part of it.

I suppose it means I have to make the solution on our end, rather than modifying the library - that's ok, feel free to close this issue. :)

juanperi commented 3 years ago

I think I'm also facing this issue. I've been forced to remove Ecto reporting from production because it adds too much overhead. At first I thought it was because we were sending all metrics through a single GenServer (that's why I contributed the pool to this library), but that doesn't seem to be the case, or at least not the only issue. I never really thought about DNS being a problem here; I'll have to investigate it. My production is also in Kubernetes, so it might well be the exact same situation you are in right now.

haljin commented 3 years ago

@epilgrim

Here's a comparison of talking to statsd using a hostname and an IP address (both localhost):

[screenshot: hostname vs. IP address latency comparison]

The difference is quite big, and it seems to be a Docker issue with DNS resolution. The solution for us was to use an inetrc.conf file and feed it into the BEAM:

%% -- ERLANG INET CONFIGURATION FILE --
{resolv_conf, "/etc/resolv.conf"}.
{hosts_file, "/etc/hosts"}.
{cache_size, 10000}.
%% refresh every 60s
{cache_refresh, 60000}.
{lookup, [dns, native]}.
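
(The file can be pointed at with the ERL_INETRC environment variable or the -kernel inetrc VM flag. From a remote console you can check that it actually took effect, for example:)

# Sanity check in a remote iex session: :inet.get_rc/0 returns the inet
# configuration currently loaded by the VM.
:inet.get_rc()
# should include the cache_size, cache_refresh and lookup entries from above

# Timing a lookup ("statsd" is a placeholder hostname) - once the cache is
# warm this should drop to a handful of microseconds.
{micros, _result} = :timer.tc(fn -> :inet.gethostbyname(~c"statsd") end)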

Hope it helps. :)

Edit:

This is from our actual deployment in K8s:

[screenshot: trace from our K8s deployment]

That's several milliseconds spent just doing metrics. So either the DNS name has to be resolved once and an IP fed into telemetry_metrics_statsd (but if the statsd container redeploys, we're stuck), or the resolution has to happen in the background.

Oh and that is already with the above fix - without it the situation is even worse.

haljin commented 3 years ago

Perhaps something that could help here would be to add a function for changing the statsd host dynamically through a call to TelemetryMetricsStatsd? That way we could configure it with an IP address but have a background process that does DNS resolution and updates it every now and then. 🤔 @arkgil, what do you think?
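
Something along these lines is what I have in mind. Note that update_host/3 below does not exist in the library - it's purely a sketch of the function I'm proposing, and the hostname, reporter name and interval are placeholders:

defmodule MyApp.StatsdHostRefresher do
  # Background process that re-resolves the statsd hostname on an interval
  # and pushes the result into the reporter via a proposed (not yet existing)
  # TelemetryMetricsStatsd.update_host/3.
  use GenServer

  @hostname ~c"statsd.monitoring.svc.cluster.local"  # placeholder
  @interval :timer.minutes(1)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_) do
    send(self(), :resolve)
    {:ok, nil}
  end

  @impl true
  def handle_info(:resolve, state) do
    case :inet.getaddr(@hostname, :inet) do
      {:ok, ip} ->
        # Proposed API - not part of the library at the time of writing.
        TelemetryMetricsStatsd.update_host(MyReporter, ip, 8125)

      {:error, _reason} ->
        # Keep using the previously resolved address on failure.
        :ok
    end

    Process.send_after(self(), :resolve, @interval)
    {:noreply, state}
  end
end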

arkgil commented 3 years ago

Thank you both for contributing to the discussion, this is very valuable data!

Reconfiguring the reporter in flight would make sense, but would require reattaching all the event handlers, meaning that some metrics would be lost. Still, that might be a valid solution.

What would be interesting to see is the mix of synchronous and asynchronous reporting implemented by Elixir's Logger, i.e. sending metrics asynchronously but switching to a synchronous mode once the message queue grows over a configured length.
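
For reference, the switch Logger makes can be approximated by checking the sender's mailbox before each report; the threshold and message shapes below are made up:

defmodule MaybeAsyncReport do
  # Logger-style overload handling, sketched: report asynchronously while the
  # sender's mailbox is short, and fall back to a synchronous call (which
  # applies backpressure on the caller) once it grows past a threshold.
  @sync_threshold 32  # made-up value

  def report(sender_pid, packet) do
    case Process.info(sender_pid, :message_queue_len) do
      {:message_queue_len, len} when len < @sync_threshold ->
        GenServer.cast(sender_pid, {:send, packet})

      _ ->
        GenServer.call(sender_pid, {:send, packet})
    end
  end
end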

haljin commented 3 years ago

I can whip up a PR for the reconfiguration. I'm not so familiar with the internals of Logger, but I'll also take a look. How about having a configuration option for now that allows either async or sync sending of metrics? Given that there's already a pool, some system configurations (I hope ours) should be able to handle the load.

arkgil commented 3 years ago

@haljin I looked at #46 and I'm not sure the async send is the approach we should follow. Accumulating metric updates in the message queues of processes sending them over UDP looks very expensive in terms of memory. It seems like with your requirements you might be better off aggregating metrics inside the BEAM, e.g. by using https://github.com/beam-telemetry/telemetry_metrics_prometheus, or using the approach with poller you described before.

Let's focus on reducing the overhead of DNS lookups, as this is what's hurting performance the most. I like the idea of dynamically updating the IP/port, but maybe we can go with something more automatic.

haljin commented 3 years ago

That's fair enough, @arkgil! I was going to build a DNS poller on top of the library, but I suppose we can add it inside instead and have the TelemetryMetricsStatsd process do DNS resolution on an interval and update its configuration. I'd make the resolution itself and the polling period configurable. Should I keep the function for updating it manually? What do you think? :)
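
To make it concrete, the configuration I'm imagining would look something like the snippet below - the host_resolution_interval option name is only a placeholder at this point:

# :metrics, :host and :port are existing options; the resolution interval
# is the proposed addition, with a placeholder name and hostname.
TelemetryMetricsStatsd.start_link(
  metrics: metrics,
  host: "statsd.monitoring.svc.cluster.local",
  port: 8125,
  # re-resolve the hostname on this interval (proposed option)
  host_resolution_interval: 60_000
)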