akoutmos / prom_ex

An Elixir Prometheus metrics collection library built on top of Telemetry with accompanying Grafana dashboards
MIT License

[BUG] Memory leak via the ETS tables #49

Closed pallix closed 3 years ago

pallix commented 3 years ago

Describe the bug: When defining a few metrics and starting my Elixir application, the ETS table grows constantly until there is no more RAM on the server.

To Reproduce

I use PromEx with the default application and beam dashboards and one custom dashboard. I define these metrics in prom_ex:


counter("my_app.msg.in.count",
        description: "The total number of incoming msgs.",
        event_name: [:my_app, :msg, :in]
        ),
counter("my_app.msg.out.count",
        description: "The total number of outgoing msgs.",
        event_name: [:my_app, :msg, :out]
        )
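
For reference, a minimal sketch of how these counters are wired into a custom PromEx plugin (placeholder module names, not my exact code):

# Sketch with placeholder module names, not the exact application code.
defmodule MyApp.PromEx.MsgPlugin do
  use PromEx.Plugin

  @impl true
  def event_metrics(_opts) do
    # Event.build/2 groups the Telemetry.Metrics definitions under one name.
    Event.build(:my_app_msg_event_metrics, [
      counter("my_app.msg.in.count",
        description: "The total number of incoming msgs.",
        event_name: [:my_app, :msg, :in]
      ),
      counter("my_app.msg.out.count",
        description: "The total number of outgoing msgs.",
        event_name: [:my_app, :msg, :out]
      )
    ])
  end
end

# The plugin is listed next to the default Application and Beam plugins:
defmodule MyApp.PromEx do
  use PromEx, otp_app: :my_app

  @impl true
  def plugins do
    [
      PromEx.Plugins.Application,
      PromEx.Plugins.Beam,
      MyApp.PromEx.MsgPlugin
    ]
  end
end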

Once a second, this event is fired:

%Msg{serial: serial, timestamp: timestamp} = msg
meta = %{serial: serial}
latency = System.os_time(:microsecond) - timestamp
measurement = %{latency: latency}
:telemetry.execute([:my_app, :msg, :in], measurement, meta)

and once a second, this one:

%Msg{serial: serial, timestamp: timestamp} = msg
meta = %{serial: serial}
latency = System.os_time(:microsecond) - timestamp
measurement = %{latency: latency}
:telemetry.execute([:my_app, :msg, :out], measurement, meta)

I make a release and start it on a server. It goes out of memory after a while.

Expected behavior: Memory does not grow this fast, and certainly not to the point of taking multiple gigabytes.

Environment

Additional context

Start of the app observed with observer_cli:

[Screenshot: observer1]

A few minutes later, already much more memory:

[Screenshot: observer2]

We can see that the ETS table is taking up the memory. Observability.Metrics is where the PromEx stuff is defined:

[Screenshot: obs3]

This is the memory graph from the application a day before this bug report: [Screenshot: checkmk(1)]

The first slope on the left corresponds to the first test deployment of the application with the metrics. Originally I also had distribution metrics in my code, but the memory leak is triggered without them as well.

akoutmos commented 3 years ago

Thanks for opening up this issue. A couple of questions to help figure out what's going on:

During this time of constant memory consumption, have you scraped your metrics at all? With distributions specifically, if metrics are not scraped, they sit in ETS until the TelemetryMetricsPrometheus.Core.Aggregator aggregates and clears them (see the note here: https://github.com/beam-telemetry/telemetry_metrics_prometheus_core/blob/main/lib/core.ex#L25-L28).
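
As a quick check, you can force an aggregation pass manually from IEx by scraping. A minimal sketch (the registered collector name below is an assumption; check your supervision tree for the name the core collector was started under):

# Sketch: scraping renders the metrics text and, for distributions, runs the
# Aggregator, which clears the buffered samples out of ETS.
# Assumption: the core collector is registered under MyApp.PromEx.Metrics;
# substitute the actual registered name from your supervision tree.
collector_name = MyApp.PromEx.Metrics
TelemetryMetricsPrometheus.Core.scrape(collector_name)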

If you disable PromEx altogether, what does the memory profile look like under the same load?

If it turns out to be the ETS table growing because of no aggregations (due to no scrapes), I can set up something in the PromEx supervision tree to regularly compact that data in ETS to avoid the issue in the future.
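
Something along these lines (a rough sketch of the idea only, not the actual implementation): a small process in the supervision tree that periodically forces a scrape so that buffered distribution data in ETS gets aggregated and cleared even when nothing is polling the endpoint.

# Rough sketch of the idea only, not the actual PromEx fix.
defmodule MyApp.MetricsFlusher do
  use GenServer

  @flush_interval :timer.seconds(60)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Assumption: the registered name of the TelemetryMetricsPrometheus.Core
    # collector that PromEx starts for the application.
    collector_name = Keyword.fetch!(opts, :collector_name)
    schedule_flush()
    {:ok, %{collector_name: collector_name}}
  end

  @impl true
  def handle_info(:flush, state) do
    # scrape/1 renders the metrics and runs the Aggregator, which clears the
    # buffered samples from ETS; the rendered text is simply discarded here.
    _ = TelemetryMetricsPrometheus.Core.scrape(state.collector_name)
    schedule_flush()
    {:noreply, state}
  end

  defp schedule_flush, do: Process.send_after(self(), :flush, @flush_interval)
end

It would be started under the application supervisor with something like {MyApp.MetricsFlusher, collector_name: MyApp.PromEx.Metrics}.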

pallix commented 3 years ago

During this time of constant memory consumption, have you scraped your metrics at all?

Probably not. I missed that in the documentation, thank you. I'm now running a test with scraping. It looks better already; I will let it run for a while.
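
For reference, the scraping test is just the standard PromEx metrics endpoint wired into the Phoenix endpoint so Prometheus can poll it, roughly like this sketch (placeholder module names, not my actual setup):

# Sketch with placeholder module names: expose the PromEx metrics so
# Prometheus can scrape them (PromEx.Plug serves them at /metrics by default).
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  plug PromEx.Plug, prom_ex_module: MyApp.PromEx

  # ... the rest of the usual endpoint plugs ...
end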

I think the current behavior is problematic:

In the end, it makes the observability (of potential failures) a potential failure in itself :).

akoutmos commented 3 years ago

100% agree. I meant it more as a test to validate that that was actually the problem, as opposed to a solution :). I have a pretty good idea as to how to solve the problem. If I create a branch with a fix, can you give it a test drive?

pallix commented 3 years ago

If I create a branch with a fix can you give it a test drive?

Yes! I'm still in the dev phase for this, and we have some monitoring on the servers, so I could identify a leak.

akoutmos commented 3 years ago

https://github.com/akoutmos/prom_ex/pull/51 Can you give that branch a test drive while I wrap up some unit tests? I tested it out on the example app and it looks like it is working as expected :).

akoutmos commented 3 years ago

That PR is good to go with tests. I'll merge it in after I get the green light from you that it fixes the issue on your end. Thanks for the assistance :).

pallix commented 3 years ago

I will check tomorrow, thanks!

akoutmos commented 3 years ago

Sounds good! Thanks!