Thanks for opening this issue. A couple of questions to help figure out what's going on.

During this time of constant memory consumption, have you scraped your metrics at all? With distributions specifically, if metrics are not scraped, they sit in ETS until the TelemetryMetricsPrometheus.Core.Aggregator aggregates and clears them (see the note here: https://github.com/beam-telemetry/telemetry_metrics_prometheus_core/blob/main/lib/core.ex#L25-L28).
If you disable PromEx altogether, what does the memory profile look like under the same load?
If it turns out to be the ETS table growing because of no aggregations (due to no scrapes), I can set up something in the PromEx supervision tree to regularly compact that data in ETS to avoid the issue in the future.
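In the meantime, a quick way to confirm which table is growing is to list the largest ETS tables from a remote shell. This is only an illustrative sketch using plain `:ets` calls, nothing PromEx-specific:

```elixir
# List the ten largest ETS tables by memory (:ets.info/2 reports memory in words).
word_size = :erlang.system_info(:wordsize)

:ets.all()
|> Enum.map(fn tab -> {:ets.info(tab, :name), :ets.info(tab, :memory), :ets.info(tab, :size)} end)
|> Enum.reject(fn {_name, memory, _size} -> memory == :undefined end)
|> Enum.sort_by(fn {_name, memory, _size} -> memory end, :desc)
|> Enum.take(10)
|> Enum.each(fn {name, memory, size} ->
  IO.puts("#{inspect(name)}: #{memory * word_size} bytes, #{size} objects")
end)
```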
> During this time of constant memory consumption, have you scraped your metrics at all?
Probably not. I missed that in the documentation, thank you. I'm now running a test with scraping, and it looks better already; I will let it run for a while.
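A minimal sketch of exposing the metrics for scraping, assuming a Phoenix endpoint and a PromEx module named `MyApp.PromEx` (the `PromEx.Plug` options below are as documented, but double-check them against the version in use):

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # Serves the Prometheus text format, on /metrics by default.
  plug PromEx.Plug, prom_ex_module: MyApp.PromEx

  # ... the rest of the usual endpoint plugs ...
end
```

Hitting `curl http://localhost:4000/metrics` periodically then simulates the scrape that triggers the aggregation and clears the ETS entries.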
I think the current behavior is problematic:
In the end it makes the observability (of potential failures) a potential failure in itself :).
100% agree. I meant it more as a test to validate that this was actually the problem, as opposed to a solution :). I have a pretty good idea of how to solve it. If I create a branch with a fix, can you give it a test drive?
> If I create a branch with a fix, can you give it a test drive?
Yes! I'm still in the dev phase for this, and we have some monitoring on the servers, so I could identify a leak.
https://github.com/akoutmos/prom_ex/pull/51

Can you give that branch a test drive while I wrap up some unit tests? I tested it out on the example app and it looks like it is working as expected :).
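For reference, testing the PR branch only needs a `mix.exs` change; the branch name below is a placeholder, use whatever branch backs PR #51:

```elixir
# mix.exs: pull PromEx from the PR branch instead of Hex.
# "fix/ets-growth" is a placeholder branch name, not the real one.
defp deps do
  [
    {:prom_ex, github: "akoutmos/prom_ex", branch: "fix/ets-growth"}
  ]
end
```

followed by `mix deps.get` and a fresh release build.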
That PR is good to go with tests. I'll merge it in after I get the green light from you that it fixes the issue on your end. Thanks for the assistance :).
I will check tomorrow, thanks!
Sounds good! Thanks!
Describe the bug

When defining a few metrics and starting my Elixir application, the ETS table grows constantly until there is no more RAM on the server.
To Reproduce
I use PromEx with the default application and BEAM dashboards plus one custom dashboard, and I define a few metrics in prom_ex for two custom telemetry events, each fired once a second (roughly along the lines of the sketch below).
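A minimal sketch of this kind of setup, with every module, metric, and event name made up for illustration:

```elixir
defmodule MyApp.PromEx.Plugins.Worker do
  # Hypothetical custom plugin: counts two telemetry events fired by the app.
  use PromEx.Plugin

  @impl true
  def event_metrics(_opts) do
    Event.build(
      :my_app_worker_event_metrics,
      [
        counter(
          [:my_app, :worker, :tick, :count],
          event_name: [:my_app, :worker, :tick],
          description: "Number of worker ticks."
        ),
        counter(
          [:my_app, :worker, :poll, :count],
          event_name: [:my_app, :worker, :poll],
          description: "Number of worker polls."
        )
      ]
    )
  end
end

# Fired roughly once a second from the application code:
:telemetry.execute([:my_app, :worker, :tick], %{count: 1}, %{})
:telemetry.execute([:my_app, :worker, :poll], %{count: 1}, %{})
```

The custom plugin is then listed in the `plugins/0` callback of the application's PromEx module.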
I make a release and start it on a server. It goes out of memory after a while.
Expected behavior

Memory does not grow this fast and does not end up taking multiple gigabytes.
Environment
Additional context
Start of the app observed with `observer_cli`:

A few minutes later, already much more memory:

We can see that the ETS table is what takes up the memory. `Observability.Metrics` is where the PromEx metrics are defined:
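For reference, this inspection can be reproduced by attaching a remote shell to the running release and starting observer_cli from there (the release name below is a placeholder):

```elixir
# From the server, attach a remote IEx shell to the running release:
#   $ bin/my_app remote
# Then launch the observer_cli TUI:
:observer_cli.start()
```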
This is the memory graph from the application a day before this bug report:

The first slope on the left is linked to the first test deployment of the application with the metrics. Originally I also had `distribution` metrics in my code, but the memory leak is triggered without them as well.