akoutmos / prom_ex

An Elixir Prometheus metrics collection library built on top of Telemetry with accompanying Grafana dashboards
MIT License
596 stars 104 forks source link

[BUG] Ecto dashboard not showing the repo name #190

Open pallix opened 1 year ago

pallix commented 1 year ago

Describe the bug

When using the ecto plugin, the drop list menu on the dashboard does not show the repo:

grafana

Environment

wkpatrick commented 1 year ago

Did you make sure that the PromEx application is first in the children in application.ex? It will not catch the init events if it does not start before ecto/oban/etc.

pallix commented 1 year ago

Did you make sure that the PromEx application is first in the children in application.ex? It will not catch the init events if it does not start before ecto/oban/etc.

Thanks for reaching out. No, it's not always started as first child in our applications. So it needs the init events to find the repo names?

wkpatrick commented 1 year ago

Correct, I had the exact same behavior, which I fixed by putting PromEx as the first child to be started.

yordis commented 1 year ago

Are we sure about that, @wkpatrick? I am not disagreeing, but I am getting confused by the code:

The telemetry attachment should send the repo tag_values, which is calculated here:

https://github.com/akoutmos/prom_ex/blob/6a002fb029cf00eeadfcfef7ce984cd7a0f83493/lib/prom_ex/plugins/ecto.ex#L246

What could happen is that you have yet to receive any information; therefore, that list is empty.

But I don't see any requirement about the order of the processes.

You do need a :ecto_repos configuration under your otp_app: https://github.com/akoutmos/prom_ex/blob/6a002fb029cf00eeadfcfef7ce984cd7a0f83493/lib/prom_ex/plugins/ecto.ex#L55 or pass :repos options https://github.com/akoutmos/prom_ex/blob/6a002fb029cf00eeadfcfef7ce984cd7a0f83493/lib/prom_ex/plugins/ecto.ex#L54.

@pallix, please confirm,

  1. You are setting up the repos configuration for the Ecto plugin, or if you are not using the repos configuration, you have a configured ecto_repos under the namespace of the otp_app named pass a configuration.

  2. You ran the application and generated metrics to look at.

wkpatrick commented 1 year ago

@yordis

The ecto dashboard parses the repo name out of the init metrics label_values(<repo name>_prom_ex_ecto_repo_init_status_info, repo) Which are event based and defined here: https://github.com/akoutmos/prom_ex/blob/6a002fb029cf00eeadfcfef7ce984cd7a0f83493/lib/prom_ex/plugins/ecto.ex#L74

The event is only fired once when the Repo supervisor starts. Therefore, if the prom_ex application starts after Ecto, it will not have received the telemetry event for init. Which means it does not generate the promethus metric format for the init, so the dashboard is unable to generate the label values for it. https://github.com/elixir-ecto/ecto/blob/05e4669ab28a7559ff5fcc0af7216faf9df7b997/lib/ecto/repo/supervisor.ex#L179

yordis commented 1 year ago

Right right, but that is true only for the init, right?! (just confirming) but other metrics should eventually fill up that field, no?

Being said, probably prudent to initiate PromEx before, regardless.

@pallix please confirm that reordering the processes fix the issue.

wkpatrick commented 1 year ago

Nope. _prom_ex_ecto_repo_init_status_info is only filled on init. Other metrics contain the repo name, but only init_status_info is parsed to generate the labels in the dashboard.

pallix commented 1 year ago

@pallix please confirm that reordering the processes fix the issue.

I will try to test it this week! Thank you.

pallix commented 1 year ago

The suggested fix works, than you very much :hugs:

yordis commented 1 year ago

@pallix do you mind creating a PR updating the docs, or closing the issue?

pallix commented 1 year ago

@pallix do you mind creating a PR updating the docs, or closing the issue?

Shouldn't it be solve at the code level and the order of the children made irrelevant? or it's not practical?

yordis commented 1 year ago

🤷🏻 I am not sure. I think we can take baby steps on the topic. The documentation should be enough for now, and I will commit to fixing it. The following person could do that if you don't have time.

@wkpatrick is more knowledgeable here, so maybe he knows some limitations and whatnot.

Give it a try if you want to 🚀

yordis commented 1 year ago

Peeps, any updates over here?!

akoutmos commented 1 year ago

@pallix do you mind creating a PR updating the docs, or closing the issue?

Shouldn't it be solve at the code level and the order of the children made irrelevant? or it's not practical?

Given that the Telemetry events are synchronous and not captured/stored by any other mechanism, the PromEx supervision tree must be started prior to starting the Repo supervision trees or else the Telemetry events will be missed with no way to retroactively get their measurements and metadata. I'll have to look into whether there is a better way to capture Ecto init data (perhaps don't even attach to the init event and just query the repo module directly for the same data), but for now PromEx should come first in the supervision tree. If anyone has some bandwidth, a documentation update would be appreciated!

yordis commented 1 year ago

@akoutmos do you think we should tell people to start the PromEx supervisor as soon as possible? Even before forming clusters? or what????

pallix commented 1 year ago

I will try to find time for a pull request in the next days

pallix commented 1 year ago

Actually it's already in the doc since two years :-):

(be sure to add it to the top of the supervisor children list so that you do not miss any init-style events from other processes like Ecto.Repo for example)

I suggest the two followings changes to improve the doc:

I will adress both with a PR.