akoutmos / prom_ex

An Elixir Prometheus metrics collection library built on top of Telemetry with accompanying Grafana dashboards
MIT License

[BUG] Cached metrics makes us crazy #234

Closed. ViseLuca closed this issue 1 month ago.

ViseLuca commented 4 months ago

Hi, we have a system with several cursors, and we calculate metrics on how far behind each cursor is from the last events emitted. We built a custom plugin for this, but the system sometimes gets stuck with stale values in the cache and does not move forward until we restart the Kubernetes pods.

What can we do about this?

akoutmos commented 4 months ago

Hello Luca!

Unfortunately I don't have enough information to help you with this one. Could you put together a repo that minimally reproduces the issue you are seeing? Do you think it is a problem with the plugin that you created, as opposed to PromEx the library? A shot-in-the-dark guess would be that you want to use a gauge (via last_value https://hexdocs.pm/telemetry_metrics/Telemetry.Metrics.html#last_value/2) in your plugin if you just want the last value of some measurement.
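For illustration, a minimal last_value sketch that would sit in your plugin's metric list; the metric and event names here are placeholders rather than anything from your plugin, and it assumes Telemetry.Metrics is imported:

last_value(
  [:my_app, :cursor, :delay],
  event_name: [:my_app, :cursor, :delay],
  measurement: :cursor_delay,
  description: "Most recent cursor delay reported by the poller",
  tags: [:cursor_name]
)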

Happy to help, but I need some more information :).

ViseLuca commented 4 months ago

We have an event store with events and cursors:

So a table like:

Table: Events

| event_name | offset |
|------------|--------|
| Event1     | 1      |
| Event1     | 2      |
| Event2     | 3      |

Every cursor has a set of event types that it can read and process.

Table: Cursors

| name    | event_types      | last_offset_read |
|---------|------------------|------------------|
| Cursor1 | [Event1]         | 0                |
| Cursor2 | [Event2]         | 0                |
| Cursor3 | [Event1, Event2] | 0                |

With our plugin we check how many events each cursor still has to process to reach the end. So in this example the metrics would be:

promex_cursor_1 2 (Event1 events)
promex_cursor_2 1 (Event2 events)
promex_cursor_3 3 (Event1 and Event2 events)

We query the DB to check how many events are still available after last_offset_read. Sometimes the metric keeps returning the same value for a while and then changes. The problem is that we have alerts on those metrics, and we sometimes get false positive alerts.
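For context, a hypothetical sketch of what that count query could look like with Ecto; EventSchema and its fields are assumptions based on the tables above, not our actual schema:

def count_events_not_processed_yet(cursor_id) do
  import Ecto.Query, only: [from: 2]

  # Assumed schema: events carry an event_name and a monotonically increasing offset
  cursor = Repo.get!(CursorSchema, cursor_id)

  count =
    from(e in EventSchema,
      where: e.event_name in ^cursor.event_types,
      where: e.offset > ^cursor.last_offset_read,
      select: count()
    )
    |> Repo.one()

  {:ok, count}
end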

Do you have any idea about it?

The problem is that this does not happen locally, but in staging, with all the cursors running (~50 cursors), the metrics are sometimes stuck or missing entirely. I'm polling every 10 seconds.

I am already using last_value:

defp cursor_delay_metrics(metric_prefix, poll_rate) do
  Polling.build(
    :cursor_delay_polling,
    poll_rate,
    {__MODULE__, :cursor_delay_metrics_metrics, []},
    [
      last_value(
        metric_prefix ++ @event_delay_cursor,
        event_name: @event_delay_cursor,
        description: "The number of events the cursor must still process to be aligned",
        measurement: :cursor_delay,
        tags: [:cursor_name]
      )
    ]
  )
end

and this is the function that calculates the metrics:

def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.map(fn %{id: id, name: name} ->
    # Note: this match raises if the count query times out or returns an error
    {:ok, count} = Cursors.count_events_not_processed_yet(id)

    {name, count}
  end)
  |> Enum.each(fn {name, count} ->
    :telemetry.execute(
      @event_delay_cursor,
      %{cursor_delay: count},
      %{cursor_name: name}
    )
  end)
end
ViseLuca commented 4 months ago

Sometimes only a few metrics are present, with the cursor ones missing.

I am thinking: we have 3 pods on k8s. Is it possible that the PromEx process is only starting on one of those 3, and that the call only succeeds on 1 of the 3?

fedme commented 4 months ago

@ViseLuca could it be related to this issue I just opened https://github.com/akoutmos/prom_ex/issues/236?

We are observing the same thing, and I have pinpointed it to errors thrown from within the MFA callback that collects the metric in the plugin.

ViseLuca commented 4 months ago

@fedme it could be. The DB query was sometimes timing out, so it was raising an error and getting stuck for the same reason. It fits technically.
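One possible workaround until a fix lands is to catch errors inside the polling callback so a single failed query does not take the handler down. A sketch based on the snippets above (just defensive wrapping, not a PromEx feature):

def cursor_delay_metrics_metrics do
  CursorSchema
  |> Repo.all()
  |> Enum.each(fn %{id: id, name: name} ->
    try do
      {:ok, count} = Cursors.count_events_not_processed_yet(id)

      :telemetry.execute(@event_delay_cursor, %{cursor_delay: count}, %{cursor_name: name})
    rescue
      # A DB timeout (or a non-{:ok, _} return) raises here; skip this cursor
      # for this poll instead of crashing the callback and freezing the metric.
      _error -> :ok
    end
  end)
end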

akoutmos commented 1 month ago

Closing this ticket for now as a release will be cut soon with the ability to not detach the polling job when an error is encountered (example in #236).