autometrics-dev / autometrics-py

Easily add metrics to your code that actually help you spot and debug issues in production. Built on Prometheus and OpenTelemetry.
https://autometrics.dev
Apache License 2.0

INVESTIGATE: Setting build info with `init` can cause duplicate build_info gauges that break autometrics queries #88

Open brettimus opened 11 months ago

brettimus commented 11 months ago

When you set a value such as a version via init, I'm fairly sure we end up registering two build_info gauges: one with a version label and one without. Both are set within a short period of time.

The result is that our group_left query, which joins build info to function metrics, gets a 422 from Prometheus with an error like:

Error executing query: 
found duplicate series for the match group {instance="localhost:8082", job="am_0"} on the right hand-side of the operation: 

[
  {__name__="build_info", instance="localhost:8082", job="am_0", service_name="autometrics", version="0.0.1"},
  {__name__="build_info", instance="localhost:8082", job="am_0", service_name="autometrics"}
];

many-to-many matching not allowed: matching labels must be unique on one side
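
To make the error concrete, here is a minimal pure-Python sketch of the uniqueness check Prometheus is complaining about. It does not call Prometheus. The helper name `find_duplicate_match_groups` is hypothetical, and the matching labels (`instance`, `job`) are assumed from the match group shown in the error above.

```python
from collections import defaultdict


def find_duplicate_match_groups(series, on=("instance", "job")):
    """Group series by the matching labels; return groups with more than one member."""
    groups = defaultdict(list)
    for labels in series:
        # A missing label is treated like an empty value, as Prometheus does.
        key = tuple(labels.get(name, "") for name in on)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# The two build_info series from the error message above.
build_info = [
    {"instance": "localhost:8082", "job": "am_0",
     "service_name": "autometrics", "version": "0.0.1"},
    {"instance": "localhost:8082", "job": "am_0",
     "service_name": "autometrics"},
]

duplicates = find_duplicate_match_groups(build_info)
for key, members in duplicates.items():
    print(f"match group {key} has {len(members)} series on the right-hand side")
```

Because both series share the same `instance` and `job`, the right-hand side of the join is no longer unique for that match group, which is the condition the error describes.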

The culprit is likely:

def default_tracker():
    """Setup the default tracker."""
    preferred_tracker = get_tracker_type()
    return init_tracker(preferred_tracker)

tracker: TrackMetrics = default_tracker()

We initialize a tracker out of the box. When a user calls init, they effectively "re-initialize" the tracker with new build information, which sets a new build info gauge.

Need to confirm though.
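
A toy model of the suspected sequence, to show how two label sets could end up exported. This is not the autometrics code: `ToyGauge`, `BUILD_INFO`, and this simplified `init_tracker` signature are stand-ins made up for illustration, and the assumption is that an unset version results in a series without a version label.

```python
class ToyGauge:
    """Stand-in for a labelled gauge: one stored value per distinct label set."""

    def __init__(self):
        self.series = {}

    def set(self, labels, value):
        # Each distinct label set becomes its own exported series.
        self.series[tuple(sorted(labels.items()))] = value


BUILD_INFO = ToyGauge()


def init_tracker(version=""):
    """Hypothetical tracker setup that records build info as a side effect."""
    labels = {"service_name": "autometrics"}
    if version:
        labels["version"] = version
    BUILD_INFO.set(labels, 1)


init_tracker()                 # default tracker created at import time
init_tracker(version="0.0.1")  # user calls init with a version

for label_set in BUILD_INFO.series:
    print(dict(label_set))
```

In this model the first call leaves behind a series without a version label, and the second call adds a new series rather than replacing it, so both are exported.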

brettimus commented 11 months ago

So this might end up being pretty edge-case-y.

The original scenario that reproduced this for me was:

My working theory is that this issue will only occur if you call init with a version, without also changing the tracker.

In such a case, the tracker would record two separate build_info metrics within a short period of time (less than the interval we use in our queries).

This raises the question: could we use a concept of a "clearmode" internally on the build_info gauge?

Basically, last-in-wins for build information. If you record a build_info metric, it will remove any other previously recorded build_info metrics that your app is exporting.

This could pose other issues (I haven't fully thought this through yet), but it would provide a stronger guard at the library level against corrupting our build info queries.
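
A rough sketch of what last-in-wins could look like, using a toy gauge rather than a real metrics library. `LastInWinsGauge` is a hypothetical name, and how the actual trackers would drop old label sets is not settled here.

```python
class LastInWinsGauge:
    """Toy gauge that keeps only the most recently recorded label set."""

    def __init__(self):
        self.series = {}

    def set(self, labels, value):
        # Drop every previously recorded label set before storing the new one.
        self.series.clear()
        self.series[tuple(sorted(labels.items()))] = value


build_info = LastInWinsGauge()
build_info.set({"service_name": "autometrics"}, 1)                      # default init
build_info.set({"service_name": "autometrics", "version": "0.0.1"}, 1)  # user init

for label_set in build_info.series:
    print(dict(label_set))
```

With this behavior only the versioned series would remain after the second call, so the join would see a single build_info series per match group. One trade-off to consider is that anything else relying on the earlier label set would lose it.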