Closed: simondeziel closed this issue 1 year ago
Given that the lxd databag is empty, this must be happening on the provider end (i.e. LXD). Prometheus would not be allowed by Juju to wipe data from the remote app databag.
Can you point me to the code that generates your scrape targets? Likely, it is being run multiple times with different state, leading to an empty databag.
Possibly! I just double-checked: `self.metrics_endpoint.update_scrape_job_spec(jobs=jobs)` is called only once, by `lxd_update_metrics_endpoint_scrape_job`, which logs a message when it runs, and I only see that message once.
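For reference, the call pattern described above can be sketched as follows. This is purely illustrative: `FakeMetricsEndpoint` stands in for the real `MetricsEndpointProvider`, and only the method name `update_scrape_job_spec` matches the library; the handler body and target list are made up.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class FakeMetricsEndpoint:
    """Stand-in for MetricsEndpointProvider: records the jobs it was given."""

    def __init__(self):
        self.jobs = None
        self.calls = 0

    def update_scrape_job_spec(self, jobs):
        self.calls += 1
        self.jobs = jobs


def lxd_update_metrics_endpoint_scrape_job(metrics_endpoint, is_leader, targets):
    """Only the leader pushes scrape jobs, since only it knows which targets are ready."""
    if not is_leader:
        return
    jobs = [{"static_configs": [{"targets": targets}]}]
    # Log once so it is easy to verify the method is only invoked a single time.
    logger.info("Updating scrape jobs: %s", jobs)
    metrics_endpoint.update_scrape_job_spec(jobs=jobs)


endpoint = FakeMetricsEndpoint()
lxd_update_metrics_endpoint_scrape_job(endpoint, is_leader=True, targets=["10.0.0.1:9100"])
```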
The PoC charm is available at https://sdeziel.info/pub/lxd_ubuntu-22.04-amd64.charm. Thanks for looking into this, it's much appreciated!
@simskij An important point I forgot to mention: the LXD charm is a machine charm.
I haven't quite had a chance to reproduce this yet, but I did have a chance to go through the charm code after unpacking it.
There is a `lookaside_jobs_callable` constructor arg which takes a `Callable` whose result is added onto the jobs list. `MetricsEndpointProvider(..., lookaside_jobs_callable=self.lxd_update_metrics_endpoint_scrape_job)` is probably a viable workaround.
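A minimal sketch of how that workaround behaves, assuming the provider extends its own jobs with whatever the callable returns. The class below is a stand-in, not the real library code, and the job dict is illustrative:

```python
class MetricsEndpointProviderStub:
    """Stand-in illustrating how a lookaside_jobs_callable is consulted."""

    def __init__(self, jobs=None, lookaside_jobs_callable=None):
        self._jobs = jobs or []
        self._lookaside = lookaside_jobs_callable

    def rendered_jobs(self):
        # Static jobs first, then anything the lookaside callable supplies.
        jobs = list(self._jobs)
        if self._lookaside:
            jobs.extend(self._lookaside())
        return jobs


def lxd_scrape_jobs():
    # The charm decides at call time which targets are ready to be scraped.
    return [{"job_name": "lxd", "static_configs": [{"targets": ["10.0.0.1:9100"]}]}]


provider = MetricsEndpointProviderStub(lookaside_jobs_callable=lxd_scrape_jobs)
jobs = provider.rendered_jobs()
```

Because the callable is re-evaluated each time jobs are rendered, a later constructor re-run cannot clobber what it contributes.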
That said, this is a bug in our code. Every time any event fires for any reason, the constructor is re-invoked, and it clobbers the databag as the last thing it does. It was introduced as part of this PR, but it was an unreliable, hacky solution to event ordering, relying on `external_url` being updated by an ingress relation-changed event after it was already used in the constructor.
Multiple patches have completely eliminated the need for this, and it should be removed here also.
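The clobbering described above can be demonstrated with a stand-alone sketch. Nothing here is real library code; `BuggyProvider` and the plain-dict databag are invented to show the failure mode:

```python
app_databag = {}


class BuggyProvider:
    """Illustrates the bug: the constructor unconditionally publishes its jobs
    into the relation databag, so a re-init during an unrelated event
    (with empty state) wipes previously published jobs."""

    def __init__(self, databag, jobs=None):
        self._databag = databag
        # Last thing the constructor does: write whatever it was given.
        self._databag["scrape_jobs"] = jobs or []


# First event: the charm has built its jobs and passes them in.
BuggyProvider(app_databag, jobs=[{"static_configs": [{"targets": ["10.0.0.1:9100"]}]}])
after_first = list(app_databag["scrape_jobs"])

# Later, an unrelated event fires; the constructor runs again with no jobs
# and clobbers the databag.
BuggyProvider(app_databag)
after_second = list(app_databag["scrape_jobs"])
```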
@rbarry82 many thanks, and yes, I'm now using `lookaside_jobs_callable`, which avoids the resetting behaviour. I'm still trying to find a way to not have the default `scrape_jobs` appended to the list, but at least `lookaside_jobs_callable` seems to be the right way for us.
I'll look into this come Monday and will update this issue. Thanks again!
Happy to help! Admittedly, that was added for a different kind of discovery, but I'm glad it works here.
In general, we don't really want to support "arbitrary" endpoint discovery straight to Prometheus (HTTP-based, k8s service discovery, or the other "usual" Prometheus methods) because they can't be represented in Juju models in any meaningful way, which makes it hard or impossible to export/import models.
However, I don't think the initial design considered the possibility of a client where we'd want an in-between use case. There's `prometheus-scrape-target-k8s-operator`, but that's intended to add arbitrary/non-Juju endpoints (not ones which are dynamically built), and it's something we could/should support.
Arbitrarily, if `lookaside_jobs_callable` is truthy and `jobs` is falsy, this is an easy conditional (it would also need to be initialized to `DEFAULT_JOB`, but that's easy for you to override with `None` or `[]` or whatever), but it's something we should do anyway. @simskij thoughts?
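That conditional could be sketched roughly like this. It is a stand-alone approximation of the suggestion, not the library's actual code, and the `DEFAULT_JOB` value here is illustrative:

```python
# Illustrative placeholder; the real library default may differ.
DEFAULT_JOB = {"metrics_path": "/metrics", "static_configs": [{"targets": ["*:80"]}]}


def effective_jobs(jobs, lookaside_jobs_callable=None):
    """Skip DEFAULT_JOB when a lookaside callable is set and no explicit
    jobs were passed; otherwise fall back to the default as before."""
    if lookaside_jobs_callable and not jobs:
        jobs = []
    elif not jobs:
        jobs = [DEFAULT_JOB]
    result = list(jobs)
    if lookaside_jobs_callable:
        result.extend(lookaside_jobs_callable())
    return result
```

With this in place, a charm supplying only a lookaside callable would no longer see the default job appended to its list.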
Bug Description
We have a charm that deals with LXD clusters. The intent is to have the app leader (`lxd/8*` in this example) send the desired `scrape_jobs` to Prometheus. We only want the leader to send the `scrape_jobs` because it knows which `targets` are ready to be scraped. This works initially, where Prometheus receives this:
Which translates to this config:
However after some time where nothing's done by the operator/admin, the following happens:
Which causes Prometheus to lose the `scrape_jobs`:

Which causes the Prometheus config to be reverted:
To Reproduce
juju deploy -m test ./lxd_ubuntu-22.04-amd64.charm --config lxd-listen-https=true --config mode=cluster
juju integrate -m test lxd:metrics-endpoint prometheus-scrape:metrics-endpoint
Environment

cos model:

test model:

Relevant log output
Additional context
It's quite possible I'm using the `prometheus-k8s` `prometheus_scrape` lib wrong... help pointing out where I'm using it wrong would be greatly appreciated!