fluxcd / flux2-monitoring-example

Prometheus monitoring for the Flux control plane
https://fluxcd.io/flux/monitoring/
Apache License 2.0
No data on lots of metrics #35

Open LeGmask opened 4 months ago

LeGmask commented 4 months ago

Lots of the metrics show no data. Is this only me, or is it a configuration issue?

DerRockWolf commented 4 months ago

Probably only you and probably a configuration error 🙃

razvanphp commented 2 months ago

You probably need to run kube-state-metrics-config.

I'm also here to add that those metrics (reconciliation status of all resources) should probably be exposed by default. Also, not everybody needs Grafana, Prometheus and Loki; some of us have those already and just need the dashboards. For reference, I use TrueNAS + TrueCharts.

Can we somehow contribute this to the readme and maybe split the repo in a way that we can install some parts of it?

kingdonb commented 2 months ago

We have Flux Bug Scrubs, where the primary authors and maintainers of the monitoring config are typically in attendance. If you want to propose a PR, or discuss a failure mode that you think we could solve with a new structure, we're very friendly and will be glad to look at your idea.

The meeting is 1.5x for each Flux Dev Meeting. I think this sounds like a Bug Scrub topic more than a Dev Meeting topic, and it would be great to have anyone present your problem or idea there. We can look at your config and try to help you resolve common issues. I would not consider this a major imposition on our time even if it turns out to be specific to your setup, since it helps us as docs authors to understand what the common issues are.

https://fluxcd.io/community#meetings

razvanphp commented 2 months ago

Cool, I'll join next Wednesday!

Just for reference, I will share my pain points here; maybe they can serve as an agenda for the discussion.

Proposals:

  1. Include gotk_resource_info as a default metric in the controllers, so people can easily monitor Flux with existing Grafana installations (including grafana-cloud).
  2. Split the stacks into smaller parts, so that one can install just the parts needed, for example switching to a different Loki chart or installing kube-state-metrics separately.
  3. Bump the kube-state-metrics chart to version 5.8.0, so that the ConfigMap is created by the chart, removing the configMapGenerator over-complication.
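To make the first proposal concrete, here is a minimal sketch of what having gotk_resource_info available buys you: given scraped metrics in the Prometheus text exposition format, it lists the Flux resources that are not ready. The sample data and the label names (`customresource_kind`, `exported_namespace`, `name`, `ready`, `suspended`) are assumptions for illustration; check them against what your own kube-state-metrics configuration actually emits.

```python
import re

# Sample scrape output in the Prometheus text exposition format.
# The label names and values below are assumptions for illustration only.
SAMPLE = '''\
# HELP gotk_resource_info The current state of a Flux resource.
# TYPE gotk_resource_info gauge
gotk_resource_info{customresource_kind="Kustomization",exported_namespace="flux-system",name="apps",ready="True",suspended="false"} 1
gotk_resource_info{customresource_kind="HelmRelease",exported_namespace="monitoring",name="loki",ready="False",suspended="false"} 1
gotk_resource_info{customresource_kind="GitRepository",exported_namespace="flux-system",name="flux-system",ready="True",suspended="true"} 1
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric):
    """Return a list of (labels, value) tuples for one metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group('name') != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group('labels')))
        samples.append((labels, float(match.group('value'))))
    return samples


def not_ready(samples):
    """Return sorted 'kind/namespace/name' strings for resources whose ready label is not 'True'."""
    return sorted(
        f"{labels['customresource_kind']}/{labels['exported_namespace']}/{labels['name']}"
        for labels, value in samples
        if value == 1 and labels.get('ready') != 'True'
    )


if __name__ == '__main__':
    for resource in not_ready(parse_samples(SAMPLE, 'gotk_resource_info')):
        print(resource)
```

In a real setup the same question would be asked in PromQL against Prometheus rather than by parsing text; the sketch only shows what information the metric carries.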

Thank you for reading this! R

kingdonb commented 1 month ago

OK, thanks @razvanphp - we took some of your feedback during bug scrub this week, and we brought it back to the dev meeting, although it happened after the recording, so our discussion was not captured.

I can synthesize two or three points from that about how our docs can improve. There are a few other points about our examples, but I think the example will mostly stay as it is now, with only minor changes. It is all-in-one on purpose.

The problem with adding multiple paths in the docs is that we already have trouble keeping up with kube-prometheus-stack and examples will quickly grow stale when you don't test them. So who is going to test and maintain multiple examples of use cases intended for different people? We should not make any maintainers' job so big that one person cannot do it all. (And if there is separate configuration for kube-state-metrics, how will we ensure that we keep multiple examples in sync?)

Hopefully kube-state-metrics won't have breaking changes in the future. Maybe this section does not change and the concerns are overblown, but who knows?

I think we can make some minimal changes to support the stand-alone kube-state-metrics use case directly in the docs without running afoul of these issues, but your complaint (and mine) extends a bit deeper. There is a big gap in the Prometheus docs themselves: they are not super accessible, and they do not pick up where the Flux docs leave off in any meaningful sense. Perhaps this need should be served by some sort of company that provides Prometheus training as a core goal. I think the Flux example does not go far enough for the person who is in a green field and setting Prometheus up from scratch.

But we are not the company to provide Prometheus training. Still, we should be able to provide complete examples, including everything you need to monitor Flux and how to install the alerting rules, dashboards, etc. You should not be left on your own to pick it apart from the all-in-one example. It should be straightforward to add a PrometheusRule for an alert, or a Grafana dashboard, to an existing installation of both, and we can provide easy instructions for that case.
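As a rough illustration of what "add a PrometheusRule to an existing installation" could look like, the sketch below builds a PrometheusRule manifest for the Prometheus Operator and prints it as JSON, which `kubectl apply -f -` accepts. The alert name, the PromQL expression, the `ready` label, the resource name `flux-resources`, and the `release` selector label are all assumptions for illustration; in particular, the metadata labels have to match whatever `ruleSelector` your own Prometheus resource uses.

```python
import json


def flux_not_ready_rule(namespace, selector_labels, for_duration="10m"):
    """Build a PrometheusRule manifest that alerts on Flux resources that are not ready.

    selector_labels must match the ruleSelector of the target Prometheus
    resource, otherwise the operator will not load the rule.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "flux-resources",  # hypothetical name
            "namespace": namespace,
            "labels": dict(selector_labels),
        },
        "spec": {
            "groups": [
                {
                    "name": "flux",
                    "rules": [
                        {
                            "alert": "FluxResourceNotReady",  # hypothetical alert name
                            # Assumes gotk_resource_info carries a 'ready' label.
                            "expr": 'gotk_resource_info{ready="False"} == 1',
                            "for": for_duration,
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": (
                                    "{{ $labels.customresource_kind }} "
                                    "{{ $labels.exported_namespace }}/{{ $labels.name }} "
                                    "is not ready"
                                )
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    manifest = flux_not_ready_rule("monitoring", {"release": "kube-prometheus-stack"})
    print(json.dumps(manifest, indent=2))
```

Saved under a hypothetical file name such as `flux_rule.py`, the output could be piped straight into `kubectl apply -f -`, or committed to Git and reconciled by Flux like any other manifest.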

Anyway, we concluded at the dev meeting, after some discussion, that these things may not all belong in the Flux docs, but in a blog post instead. Blog posts are allowed to stray from project core goals, and blog posts can get stale without anybody seeing it as bit-rot. So that's the best place to provide this much detail, and as I have lots of experience with this pain point personally, I volunteer to write the post.

Briefly, I do not find the docs for the Prometheus ancillary tools very accessible. tl;dr: I went looking for AlertManager docs, found https://prometheus.io/docs/alerting/latest/overview/ and clicked through to the rest. I'm getting a strong "draw the rest of the owl" feeling from these docs, and I do not know how a beginner is supposed to get from the Flux monitoring guide to the next step that should come after it, where monitoring is configured with some basic alerts to notify you when things go wrong.

And I have configured alertmanager before (it's serviceable, it works, but it is ugly) - this is partly a Helm issue because merging arrays in values.yaml is not possible. So I wound up with a full copy of the default values.yaml with all the default alerts in it, and it has surely diverged from upstream by now. I did my best, for a first attempt; I wouldn't do it that way again.
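The array problem can be shown with a small sketch that mimics how Helm combines values: maps are merged key by key, but a list in the override replaces the list in the defaults wholesale. This is an illustration of the behaviour, not Helm's actual code, and the keys (`alerting`, `rules`, `retention`) are made up rather than taken from any real chart. It also ignores Helm's handling of null values.

```python
import copy


def merge_values(base, override):
    """Merge two values trees: dicts merge recursively, everything else
    (including lists) is replaced by the override."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Hypothetical chart defaults with two built-in alerts.
defaults = {
    "alerting": {
        "retention": "120h",
        "rules": [
            {"alert": "DefaultAlertA"},
            {"alert": "DefaultAlertB"},
        ],
    }
}

# The user only wants to add one alert of their own.
user_values = {
    "alerting": {
        "rules": [
            {"alert": "FluxResourceNotReady"},
        ]
    }
}

if __name__ == "__main__":
    merged = merge_values(defaults, user_values)
    # 'retention' survives because maps merge, but the default rules are gone.
    print(merged)
```

That is why adding a single alert through values ends up requiring a copy of every default entry in the list, and why that copy drifts from upstream over time.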

This is not only about kube-state-metrics, but it is the same thread. It's about configuration of Prometheus ecosystem tools after day 1. Some users may not want to use the umbrella chart, and some users may not use Helm values for the configuration at all. @swade1987 linked me a post from Mettle that goes even further, skipping over the CRDs as well and doing the configuration entirely with ConfigMaps and way more GitOps: https://swade1987.medium.com/monitoring-mettle-59bbbb408ff

I think I can use CRDs to do most of the configuration, rather than Helm; maybe we analyze the configuration as helm values and then show the resulting configuration as flat YAML to show how Prometheus is configured to scrape, for understanding?

And similarly, there is no straightforward path to install the Flux Grafana dashboards without copy-pasting a bunch of JSON from the monitoring guide. The marketplace makes this easy if you publish there, but we haven't. (Then someone will need to maintain the dashboard there. Maybe we can ask the person who already posted a flux2 dashboard to update it, or to hand over control of the existing entry so we can update it.) I'm not prepared to do all that now, but it could be part of this same effort. The blog should show how to add alerts, Grafana dashboards, scraped metrics, and anything else I forgot for day 2.

Let's walk people through the all-in-one monitoring example in brief, point out the things it already covers well, and discuss the handful of different ways to configure a Prometheus installation (either through values, through CRDs, or through configmaps) then pick one that's a bit divergent from what the docs already show to provide a new complete example.

Since the metrics from kube-state-metrics are so pivotal, we can provide an example installation of KSM that stands alone and show how to connect it to the metrics collectors. I think I'll actually use VictoriaMetrics for this, because I heard it is fully compatible with Prometheus; it's something that people might actually want, and moreover it won't be fully redundant with what we already show. Plus it happens to be on my way now: it is the metrics aggregator used by default in Cozystack. I am not choosing it from my own experience, so in truth I do not know at all yet how similar it will actually be.

After having written all that out, it definitely sounds like at least two different blog posts.

Anyway, if you (anyone) want to help with this, it'll be work-in-progress for a few weeks while I figure out my own mess. I'll post a PR on fluxcd/website soon and link it back here.

stefanprodan commented 1 month ago

The Flux Operator exposes metrics for all Flux resources without the need for kube-state-metrics or Prom Operator. I plan to create a dashboard for it that should work with plain Prometheus or any of the managed Prom services from AWS, GCP, Azure.