defenseunicorns / uds-core

A FOSS secure runtime platform for mission-critical capabilities
https://uds.defenseunicorns.com
GNU Affero General Public License v3.0

Add Default Grafana Dashboards #207

Closed mjnagel closed 4 months ago

mjnagel commented 8 months ago

Is your feature request related to a problem? Please describe.

Currently UDS Core deploys with few, if any, default dashboards in Grafana. It would be beneficial if I could see some dashboards out of the box for basic information.

Describe the solution you'd like

Based on user feedback, add minimal, clean dashboards that are valuable to end users.

Describe alternatives you've considered

End users are able to create these dashboards themselves, but pulling from remote dashboards could be difficult in an airgapped environment.

mjnagel commented 8 months ago

@blancharda @ntwkninja @docandrew would be great to get some insight on what is valuable to you all as end users. I have heard this list before:

Not sure if there are other valuable pieces.

docandrew commented 8 months ago

Storage/PV use is definitely critical for us. Even though the Elastic stack isn't part of UDS Core, having dashboards available for when Elastic is deployed alongside UDS Core would be a valuable thing for us as well: https://grafana.com/grafana/dashboards/878-elasticsearch-dashboard/

mjnagel commented 8 months ago

@docandrew storage/PV is a good callout - are there dashboards you're currently using for that (published on Grafana's site or otherwise)?

I don't think we'd want to include elastic dashboards in uds-core, but we do already enable auto-adding dashboards from a configmap (see this example from loki that gets pulled in here). That might be something where your separate zarf package for elastic could include a configmap similarly to load those into grafana. We'd likely take a similar approach with other UDS Packages like gitlab, etc - if dashboards are needed for those they would be in those specific zarf packages rather than core. Helps to keep the core baseline slimmer and keep us from adding lots of conditional pieces based on what you deploy on top.

docandrew commented 8 months ago

We've run into issues with whether Grafana is using the "sidecar provider" for dashboards vs auto-adding others from configmaps. We'll just need to make sure that the configmaps for user-added dashboards have the correct annotation so the sidecar provider can pick those up (if that's how it's being used).
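For readers following along, here is a minimal sketch of what such a dashboard configmap could look like, built in Python and printed as JSON (which `kubectl apply -f` also accepts). It rests on assumptions: the upstream Grafana Helm chart's dashboard sidecar selects configmaps by a label, `grafana_dashboard` by default, rather than an annotation, and whether UDS Core keeps that default is not confirmed in this thread. The helper `dashboard_configmap`, the configmap name, the `grafana` namespace, and the dashboard body are all hypothetical.

```python
import json


def dashboard_configmap(name, namespace, dashboard,
                        label_key="grafana_dashboard", label_value="1"):
    """Build a ConfigMap manifest wrapping one Grafana dashboard JSON document.

    label_key/label_value are assumed to match the sidecar's configured
    selector; adjust them to whatever the deployment actually uses.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # The sidecar watches for configmaps carrying this label.
            "labels": {label_key: label_value},
        },
        # One key per dashboard file; the value is the dashboard JSON as a string.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


# Hypothetical placeholder dashboard; a real one would be exported from Grafana.
example_dashboard = {"title": "Persistent Volume Usage", "panels": []}
manifest = dashboard_configmap("pv-usage-dashboard", "grafana", example_dashboard)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

A separate package (for Elastic, GitLab, etc.) could ship a manifest like this alongside its own resources, which keeps package-specific dashboards out of core.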

docandrew commented 8 months ago

I can't speak to specific dashboards just yet that we're using to monitor storage, but will try and dig a bit to see what's useful.

mjnagel commented 6 months ago

Updating on current status - https://github.com/defenseunicorns/uds-core/pull/256 introduced some of the default dashboards from the upstream chart. That should address some of the key asks for:

I think I'm going to let that one roll out and see if we can solicit feedback on other things people may be looking for before introducing others.

mjnagel commented 4 months ago

Haven't heard any clear feedback yet - @blancharda and @docandrew have you all had a chance to deploy and see if you find any dashboards lacking? I know there are some changes coming with UDS Engine to provide policy + package dashboarding, so I don't believe we have plans to add those two pieces to Grafana.

docandrew commented 4 months ago

I haven't had the opportunity to look again - will try redeploying it and poking around as soon as I get some spare cycles, thanks for all the work on this!

blancharda commented 4 months ago

I'm fairly satisfied in terms of dashboard content for the moment -- if anything there are more than we probably need. The ones I use most frequently are definitely the compute resource dashboards for cluster and namespace (pod).

The networking info is nice to have when troubleshooting, and I'm sure the Loki dashboards will be useful as we attempt to tune/size our installation -- but we could probably narrow down the list in all categories.

I would note that we run into resource issues pretty frequently though. Some amount of it is obviously environment specific, but it still may be worth bumping the default resources for Prometheus and Grafana.

mjnagel commented 4 months ago

Going to tentatively close this ticket out. If anyone comes across new needs or asks to remove dashboards, feel free to open follow-ons and link this original issue. Also would welcome a specific issue on that resource problem @blancharda - I think we've encountered some issues with Prometheus in our staging environment, so that one definitely seems like a good first one to bump up.

blancharda commented 4 months ago

Tossed up #551 to start the discussion 👍