dora-metrics / pelorus

Automate the measurement of organizational behavior
https://pelorus.readthedocs.io/
Apache License 2.0

Make Pelorus Work with OpenShift User Workload Monitoring #364

Open eformat opened 2 years ago

eformat commented 2 years ago

It would be a very common requirement for User Workload Monitoring and Pelorus to be deployed in the same cluster.

When user workload monitoring is deployed:

For reference, we already do this for the TL500 tech exercises, which are based on this doc, code links

So it could be possible to refactor the approach in Pelorus as well.

etsauer commented 2 years ago

relates to #369

etsauer commented 2 years ago

@eformat FYI, this won't be a quick fix, but we are talking about/considering removing the upstream operators from Pelorus and just managing individual deployments of prometheus/grafana.

eformat commented 2 years ago

@etsauer ah right. so not touching/interacting with user workload monitoring at all?

yeah, one thing that struck me is that i really want the delivery metrics from pelorus to be long lived (months, years), whereas the thousands of platform metrics not so much. so one issue with surfacing those through user workload metrics is that there is no easy way to do that? with them separate you can offload them to longer term storage that thanos sits on top of.

etsauer commented 2 years ago

@eformat yeah, the way we use prometheus for Pelorus is completely different from, and I would say incompatible with, a traditional monitoring stack. We've had a lot of questions in the past about whether a user could just use the existing openshift-monitoring prometheus instance(s) for Pelorus, and to me that seems insane, even if you ignore the support issues that would come with that. I want to view prometheus in the pelorus context as simply a standalone application datastore. We could theoretically be using mongo, cassandra, etc. (and maybe will in the future), but I just like what promql gives us right now in terms of doing aggregations over time. But I don't want it to be thought of as a monitoring component.

The fact that the community prometheus operator has some issue that causes it to interfere with OpenShift's monitoring stack is really unfortunate for us. I'm not sure what to do about that yet, as I'm not even sure what exactly the issue with it is or if it's fixable in the upstream operator.

eformat commented 2 years ago

yes, the data retention requirements are at the heart of it. ideally we'd want short term metrics in prometheus, longer term somewhere else (s3 ?) and Thanos on top. you can configure retention for user workload monitoring prometheus separately today, so none of that UWM stack interferes with the platform monitoring stack.
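As a rough illustration of the separate retention setting mentioned above, here is a minimal Python sketch that builds the `user-workload-monitoring-config` ConfigMap and prints it as JSON, which could then be piped to `oc apply -f -`. The function name and the `90d` value are assumptions chosen for the example, not something taken from Pelorus or from this thread.

```python
import json


def uwm_retention_config_map(retention: str) -> dict:
    """Build the ConfigMap that sets retention for the user workload Prometheus.

    `retention` is a Prometheus duration string such as "24h" or "90d".
    """
    # The monitoring operator reads its settings from a YAML document
    # stored under the "config.yaml" key of this ConfigMap.
    config_yaml = f"prometheus:\n  retention: {retention}\n"
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "user-workload-monitoring-config",
            "namespace": "openshift-user-workload-monitoring",
        },
        "data": {"config.yaml": config_yaml},
    }


if __name__ == "__main__":
    # "90d" is an illustrative value only; pick whatever your storage allows.
    print(json.dumps(uwm_retention_config_map("90d"), indent=2))
```

Because this ConfigMap lives in its own namespace, changing it leaves the platform monitoring configuration in `openshift-monitoring` untouched.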


this seems to work just fine with pelorus exporters from my experimenting. i have not tried configuring external writes to, say, s3 with user workload monitoring .. might be nice to see if that's doable.

yeah, as for why there is an openshift fork of prometheus etc .. it might be based on the original coreos code by the looks of it, which ran a forked version, i assume for good reason. we could find out i guess.

the path of least resistance today may be to use the user workload monitoring stack some more with pelorus for supported installations rather than re-engineer the whole metrics storage. would need to change how it is installed i.e. helm chart changes or a new helm chart just for UWM installs.

so the opposite of what you are suggesting i guess - however, i agree with you, it may be better strategically/longer term to have it all separately consumable which comes back to the use and clash with prometheus.

eformat commented 2 years ago

on this .. i have found that ocp4.10 is going to support remote_write for prometheus. this is great, so we can store long term metrics from pelorus if using user workload monitoring

https://docs.openshift.com/container-platform/4.10/monitoring/configuring-the-monitoring-stack.html#configuring_remote_write_storage_configuring-the-monitoring-stack

a working PoC is here https://github.com/eformat/thanos-s3
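For readers who want a feel for what the remote write setting looks like, here is a minimal Python sketch (not taken from the PoC above) that builds the user workload monitoring ConfigMap with a single `remoteWrite` entry. The function name and the endpoint URL are placeholders assumed for illustration; a real setup would likely also need authentication or TLS settings, which are left out here.

```python
import json


def uwm_remote_write_config_map(url: str) -> dict:
    """Build the ConfigMap that adds one remote write target to the
    user workload Prometheus."""
    # json.dumps produces a double-quoted string, which is also valid YAML,
    # so URLs with special characters stay safely quoted.
    config_yaml = (
        "prometheus:\n"
        "  remoteWrite:\n"
        f"  - url: {json.dumps(url)}\n"
    )
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "user-workload-monitoring-config",
            "namespace": "openshift-user-workload-monitoring",
        },
        "data": {"config.yaml": config_yaml},
    }


if __name__ == "__main__":
    # Placeholder endpoint: substitute the receive URL of your own
    # long term store (for example a Thanos Receive route).
    manifest = uwm_remote_write_config_map(
        "https://thanos-receive.example.com/api/v1/receive"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `oc apply -f -`; note that applying it replaces any existing `config.yaml` in that ConfigMap, so other settings such as retention would need to be merged into the same document.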

mpryc commented 1 year ago

Related discussion: https://github.com/konveyor/pelorus/discussions/809

KevinMGranger commented 1 year ago

We need to look into this to see: