[RFE] Day-2 operations - Deploy pyroscope

cloud-bulldozer / scale-ci-deploy

Automation for OpenShift Deployments - install, scaling and upgrades

Apache License 2.0


[RFE] Day-2 operations - Deploy pyroscope #214

Open rsevilla87 opened 1 year ago

rsevilla87 commented 1 year ago

Pyroscope is an interesting tool that can be very useful when reporting or troubleshooting a low-level performance issue. Deploying and configuring it as a day-2 operation once the cluster is ready would be great.

For the moment, I'd configure pyroscope to scrape these components, which already expose the /pprof endpoints by default:

kube-apiserver (Bearer token authentication)
kube-controller-manager (Bearer token authentication)
etcd (certificate authentication)
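As a rough illustration of the two authentication styles listed above, here is a minimal Python sketch that prepares requests against a pprof endpoint. It assumes the components serve the standard Go net/http/pprof paths under /debug/pprof/ (the exact path and whether profiling is enabled should be checked per component), and the helper names, hostnames and file paths are hypothetical. This is not how pyroscope itself is configured, only a way to verify by hand that an endpoint is reachable with the given credentials.

```python
import ssl
import urllib.request
from urllib.parse import urlencode


def build_pprof_request(base_url, profile="profile", token=None, seconds=None):
    """Build a GET request for a Go pprof endpoint.

    Assumes the standard net/http/pprof layout: <base_url>/debug/pprof/<profile>.
    If a token is given, it is sent as a Bearer token (the kube-apiserver and
    kube-controller-manager case in the list above).
    """
    url = f"{base_url.rstrip('/')}/debug/pprof/{profile}"
    if seconds is not None:
        # CPU profiles accept a duration, e.g. ?seconds=30
        url += "?" + urlencode({"seconds": seconds})
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def client_cert_context(ca_file, cert_file, key_file):
    """Build a TLS context that presents a client certificate (the etcd case)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

A request built this way could then be sent with `urllib.request.urlopen(req, context=ctx)`, where `ctx` is either a default context (token case) or the one returned by `client_cert_context` (certificate case).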

We could also consider (optionally?) scraping other core components, such as:

ovnkube-master
ovnkube-node
kubelet
cri-o
kube-scheduler
operator-lifecycle-manager
openshift-apiserver
openshift-kube-controller-manager
prometheus

Before starting development on this RFE, we should investigate how much storage space pyroscope requires, and determine the overhead (if relevant) it adds to these components.
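A back-of-the-envelope way to frame the space question is sketched below in Python. Every input (number of targets, profile types, scrape interval, average profile size) is a placeholder assumption that would have to be measured on a real cluster, and the result is only a raw upper-bound style estimate, since the server may compress or aggregate what it stores.

```python
def estimate_daily_storage_mib(targets, profile_types, scrape_interval_s, avg_profile_kib):
    """Estimate raw profile data collected per day, in MiB.

    All arguments are assumptions to be replaced with measured values:
    targets           - number of scraped endpoints
    profile_types     - profile kinds collected per target (e.g. cpu, heap)
    scrape_interval_s - seconds between scrapes
    avg_profile_kib   - average size of one collected profile, in KiB
    """
    scrapes_per_day = 86400 / scrape_interval_s
    total_kib = targets * profile_types * scrapes_per_day * avg_profile_kib
    return total_kib / 1024


# Hypothetical numbers: 3 targets, 2 profile types, one scrape every 10 s, 50 KiB each.
print(estimate_daily_storage_mib(3, 2, 10, 50))  # 2531.25
```

The overhead on the scraped components is a separate measurement (for example, comparing CPU and memory usage with and without scraping), which this arithmetic does not cover.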

jtaleric commented 1 year ago

With our move to prow, does it make sense to put this in scale-ci-deploy? Should we start abstracting these sort of day-2 operations in e2e-benchmarking or similar?

rsevilla87 commented 1 year ago

> With our move to prow, does it make sense to put this in scale-ci-deploy? Should we start abstracting these sort of day-2 operations in e2e-benchmarking or similar?

That's right, though I'm not sure e2e-benchmarking would be the right place, as it is a "benchmarking" repo and not a day-2 one.

Apart from that, this project still does a lot of day-2 operations and has been updated recently (for example https://github.com/cloud-bulldozer/scale-ci-deploy/pull/213 and https://github.com/cloud-bulldozer/scale-ci-deploy/pull/211). If we're going to make that move, we should consider moving all the day-2 operations currently performed here as well.

jtaleric commented 1 year ago

Well, #213 and #211 are really "hacks" that allow us to test things like OVNIC or newer bits of OVN in our CI. I'm not sure we want to conflate those steps with "Day 2 Operations".

> That's right, though I'm not sure e2e-benchmarking would be the right place, as it is a "benchmarking" repo and not a day-2 one.

Agreed... Not sure it is the right place.

Maybe we should consider a new repo for "Day 2" Operations?

smalleni commented 1 year ago

Back in the Telco days, that's the path we went down.

We would deploy using JetSki and then have all the day-2 config (operators, logging stack setup, BigIP config) encapsulated in https://github.com/redhat-performance/webfuse