det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

Cluster monitoring with Prometheus and Grafana #40

Closed zonca closed 3 years ago

zonca commented 4 years ago

I have deployed Prometheus and Grafana; this is useful for checking the health of the production deployment.

I followed and updated the tutorial at https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html

See the README in the secret repository for instructions on how to access: https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/blob/master/README.md

Some example dashboards

image

The node dashboard continues below:

image

image

zonca commented 4 years ago

@pibion have you tested this? any feedback?

zonca commented 4 years ago

ok, I also set up HTTPS for Grafana, it seems to be working fine, closing this.

pibion commented 3 years ago

@zonca finally taking a look at this. Do you think it's possible to make this a way to reliably estimate allocation use, per the discussion we've had with support?

pibion commented 3 years ago

Also, would you be okay with me putting a pointer to the secrets repo in this repo's README?

zonca commented 3 years ago

I think it should already be there, but for sure you can make it more prominent

pibion commented 3 years ago

So if I take the "Kubernetes / Compute Resources / Namespace (Pods)" CPU information and integrate? And then convert to SU?

zonca commented 3 years ago

No, that is at the Kubernetes level. All of the Kubernetes pods run on 2 Jetstream nodes, 1 medium and 1 xlarge, so we have a constant burn rate

pibion commented 3 years ago

Ah okay. So is it accurate to say that the kubernetes burn plus the constant burn rate from the medium and xlarge Jetstream nodes gives us the CPU usage?

I'm trying to figure out how to answer the "How I'll monitor resource usage and users" section in the XSEDE allocation request. Users I can see from the admin console of JupyterLab. But since the XSEDE user portal usage shouldn't be trusted for Jetstream I'm looking for a replacement.

zonca commented 3 years ago

I control resource usage by deciding how many virtual machines I deploy. Now I have 2, and they have a constant usage of:

(24 for xlarge + 6 for medium) cores * 24 hours/day * 31 days = 22320 SUs
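The same arithmetic as a minimal Python sketch. The per-flavor vCPU counts are the ones quoted in this thread; the names `VCPUS_PER_FLAVOR` and `su_burn` are illustrative and not part of the deployment, and the calculation assumes the VMs run continuously for the whole period.

```python
# vCPUs per Jetstream flavor, as quoted in this thread.
VCPUS_PER_FLAVOR = {"medium": 6, "xlarge": 24}


def su_burn(vm_counts, hours):
    """Return the SUs consumed by VMs that run continuously for `hours`.

    vm_counts maps a flavor name to the number of VMs of that flavor.
    1 SU is taken to be 1 vCPU for 1 hour.
    """
    total_vcpus = sum(
        VCPUS_PER_FLAVOR[flavor] * count for flavor, count in vm_counts.items()
    )
    return total_vcpus * hours


# One xlarge and one medium VM over a 31-day month.
monthly = su_burn({"xlarge": 1, "medium": 1}, hours=24 * 31)
print(monthly)  # 22320
```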

pibion commented 3 years ago

Ah okay, so if I track the number and type of virtual machines then I can calculate the SU usage. And I can track that through the grafana interface?

zonca commented 3 years ago

oh that is a good idea! you can check the Nodes panel, it tells you the number of CPUs:

image

image

pibion commented 3 years ago

Ooh is there a constant burn rate for a single CPU?

If so then maybe I could make some kind of report that tracks how many CPUs are running.

zonca commented 3 years ago

yes, 1 SU means 1 CPU for 1 hour; a medium instance has 6 CPUs and an xlarge has 24, so:

(24 for xlarge + 6 for medium) cores * 24 hours/day * 31 days = 22320 SUs

zonca commented 3 years ago

Grafana only retains 24h, so I found another way; sorry I didn't think about this before:

https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/commit/02749029ca9cb48218783aa42881a8fdad53f02c

pibion commented 3 years ago

Oh, geez, I should have thought of this too. Are VCPU-hours the same as SU?

zonca commented 3 years ago

Yes

pibion commented 3 years ago

Okay, awesome, I'm now reasonably confident I can accurately determine our remaining allocation. I just need to keep track of the allocation start date and initial SUs.
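A rough sketch of that bookkeeping in Python, assuming the set of VMs has stayed the same since the allocation started. The function name `remaining_sus` is illustrative, and the allocation size and dates in the example are made up; only the 30 vCPUs (24 + 6) come from this thread.

```python
from datetime import datetime


def remaining_sus(initial_sus, start, now, total_vcpus):
    """Estimate the SUs left, assuming `total_vcpus` ran nonstop since `start`.

    1 SU is taken to be 1 vCPU for 1 hour.
    """
    elapsed_hours = (now - start).total_seconds() / 3600
    return initial_sus - total_vcpus * elapsed_hours


# Hypothetical numbers: a 100000 SU allocation that started on 2021-01-01,
# checked one 31-day month later, with 30 vCPUs running the whole time.
left = remaining_sus(
    initial_sus=100_000,
    start=datetime(2021, 1, 1),
    now=datetime(2021, 2, 1),
    total_vcpus=30,
)
print(left)  # 77680.0
```

If VMs are added or removed during the allocation, the elapsed time would need to be split into intervals with a different vCPU total for each.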

zonca commented 3 years ago

redeploying after #48, strange error in cert-manager deploying the ingress:

E0227 02:11:43.205194       1 controller.go:158] cert-manager/controller/certificaterequests-issuer-acme "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificaterequests.cert-manager.io \"certmanager-tls-grafana-78rpf\": the object has been modified; please apply your changes to the latest version and try again" "key"="monitoring/certmanager-tls-grafana-78rpf"

zonca commented 3 years ago

I didn't read my own docs to the end: I need to create another issuer, see the bottom of https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html