zonca closed this 3 years ago
@pibion have you tested this? any feedback?
ok, I also set up HTTPS for Grafana; it seems to be working fine, closing this.
@zonca finally taking a look at this. Do you think it's possible to make this a way to reliably estimate allocation use, per the discussion we've had with support?
Also, would you be okay with me putting a pointer to the secrets repo in this repo's README?
I think it should already be there, but for sure you can make it more prominent
So if I take the "Kubernetes / Compute Resources / Namespace (Pods)" CPU information and integrate? And then convert to SU?
No, that is at the Kubernetes level. All of the Kubernetes pods run on 2 Jetstream nodes, 1 medium and 1 xlarge, so we have a constant burn rate
Ah okay. So is it accurate to say that the Kubernetes burn plus the constant burn rate from the medium and xlarge Jetstream nodes gives us the CPU usage?
I'm trying to figure out how to answer the "How I'll monitor resource usage and users" section in the XSEDE allocation request. Users I can see from the admin console of JupyterLab. But since the XSEDE user portal usage shouldn't be trusted for Jetstream I'm looking for a replacement.
I control resource usage by deciding how many Virtual Machines I deploy. Now I have 2, and they have a constant usage of:
(24 for xlarge + 6 for medium) cores * 24 hours/day * 31 days = 22320 SUs
Ah okay, so if I track the number and type of virtual machines then I can calculate the SU usage. And I can track that through the grafana interface?
oh that is a good idea! you can check the Nodes panel, it tells you the number of CPUs:
Ooh is there a constant burn rate for a single CPU?
If so then maybe I could make some kind of report that tracks how many CPUs are running.
yes, 1 SU means 1 CPU for 1 hour. A medium instance has 6 CPUs and an xlarge has 24, so:
(24 for xlarge + 6 for medium) cores * 24 hours/day * 31 days = 22320 SUs
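The arithmetic above can be written as a small Python sketch. It assumes 1 SU = 1 vCPU-hour and the core counts quoted in this thread; the names `CORES` and `su_burn` are hypothetical and not part of any Jetstream or XSEDE tool.

```python
# Cores per Jetstream instance size, as quoted in this thread.
CORES = {"medium": 6, "xlarge": 24}


def su_burn(instances, hours):
    """SUs consumed by always-on instances over `hours`, at 1 SU per vCPU-hour.

    `instances` maps an instance size (a key of CORES) to how many are deployed.
    """
    total_cores = sum(CORES[size] * count for size, count in instances.items())
    return total_cores * hours


if __name__ == "__main__":
    # 1 medium + 1 xlarge running for a 31-day month
    print(su_burn({"medium": 1, "xlarge": 1}, hours=24 * 31))  # 22320
```

If the set of deployed instances changes during the month, each period with a fixed set would need to be computed separately and summed.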
Grafana only retains 24h of data. I found another way, sorry I didn't think about this before:
Oh, geez, I should have thought of this too. Are VCPU-hours the same as SU?
Yes
Okay, awesome, I'm now reasonably confident I can accurately determine our remaining allocation. I just need to keep track of the allocation start date and initial SUs.
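A rough sketch of that bookkeeping in Python, assuming the deployed VMs have not changed since the allocation started and that 1 SU = 1 vCPU-hour. The function name `remaining_su`, the initial allocation of 100000 SUs, and the dates below are hypothetical values for illustration only.

```python
from datetime import datetime


def remaining_su(initial_su, start, now, total_cores):
    """Estimate SUs left, assuming `total_cores` vCPUs ran continuously since `start`."""
    elapsed_hours = (now - start).total_seconds() / 3600
    return initial_su - total_cores * elapsed_hours


if __name__ == "__main__":
    # Hypothetical allocation: 100000 SUs starting 2021-01-01,
    # with 1 medium (6 cores) + 1 xlarge (24 cores) always on.
    left = remaining_su(
        initial_su=100_000,
        start=datetime(2021, 1, 1),
        now=datetime(2021, 2, 1),
        total_cores=6 + 24,
    )
    print(left)  # 77680.0
```

This is only an estimate from a constant burn rate; the VCPU-hours figure mentioned above remains the number to check it against.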
redeploying after #48, strange error in cert-manager deploying the ingress:
E0227 02:11:43.205194 1 controller.go:158] cert-manager/controller/certificaterequests-issuer-acme "msg"="re-queuing item due to error processing" "error"="Operation cannot be fulfilled on certificaterequests.cert-manager.io \"certmanager-tls-grafana-78rpf\": the object has been modified; please apply your changes to the latest version and try again" "key"="monitoring/certmanager-tls-grafana-78rpf"
I didn't read my own docs to the end, I need to create another issuer, see bottom of https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html
I have deployed Prometheus and Grafana; this is useful for checking the health of the production deployment.
I followed and updated the tutorial at https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html
See the README in the secret repository for instructions on how to access: https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/blob/master/README.md
Some example dashboards
the node dashboard continues below