Closed MichaelClifford closed 2 years ago
I'd be glad to assist on this. :+1:
@Shreyanand, @oindrillac did you manage to take a look at this already? :slightly_smiling_face: We've just hit this ticket in our sprint planning today, so I'm checking with you.
@tumido, @oindrillac this comes at the perfect time, I started looking at it last week and have planned to do an EDA notebook by next week. I'll create a WIP PR with the notebook asap and then maybe we could set some time and discuss the kind of questions we'd like to answer? WDYT?
Updates from meeting with @tumido:
1) We can use the metric values pod:container_memory_usage_bytes and pod:container_cpu_usage
- these metrics also return aggregate values. ref: https://github.com/operate-first/apps/blob/00c730349941fa0f413dbcf21e2cb9f167f60cab/kfdefs/overlays/moc/zero/opf-monitoring/dashboards/odh/jupyterhub-user.yaml#L799
2) We need to work on an analysis primarily from a CPU standpoint, since we currently have limited CPU resources.
3) Note: Granularity for CPU analysis can be in decimals. Core usage can be < 1, e.g. 0.8
4) To decide "Request" sizing, we can analyze average off-peak usage across all pods and plot histograms for 4-5 categories of usage.
5) To set "Limits", we can get data on average peak usage across pods.
6) Remember to filter out Default pods (filter where limit and request are NaN/0 but there is usage).
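A minimal sketch of points 4) to 6), assuming the usage samples have already been pulled into plain Python lists. Everything named here is hypothetical: the pod dict layout, the sample values, and the bucket edges. Two definitions are assumptions as well, since the notes do not pin them down: "off-peak" is taken as the samples at or below a pod's median usage, and "peak" as a pod's maximum sample.

```python
import bisect
import math
from statistics import mean, median


def is_default_pod(request, limit):
    """True when neither request nor limit is set (None, NaN, or 0)."""
    def unset(value):
        return value is None or math.isnan(value) or value == 0
    return unset(request) and unset(limit)


def summarize(pods):
    """Return per-pod off-peak averages and per-pod peaks, in CPU cores.

    Assumption: off-peak = samples at or below the pod's median usage,
    peak = the pod's maximum sample. Default pods are skipped.
    """
    off_peak, peak = [], []
    for pod in pods:
        usage = pod["usage"]
        if not usage or is_default_pod(pod["request"], pod["limit"]):
            continue
        cutoff = median(usage)
        off_peak.append(mean(u for u in usage if u <= cutoff))
        peak.append(max(usage))
    return off_peak, peak


def histogram(values, edges):
    """Count values into len(edges) + 1 buckets.

    Bucket 0 holds values below edges[0], bucket i holds values in
    [edges[i-1], edges[i]), and the last bucket holds values >= edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect.bisect_right(edges, value)] += 1
    return counts


# Hypothetical sample data; fractional core usage is allowed (point 3).
pods = [
    {"name": "a", "request": 2.0, "limit": 2.0, "usage": [0.1, 0.2, 0.8]},
    {"name": "b", "request": 1.0, "limit": 2.0, "usage": [0.4, 0.6, 1.5, 1.7]},
    # Default pod: no request/limit but it has usage, so it is filtered out.
    {"name": "c", "request": float("nan"), "limit": 0, "usage": [0.3, 0.3]},
]

off_peak, peak = summarize(pods)
request_buckets = histogram(off_peak, [0.25, 0.5, 1.0, 2.0])  # 5 categories
avg_peak = mean(peak)  # candidate input for sizing "Limits"
```

The histogram of off-peak averages would inform "Request" sizing, and the average of per-pod peaks would inform "Limits". The median cutoff is only one way to split off-peak from peak; a time-of-day window would be another.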
Tiers available at this JH instance.
@tumido has identified that the default resource requests for the 4 tiers of JupyterHub pod configurations do not correspond well to actual usage. This has led to overcommitment of cluster resources that aren't actually being used and has impacted cluster performance. See this issue for further details: https://github.com/operate-first/blueprint/issues/83
There is an opportunity for significant improvement in overall cluster performance if we can redefine the pod resource tiers based on actual usage by our data science team.
We should use the available Prometheus metrics for the JH pods, giving special attention to the metrics shown on this dashboard, to provide a data-driven recommendation for the optimized resource tiers.
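As a starting point for pulling those metrics, here is a rough sketch that only builds a range-query URL for the Prometheus HTTP API (`/api/v1/query_range`); it does not send the request. The base URL, the helper name, and the time window are hypothetical, and no label selectors are applied because the ones needed to isolate JH pods are not stated here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the real Prometheus route.
PROM_URL = "https://prometheus.example.com"


def range_query_url(metric, start, end, step="5m", base=PROM_URL):
    """Build a query_range URL for one metric over [start, end].

    start and end are Unix timestamps (seconds); step is the resolution.
    """
    params = {"query": metric, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


# Example: one day of CPU usage samples (timestamps are placeholders).
cpu_url = range_query_url("pod:container_cpu_usage", 1600000000, 1600086400)
mem_url = range_query_url("pod:container_memory_usage_bytes", 1600000000, 1600086400)
```

The returned URL can be fetched with any HTTP client using whatever authentication the cluster requires.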
Acceptance Criteria: