aicoe-aiops / operate-first-jupyterhub-analysis

The Operate First JupyterHub application provides a Jupyter notebook environment for Python development and for running machine learning workloads. In this repository, we analyze the data generated by the application hosted on the Operate First cluster on OpenShift.

Optimize set of Jupyterhub pod resource requirement options based on team usage #12

Closed: MichaelClifford closed this issue 2 years ago

MichaelClifford commented 3 years ago

@tumido has identified that the default resource requests for the 4 tiers of JupyterHub pod configurations do not correspond well to actual usage. This has led to over-commitment of cluster resources that aren't actually being used and has impacted cluster performance. See this issue for further details: https://github.com/operate-first/blueprint/issues/83

There is an opportunity for significant improvement in overall cluster performance if we can redefine the pod resource tiers based on actual usage by our data science team.

We should use the available Prometheus metrics for the JH pods, giving special attention to the metrics shown on this dashboard, to provide a data-driven recommendation for the optimized resource tiers.
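
To make this concrete, here is a minimal sketch (not code from this repository) of how the per-pod metrics could be pulled for analysis with the `prometheus-api-client` Python library. The Prometheus URL, the namespace label, and the time window are placeholders, and the metric name is taken from the dashboard config linked below.

```python
# Minimal sketch: pull per-pod CPU usage for JupyterHub user pods from Prometheus.
# URL, namespace, and time window are placeholders, not values from this repo.
from datetime import timedelta

from prometheus_api_client import PrometheusConnect, MetricRangeDataFrame
from prometheus_api_client.utils import parse_datetime

PROM_URL = "https://prometheus.example.com"  # placeholder endpoint
prom = PrometheusConnect(url=PROM_URL, disable_ssl=True)

start = parse_datetime("1 week ago")
end = parse_datetime("now")

cpu_data = prom.get_metric_range_data(
    metric_name="pod:container_cpu_usage",          # metric named in this thread
    label_config={"namespace": "opf-jupyterhub"},   # placeholder namespace
    start_time=start,
    end_time=end,
    chunk_size=timedelta(hours=1),
)

# Flatten into a DataFrame: one row per (pod, timestamp) sample.
cpu_df = MetricRangeDataFrame(cpu_data)
cpu_df["value"] = cpu_df["value"].astype(float)  # values arrive as strings from the HTTP API

# Quick per-pod usage summary, assuming the series carries a "pod" label.
print(cpu_df.groupby("pod")["value"].describe())
```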

Acceptance Criteria:

tumido commented 3 years ago

I'd be glad to assist on this. :+1:

tumido commented 3 years ago

@Shreyanand, @oindrillac did you manage to take a look at this already? :slightly_smiling_face: We've just hit this ticket in our sprint planning today, so I'm checking with you.

Shreyanand commented 3 years ago

@tumido, @oindrillac this comes at the perfect time. I started looking at it last week and plan to have an EDA notebook by next week. I'll create a WIP PR with the notebook ASAP, and then maybe we could set up some time to discuss the kind of questions we'd like to answer? WDYT?

oindrillac commented 3 years ago

Updates from the meeting with @tumido:

1) We can use the metric values `pod:container_memory_usage_bytes` and `pod:container_cpu_usage`; these metrics also return aggregate values. ref: https://github.com/operate-first/apps/blob/00c730349941fa0f413dbcf21e2cb9f167f60cab/kfdefs/overlays/moc/zero/opf-monitoring/dashboards/odh/jupyterhub-user.yaml#L799
2) We should focus the analysis primarily on CPU, since CPU is currently the limited resource.
3) Note: granularity for the CPU analysis can be fractional; core usage can be < 1, e.g. 0.8.
4) To decide "Request" sizing, we can analyze average off-peak usage across all pods and plot histograms for 4-5 categories of usage.
5) To set "Limits", we can get data on average peak usage across pods.
6) Remember to filter out the default pods (filter by where limit & request = NaN/0 but there is usage).
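
A rough sketch of points 4-6, not the repository's notebook code: it assumes the usage samples have already been flattened into a tidy DataFrame with `pod`, `timestamp` (Unix seconds), `value`, `request`, and `limit` columns (the last two would come from something like kube-state-metrics), and it assumes an off-peak window of everything outside 09:00-18:00 UTC.

```python
# Sketch of points 4-6: off-peak means for "Request" sizing, per-pod peaks for
# "Limits", and a flag for "default" pods (request & limit NaN/0 but usage present).
# Column names and the off-peak window are assumptions, not values from this repo.
import pandas as pd


def summarize_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Per-pod summary: off-peak mean (request sizing) and peak usage (limit sizing)."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

    # Assumed off-peak window: outside 09:00-18:00 UTC; adjust to the team's real hours.
    hours = df["timestamp"].dt.hour
    off_peak = df[(hours < 9) | (hours >= 18)]

    summary = pd.DataFrame(
        {
            "offpeak_mean_cpu": off_peak.groupby("pod")["value"].mean(),
            # Per-pod peak; averaging these across pods gives the "average peak usage"
            # mentioned in point 5.
            "peak_cpu": df.groupby("pod")["value"].max(),
        }
    )

    # Point 6: flag default pods, i.e. limit & request are NaN/0 but there is usage.
    req = df.groupby("pod")["request"].max().fillna(0)
    lim = df.groupby("pod")["limit"].max().fillna(0)
    summary["default_pod"] = req.eq(0) & lim.eq(0) & (summary["peak_cpu"] > 0)
    return summary


# Point 4: histogram of off-peak means, bucketed into ~5 candidate tiers.
# summary = summarize_usage(cpu_samples)
# summary.loc[~summary["default_pod"], "offpeak_mean_cpu"].plot.hist(bins=5)
```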

Shreyanand commented 2 years ago

Tiers available at this JH instance.