aicoe-aiops / operate-first-jupyterhub-analysis

The Operate First JupyterHub application provides a Jupyter notebook environment for Python development and running machine learning workloads. In this repository, we analyze the data generated from the application hosted on the Operate First cluster on OpenShift.

Consider quotas as used from admin perspective #22

Open · Shreyanand opened this issue 2 years ago

Shreyanand commented 2 years ago

Based on the following Slack discussion and the issue faced on the smaug cluster, we need to think about improvements to the current resource allocation policy.

The current approach only looks at the usage patterns of the JupyterHub application. What happens when we have multiple services competing for the same cluster resources? How do we model the effect of cluster-level quota restrictions on the user-level profile recommendations?
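To make that interaction concrete, here is a minimal sketch of how a namespace-level quota caps how many user notebooks of a given profile can run at once. The profile names, CPU numbers, and quota values below are hypothetical, not the actual opf-jupyterhub configuration:

```python
# Hypothetical JupyterHub-style resource profiles (CPU requests/limits in cores)
# and a namespace-level quota; all values are illustrative.
PROFILES = {
    "small":  {"cpu_request": 1, "cpu_limit": 2},
    "medium": {"cpu_request": 2, "cpu_limit": 4},
    "large":  {"cpu_request": 4, "cpu_limit": 8},
}

QUOTA = {"requests.cpu": 40, "limits.cpu": 80}  # namespace hard caps


def max_concurrent_users(profile: str) -> int:
    """How many notebooks of a given profile fit before the quota blocks spawns."""
    p = PROFILES[profile]
    by_requests = QUOTA["requests.cpu"] // p["cpu_request"]
    by_limits = QUOTA["limits.cpu"] // p["cpu_limit"]
    return int(min(by_requests, by_limits))


for name in PROFILES:
    print(f"{name}: at most {max_concurrent_users(name)} concurrent notebooks")
```

A per-user recommendation that ignores these caps can look reasonable in isolation and still cause spawn failures once enough users pick the same profile.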

Aakanksha Duggal wrote: Hello team, I am getting an error while spawning a large kernel for the elyra image. Are we facing some issues today?

Anand Sanmukhani wrote: it seems like we need to increase the quota for cpu limits

1st Operator wrote: This post/thread was captured in an issue: operate-first/support#487 (https://github.com/operate-first/support/issues/487)

Anand Sanmukhani wrote: Humair Khan https://github.com/operate-first/apps/pull/1389

Humair Khan wrote: 40 cores!?

Anand Sanmukhani wrote: This is what it looks like currently

Humair Khan wrote: why is there such a huge deviation between request and limit

Anand Sanmukhani wrote: I think that's how the resource profiles are set up

Humair Khan wrote: https://grafana.operate-first.cloud/d/9iXbjpD7k/namespace-cpu-and-memory-overview?orgId=1&from=now-7d&to=now&var-datasource=openshift-monitoring&var-namespace=opf-jupyterhub&var-cluster=moc%2Fsmaug

Humair Khan wrote: this looks like such a waste

Anand Sanmukhani wrote: idk it looks good to me

Anand Sanmukhani wrote: requests and usage are pretty close to each other

Humair Khan wrote: we should consider quotas as used from admin perspective

Humair Khan wrote: well let's okay this I guess, we'll need to revisit this

Humair Khan wrote: at this rate JH will be 90% of the cluster quotas

Humair Khan wrote: what is the highest limit on a profile

Humair Khan wrote: 8 cores?

Anand Sanmukhani wrote: hmm I think this is fine, since increasing the limits does not make the cpu unavailable to other workloads

Humair Khan wrote: we should not care about that, we should care about resource quotas in relation to cluster capacity

Humair Khan wrote: quotas should be, from our perspective, "used" cpu

Humair Khan wrote: if quotas surpass cluster capacity, then you risk repeating the zero cluster situation

Anand Sanmukhani wrote: I think Shrey Anand was looking into updating these resource quotas?

Humair Khan wrote: the resource quotas or the profiles?

Anand Sanmukhani wrote: profiles*

Anand Sanmukhani wrote: here is the issue: https://github.com/aicoe-aiops/operate-first-jupyterhub-analysis/issues/12

Humair Khan wrote: seems like the limits went up instead of down

Humair Khan wrote: lol

Humair Khan wrote: for large

Humair Khan wrote: let's increase it by 20 cores

Humair Khan wrote: it's easy to increase it anyway

Anand Sanmukhani wrote: cool cool

Anand Sanmukhani wrote: updated

Anand Sanmukhani wrote: the quota that we set for this ns was just a guess anyway

Humair Khan wrote: yeah, but it's good, this will help us tune it

Anand Sanmukhani wrote: I think we can reduce the cpu requests quota

Humair Khan wrote: yeah

Tom Coufal wrote: hm.. what if we remove the cpu limit in the quota for that namespace and leave the request?

Humair Khan wrote: I guess the question then becomes what we want to achieve with quotas, in my mind quotas are a hard bound for preventing exploding requests/limits

Tom Coufal wrote: I take that back.. I'm getting confused with this again.. sigh.. :confused:

Humair Khan wrote: hahahahaha

Humair Khan wrote: we should just let data build up on this and accumulate, so we can have a couple of months to look back on

Anand Sanmukhani wrote: yeah

Humair Khan wrote: if we never see limits pass a certain mark, we'll just reduce it across the board

Anand Sanmukhani wrote: december might not be a good month for it tho

Tom Coufal wrote: wouldn't it be nice if there was a quota setting preventing total over-utilization and not bound to some limits and request values? Similar to how you can say "you can't have more than 10 PVCs", you would be able to say "you can't use more than X cores at the same time"

Humair Khan wrote: yeah, this would be fantastic

Tom Coufal wrote: I guess we need to wait for the next big thing after Kubernetes for that.. :smile:

Humair Khan wrote: but from what I understand, they don't do this due to the complexity of accurately retrieving usage metrics

Tom Coufal wrote: yeah

Anand Sanmukhani wrote: merging the PR

Anand Sanmukhani wrote: Aakanksha Duggal can you try spawning your nb again?

Aakanksha Duggal wrote: yes on it

Anand Sanmukhani wrote: looks like it worked

Aakanksha Duggal wrote: Yes! :thumbsup:

Aakanksha Duggal wrote: Thank you :smile:

Erik Erlandson wrote:

> wouldn't it be nice if there was a quota setting preventing total over-utilization and not bound to some limits and request values? Similar to how you can say "you can't have more than 10 PVCs", you would be able to say "you can't use more than X cores at the same time"

In general you can't implement this kind of resource policy without also implementing a preemption policy

Erik Erlandson wrote: if your quota is 10, and user A is using 5, user B is using 4, what happens if user B tries to increase to 6? does he get that? Can he "steal" it from user A? How does one allocate or re-allocate?

Erik Erlandson wrote: at the bottom, pods and their containers are cgroups - I can't remember if cgroups allow changing their cpu settings after the fact

_Transcript of Slack thread: https://operatefirst.slack.com/archives/C01RMPVUUK1/p1638285784149000?thread_ts=1638285784.149000&cid=C01RMPVUUK1_
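To make "quotas as used from admin perspective" concrete: treat every namespace's ResourceQuota hard values as capacity that is already spoken for, and compare their sum against what the nodes can actually allocate. Below is a rough sketch using the kubernetes Python client; it is illustrative only (error handling omitted), not a script that exists in this repo:

```python
from kubernetes import client, config


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m' or '4') to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


config.load_kube_config()
core = client.CoreV1Api()

# Capacity the scheduler can actually hand out, summed across all nodes.
allocatable = sum(
    parse_cpu(node.status.allocatable["cpu"]) for node in core.list_node().items
)

# Treat every namespace's hard quota as "used" from the admin's point of view.
quotas = core.list_resource_quota_for_all_namespaces().items
committed_requests = sum(
    parse_cpu(q.spec.hard.get("requests.cpu", "0")) for q in quotas if q.spec.hard
)
committed_limits = sum(
    parse_cpu(q.spec.hard.get("limits.cpu", "0")) for q in quotas if q.spec.hard
)

print(f"allocatable CPU:      {allocatable:.1f} cores")
print(f"quota'd requests.cpu: {committed_requests:.1f} cores")
print(f"quota'd limits.cpu:   {committed_limits:.1f} cores")
if committed_requests > allocatable:
    print("WARNING: request quotas alone exceed cluster capacity")
```

Under this view, growing the opf-jupyterhub quota is only safe while the committed totals stay comfortably below cluster capacity; once they surpass it, the "zero cluster" situation mentioned in the thread becomes possible again.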

Shreyanand commented 2 years ago

@erikerlandson To your last point: if user A is using 5 and user B is using 4, and both of their pods have a limit of 10, then from what I understand user B's workload can get at most 1 more CPU, since A was already actively using 5 before B tried to go to 6. Am I missing something here?
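Writing out the arithmetic in that scenario (numbers taken from Erik's example; purely illustrative): without a preemption policy, B can only grow into whatever headroom is left under the shared cap.

```python
# Erik's example: a shared cap of 10 cores, A actively using 5, B using 4.
CAP = 10
usage = {"A": 5, "B": 4}


def grant(desired_increase: int) -> int:
    """How much of a requested increase fits without taking CPU away from anyone."""
    headroom = CAP - sum(usage.values())
    return min(desired_increase, headroom)


# B wants to go from 4 to 6 (an increase of 2), but only 1 core of headroom remains,
# so anything beyond that would require preempting or throttling A.
print(grant(2))  # -> 1
```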

@HumairAK @tumido @4n4nd could we continue this discussion here? I guess whatever we conclude here can be used to improve the current approach of recommending resource profiles.