berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu

Increase compute resources for Course POLI SCI 3 Mondays & Wednesdays 5pm-6:30pm Spring 2024 #5315

Closed dbroockman closed 10 months ago

dbroockman commented 11 months ago

Course Name

Broockman, POLI SCI 3

Detailed Requirements

Flipped classroom course, ~350 students will be using R Datahub every Monday & Wednesday from 5pm-6:30pm.

Semester Details

Yes

Request Deadline

January 22, 5pm

dbroockman commented 11 months ago

Note: in the past we've had issues with needing to bump up compute because all students enter the notebook at the same time, e.g., see #4009 . Please allocate adequate resources :). Thank you!

balajialg commented 11 months ago

Thanks for the context @dbroockman! I created a Google calendar event that allocates 2 extra spare nodes just before the start of the classes on Monday and Wednesday. I will keep the issue open in case you want to report issues with scale-up.

dbroockman commented 11 months ago

Thanks. Looking at #4009 from last year it looks like 8 nodes were initially allocated and that turned out not to be enough. So I am worried about whether 2 will be enough. Thoughts?

ryanlovett commented 11 months ago

@dbroockman The two nodes that @balajialg is referring to are hot spares. This means that if n represents the number of nodes that your class fits on at any given time, there will be n+2 nodes online.

The cluster normally scales up nodes to match demand, so that if all nodes are occupied and one more user logs in, it will start up a new node. However, it takes a few minutes for each node to spin up, so that user will see a delay. When a lot of people try to start their servers at the same time, surpassing the rate at which nodes can start up, it can cause user-facing problems. We can configure the cluster to have spare nodes in reserve, so it can instantly make them available when new nodes are needed. The only downside to having these hot spares online all of the time is that they're doing nothing, which "wastes" resources when the rate of user server startups is low. So the middle ground is to schedule the creation of spare nodes when we know there will be a flood of users.

dbroockman commented 11 months ago

Thanks. Yes, the way my class works, all 350 students will be entering a notebook at the exact same time at 5pm on Mondays and Wednesdays. I'd really like to make sure we don't have students stuck on loading screens for 4-5 minutes, because it's a timed assignment where they only get 20 minutes to complete it. It will also be their first impression of JupyterHub.

balajialg commented 11 months ago

Thanks @ryanlovett!

@dbroockman Based on last year's estimate, almost 100 pods (user servers) were packed into a single R Hub node. I am guessing the default node allocation for the R hub would be 2 nodes and then there are 2 hot spares allocated through the calendar event before the start of the class. Unless I am missing something, this should account for 350 students trying to log in between 4:30 and 5:15.

@ryanlovett Thoughts? Should we be more generous with hot spares?

ryanlovett commented 11 months ago

@balajialg It appears that last semester, the default number of placeholders for r hub was 1. If we set it to 2 using the calendar, and if each node can accommodate 100 pods, then that means that there would be immediate capacity for 200 new servers, plus the unused capacity on the currently active nodes.

If 350 students are truly logging in at the same moment, every class, then we could set the number of spares to be 3 for a ~10 minute period around that moment. This would account for 400 users of R hub, though not all are polisci students. If fewer students are actually logging in, or if the ramp up is over a period of 10-15 minutes at the start of class and not at the same moment, then 2 seems reasonable.

Let me organize this a little...

Most anxious scenario:

More realistic scenario:

So this is me thinking out loud. A spare count of 3 is very conservative although perhaps okay for the first week. If the data shows that it is overkill, it can be reduced.
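
The capacity arithmetic in this comment can be sketched in a few lines of Python. This is only a rough illustration, not how the hub is actually configured: it assumes the ~100 pods per node figure quoted earlier in the thread, and the function names and the `free_slots_on_active_nodes` parameter are hypothetical.

```python
import math

# Assumption taken from the thread: roughly 100 user servers (pods) fit on one R hub node.
PODS_PER_NODE = 100


def immediate_capacity(spare_nodes, pods_per_node=PODS_PER_NODE):
    """Number of new servers that can start without waiting for a node to spin up."""
    return spare_nodes * pods_per_node


def spares_needed(simultaneous_logins, free_slots_on_active_nodes=0,
                  pods_per_node=PODS_PER_NODE):
    """Hot spares required to absorb a burst of logins with no node-startup delay.

    free_slots_on_active_nodes is a hypothetical input: the unused capacity
    on nodes that are already running when the burst arrives.
    """
    overflow = max(0, simultaneous_logins - free_slots_on_active_nodes)
    return math.ceil(overflow / pods_per_node)


print(immediate_capacity(2))   # 200
print(immediate_capacity(3))   # 300
print(spares_needed(350))      # 4, if the active nodes have no free slots at all
print(spares_needed(350, free_slots_on_active_nodes=100))  # 3
```

Under these assumptions, 3 spares only cover all 350 simultaneous logins if the nodes that are already running have about 50 or more free slots between them; the 100 used in the last call is an illustrative value, not a measurement.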

dbroockman commented 11 months ago

How much do each of these nodes cost to spin up / run for a couple hours? I'd suggest overprovisioning to be safe, yes.

balajialg commented 11 months ago

@dbroockman Based on the estimate for n2-highmem-8 in https://cloud.google.com/compute/all-pricing, it should cost around $1 for a couple of hours (not super expensive). I guess the admins can verify in case I missed something in my estimate.

I increased the hot spares to 3 for now which should accommodate ~300 users based on our current understanding.

shaneknapp commented 11 months ago

yep, that's about right. instance cost is ~$400/month, and with ~730 hours/month, we get ~$0.54/hr for each placeholder node.

https://cloud.google.com/products/calculator?hl=en&dl=CiRhM2U0Y2ZjZS02Yjc5LTRhMzItOGM3ZC0yZTA4ODIyNTMzOGQQCBokOUM1OEZCOTEtOUFBOS00QThBLTkxREUtNURBRjY0RDVDMzZG

we just need to make sure that we spin these down when not needed...
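
For anyone who wants to redo the cost estimate, here is a minimal sketch of the arithmetic. It uses only the approximate figures quoted above (~$400/month per instance, ~730 hours/month); the function names are hypothetical, and actual GCP billing may differ.

```python
# Approximate figures quoted in this thread, not authoritative pricing.
MONTHLY_COST_USD = 400.0
HOURS_PER_MONTH = 730.0


def hourly_cost(monthly_cost=MONTHLY_COST_USD, hours_per_month=HOURS_PER_MONTH):
    """Approximate cost of keeping one placeholder node online for one hour."""
    return monthly_cost / hours_per_month


def spare_cost(spare_nodes, hours):
    """Approximate cost of keeping `spare_nodes` hot spares online for `hours`."""
    return spare_nodes * hours * hourly_cost()


print(round(hourly_cost(), 2))     # 0.55
print(round(spare_cost(1, 2), 2))  # 1.1
print(round(spare_cost(3, 2), 2))  # 3.29
```

With these inputs, one spare for a couple of hours comes to about $1.10, and three spares for the same window to about $3.29. The hourly rate rounds to $0.55 here; the ~$0.54 above is the same quotient truncated.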

dbroockman commented 11 months ago

Thanks. Given how cheap this is, I'd ask that you err on the side of more nodes during my class. Thank you!

balajialg commented 10 months ago

@dbroockman How did the class go today? Did any students face issues with launching R Hub?

We generously provisioned placeholder nodes before the start of the class today (through the calendar event)

dbroockman commented 10 months ago

All good today thank you!

balajialg commented 10 months ago

Great! Closing this issue for now. Please reopen if we need to troubleshoot an issue.