berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Explore cost sharing model for EECS 16 A lab compute requirement! #3019

Open balajialg opened 2 years ago

balajialg commented 2 years ago

Summary

EECS 16 A (Designing Information Devices and Systems I) has more than 1,000 students enrolled and uses the EECS hub. Students in the EECS 16 A lab currently face challenges running their large datasets on Datahub due to the CPU requirement, and are therefore running those commands on their local machines instead. A recent conversation with the course manager revealed that they are interested in identifying a cost-sharing model to move their labs to the EECS hub. In the words of the EECS 16 A team,

The lab TA mentioned that the hurdle with Datahub seems to be with the CPU. They mentioned that they would need 2 CPUs per user, and are estimating ~70 simultaneous users every 3 hours during lab sessions. The APS labs run on weekdays for 2 weeks in November.

Created this issue as a nudge to start thinking about the cost-sharing agreement we would want to explore with the EECS folks. Let's scope this request either during our sprint meeting or as part of the strategy meeting, and dig deeper into it there.
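To make the requirement concrete, here is a minimal sketch of how a 2-CPU-per-user guarantee might be expressed in a KubeSpawner-based `jupyterhub_config.py`. The memory values and the capacity math are illustrative assumptions, not the actual EECS hub configuration.

```python
# Hypothetical jupyterhub_config.py fragment (illustrative values only).
# `c` is the config object JupyterHub injects into its config file.
c.KubeSpawner.cpu_guarantee = 2      # reserve 2 CPUs per lab user (the stated requirement)
c.KubeSpawner.cpu_limit = 2          # cap each user at 2 CPUs
c.KubeSpawner.mem_guarantee = "2G"   # assumed memory floor, not from the course request
c.KubeSpawner.mem_limit = "4G"       # assumed memory ceiling, not from the course request

# Rough capacity math for node planning:
# ~70 simultaneous users * 2 CPUs each = ~140 guaranteed CPU cores per lab session.
```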

User Stories

Tasks to complete

balajialg commented 2 years ago

@ericvd-ucb cc'ing you in this issue as it has clear financial implications!

yuvipanda commented 2 years ago

There are two primary ways of doing this.

  1. We could run the EECS hub the same way we run the data8x hub: make a new GCP project and attach a billing account, which can then be attached to its own chart string. This way, EECS pays for exactly the amount of cloud resources it uses.
  2. We could use our monitoring (Prometheus, Grafana, and Google Cloud cost reporting) to estimate the total cost of the EECS hubs. We can keep them in a separate dedicated nodepool, figure out how much that costs using a defined formula (see the sketch below), and have EECS cover that.

(1) is easier organizationally given how Berkeley works, but more expensive overall: there is added complexity from having multiple projects, as well as a loss of efficiency from economies of scale. I'd personally prefer (2).
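For option (2), one plausible "defined formula" is to pro-rate the total bill by each hub's measured usage, with shared infrastructure split separately. A minimal sketch under that assumption is below; the function and field names are hypothetical, not an existing tool or an agreed policy.

```python
def estimate_hub_share(total_monthly_bill: float,
                       usage_by_hub: dict,
                       shared_overhead: float = 0.0) -> dict:
    """Pro-rate the cloud bill across hubs by measured usage.

    usage_by_hub maps hub/namespace name -> a usage measure (e.g. CPU-core-hours
    pulled from Prometheus). shared_overhead is the portion of the bill for
    infrastructure everyone uses (logging, storage, ...), split evenly here.
    All of this is an assumed pricing model, not an agreed policy.
    """
    total_usage = sum(usage_by_hub.values())
    n_hubs = len(usage_by_hub)
    metered = total_monthly_bill - shared_overhead
    return {
        hub: metered * (usage / total_usage) + shared_overhead / n_hubs
        for hub, usage in usage_by_hub.items()
    }

# Example with made-up numbers:
# estimate_hub_share(10_000, {"eecs": 1_400, "datahub": 5_000, "data100": 2_600},
#                    shared_overhead=1_000)
```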

balajialg commented 2 years ago

@yuvipanda Thanks for your thoughtful suggestions! If we go down route #2, I wonder if there is a way to visualize the EECS-specific cost in a dashboard that their admins can access. One of my hypotheses is that they would be interested in knowing the usage and cost associated with the hub in real time, rather than the static per-month data we could otherwise share. I also view the EECS engagement as a model we can replicate with other departments/divisions whenever we choose to engage in a cost-sharing agreement. As a result, I view this opportunity as a way to optimize the process at our end. Let me know your thoughts!

balajialg commented 2 years ago

We didn't get as much time to delve deeper into this topic as we anticipated. Key highlights from the EECS cost-sharing conversation were:

yuvipanda commented 2 years ago

@balajialg Yeah, we can do that (a dashboard), but going down path #2 would require effort as well. Primarily, we have to figure out how to cost the shared infrastructure (logging, storage, etc.) that everyone uses. So we'll have to define a pricing model that is fair but easy to implement, and then implement it.

balajialg commented 2 years ago

@yuvipanda Makes sense! Given the limited technical bandwidth, I guess it comes down to what is feasible to implement. Ideally, we would decide between options 1 and 2 before winter curtailment and then scope the implementation during Spring 2022. We could use our next strategy meeting or this thread to come to a conclusion. If we need to speak to EECS stakeholders to understand their preferences, then kickstarting that conversation soon would make sense.

On another note: at 2i2c, do you follow option 1 or 2?

yuvipanda commented 2 years ago

@balajialg At 2i2c, we're currently following option 1, but we're hoping to move to option 2. Perhaps work can be pooled together there.

ericvd-ucb commented 2 years ago

Hey there - I would love to know what the monthly spend for EECS is, if it can be estimated via telemetry or something. I want to propose a framework where smaller users stay in the main GCP project (path 2), but above a certain size, e.g. $400-500/month or roughly $1,500 per semester, they move to a separate project (path 1). And maybe this is also different for the I School and EECS (inside CDSS) vs. Public Health and Biology (outside CDSS).

ericvd-ucb commented 2 years ago

And maybe we don't need to go all the way to building a dashboard; maybe we could just say we did this analysis, applied these assumptions, and came to this reasonable approximation of cost per semester. It seems like you could do that with data science plus a model once per semester, versus building a whole dashboard.

balajialg commented 2 years ago

@ericvd-ucb I love the framework you proposed for cost sharing! I volunteer to do some modeling work if we have the raw data categorized across the hubs, @yuvipanda. I don't know what the existing data looks like, so please ignore my request if it doesn't make sense.

I am all in for not creating additional work, and descoping the dashboard makes sense. An excellent next step would be to estimate the cost for different hubs/major courses and log it in a document for future reference and analysis.

yuvipanda commented 2 years ago

The next step is to figure out how to estimate these costs from the data we have. I'll work out which data sources we have and see what we can do.

balajialg commented 2 years ago

@yuvipanda What was the next step from your conversation with Eric Fraser? Can you update this thread when you are back?

balajialg commented 2 years ago

Next Steps from March Sprint Planning Meeting:

Scope this issue alongside 2i2c's efforts around developing billing solutions at the cluster level! @yuvipanda to update on when would be a good time to synchronize efforts across both teams.

balajialg commented 2 years ago

Our cloud costs for the last 12 months were roughly $91,300. To figure out per-hub costs, @yuvipanda ran this query in the GCP BigQuery explorer to calculate per-hub cloud costs across the entire last year (taking inspiration from this blog by Joe Hamman). You can check the results of this query in this spreadsheet here. I spent some time visualizing this data in R (ref. R Notebook) to get a sense of what the distribution across hubs looks like (see the snapshot below). The x-axis denotes the different namespaces and the y-axis denotes the cloud costs for the past 12 months.

[Figure: bar chart of the past 12 months' cloud costs per hub namespace]
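The actual query isn't reproduced here, but the flavor of it could look like the sketch below: a per-namespace cost rollup over a GCP billing export in BigQuery. The project/table name and the namespace label key are assumptions about how the export and GKE cost allocation are configured, not the query that was actually run.

```python
from google.cloud import bigquery

# Hypothetical project/table names; the real billing export table will differ.
BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX"

# Sums cost per Kubernetes namespace over the last 12 months, assuming GKE
# cost allocation / usage metering adds a namespace label to billing rows.
# The label key ("k8s-namespace" here) depends on how the export is set up.
QUERY = f"""
SELECT
  label.value AS namespace,
  SUM(cost) AS total_cost
FROM `{BILLING_TABLE}`, UNNEST(labels) AS label
WHERE label.key = 'k8s-namespace'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
GROUP BY namespace
ORDER BY total_cost DESC
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(f"{row.namespace}: ${row.total_cost:,.2f}")
```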

Interestingly, the total costs across all hubs (identified based on their namespaces) add up to only around $20,000. So, a couple of hypotheses based on this observation:

  1. Per-hub costs contribute only about 20% of the total costs. Hence, optimizing our savings here by partnering with interested departments may not prove worthwhile, as it is not a significant cost overhead.
  2. Something else is contributing the remaining ~$70,000, which, if optimized, would lead to significant cost savings.

I spoke with @yuvipanda to check whether he has a rationale for this discrepancy. His point was that VMs are charged from the time they are spun up until they are shut down, so there is a high possibility that these VMs are being charged even when they are idle or only partially used. There is a lot of room to optimize our infrastructure around when a VM gets launched, when it gets shut down, and how storage gets efficiently allocated at the VM level. Specifically, there is a lot of technical scope around improving the autoscalers in order to improve our cloud savings.
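As one way to quantify that idle capacity, here is a rough sketch that compares CPU requested by pods against CPU the cluster's nodes make allocatable, via the Prometheus HTTP API. The Prometheus URL is a placeholder and the metric names assume kube-state-metrics is being scraped, so treat this as an illustration rather than a ready-made report.

```python
import requests

# Placeholder URL; point this at the cluster's Prometheus server.
PROM_URL = "http://prometheus.example.edu/api/v1/query"

def prom_scalar(expr: str) -> float:
    """Run an instant PromQL query and return the first sample's value (0.0 if empty)."""
    resp = requests.get(PROM_URL, params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# CPU we are paying for vs. CPU actually reserved by running pods.
# Metric names come from kube-state-metrics and may differ across versions.
allocatable = prom_scalar('sum(kube_node_status_allocatable{resource="cpu"})')
requested = prom_scalar('sum(kube_pod_container_resource_requests{resource="cpu"})')

print(f"Allocatable CPU: {allocatable:.0f} cores")
print(f"Requested CPU:   {requested:.0f} cores")
if allocatable:
    print(f"Idle fraction:   {1 - requested / allocatable:.1%}")
```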

I am planning to either close this issue or refocus it on infra-level optimizations, considering the above perspective and also John DeNero's view that cloud cost optimization is not a worthy return on investment. Let me know if any of you have strong opinions about the direction this conversation is moving!

ryanlovett commented 2 years ago

Wow, thanks @balajialg ! Can you put your R program online somewhere for if/when this needs to be revisited?

balajialg commented 2 years ago

@ryanlovett Here is the link. I also referenced it in the post above.

ryanlovett commented 2 years ago

@balajialg Arg, sorry for missing that!