Open balajialg opened 3 years ago
@ericvd-ucb cc'ing you in this issue as it has clear financial implications!
There are two primary ways of doing this.
(1) is easier organizationally given how Berkeley works, but more expensive overall: running multiple projects adds complexity, and we lose economies of scale. I'd personally prefer (2).
@yuvipanda Thanks for your thoughtful suggestions! I wonder if there is a way to visualize the EECS-specific cost in a dashboard that their admins can access if we go down route #2. One of my hypotheses is that they would want to see the usage and cost associated with the hub in real time, rather than the static per-month data that we could potentially share. I also view the EECS engagement as a replicable model for other departments/divisions whenever we choose to enter a cost-sharing agreement. As a result, I view this as an opportunity to optimize the process at our end. Let me know your thoughts!
We didn't get as much time to delve into the topic as we anticipated. Key highlights from the EECS cost-sharing conversation were:
@balajialg yeah, we can do that (dashboard), but going down the path of #2 would require effort as well. Primarily, we have to figure out how to cost the shared infrastructure (logging, storage, etc.) that everyone uses. So we'll have to define a pricing model that is fair but easy to implement, and then implement it.
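One simple pricing model for the shared infrastructure is to split its cost in proportion to each hub's direct cost. The sketch below is only an illustration of that idea, not a model anyone in this thread has agreed on; the function name, hub names, and dollar figures are all hypothetical.

```python
def allocate_shared_cost(direct_costs, shared_cost):
    """Split shared_cost across hubs in proportion to each hub's direct cost.

    direct_costs: dict mapping hub name -> direct cost in USD (hypothetical data).
    shared_cost:  cost of shared infrastructure (logging, storage, etc.) in USD.
    Returns a dict mapping hub name -> direct cost plus its share of shared_cost.
    """
    total_direct = sum(direct_costs.values())
    if total_direct <= 0:
        raise ValueError("total direct cost must be positive")
    return {
        hub: direct + shared_cost * direct / total_direct
        for hub, direct in direct_costs.items()
    }


# Hypothetical monthly figures in USD, for illustration only.
example = allocate_shared_cost({"eecs": 300.0, "datahub": 700.0}, shared_cost=100.0)
print(example)
```

Proportional allocation is easy to implement, but it assumes shared usage tracks direct spend. An equal split, or a split by active users, would be just as easy to swap in.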
@yuvipanda Makes sense! Given the limited technical bandwidth, I guess it comes down to what is feasible to implement. Ideally, I'd like us to decide between options 1 and 2 before winter curtailment and then scope the implementation during Spring 2022. We could use our next strategy meeting or this thread to come to a conclusion. If we need to speak to EECS stakeholders to understand their preferences, then it would make sense to kickstart that conversation soon.
On another note: at 2i2c, do you follow option 1 or 2?
@balajialg at 2i2c, we're currently following option 1 but hoping to move to option 2. Perhaps the work can be pooled together there.
Hey there - I would love to know what the monthly spend for EECS is, if it can be estimated from telemetry or something similar. I wanted to propose a framework where smaller users stay in the main GCP project (path 2), but above a certain size, e.g. $400-500/month or about $1500 per semester, they move to a separate project (path 1). Maybe this is also different for Ischool and EECS (inside CDSS) vs. Public Health and Biology (outside CDSS).
And maybe we don't need to go all the way to building a dashboard. We could just say: we did this analysis, applied these assumptions, and came to this reasonable approximation of cost per semester. It seems like you could do that with datascience plus a model once per semester, vs. building a whole dashboard.
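The size-based framework above could be expressed as a single threshold check. This is a rough sketch only: the function name is made up, and the $400/month default is an assumption taken from the lower end of the range mentioned above, not an agreed cutoff.

```python
def recommend_path(monthly_spend_usd, threshold_usd=400.0):
    """Suggest a billing path for a hub from its estimated monthly spend.

    threshold_usd is an assumed cutoff (lower end of the $400-500/month range
    floated in this thread); adjust it once a real threshold is agreed.
    """
    if monthly_spend_usd < 0:
        raise ValueError("monthly spend cannot be negative")
    if monthly_spend_usd >= threshold_usd:
        return "path 1: separate GCP project"
    return "path 2: main GCP project"


# Hypothetical spend estimates in USD per month, for illustration only.
for hub, spend in {"small-hub": 120.0, "large-hub": 650.0}.items():
    print(hub, "->", recommend_path(spend))
```

A per-department override (for example, different thresholds inside and outside CDSS) could be added by passing a different `threshold_usd` per hub.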
@ericvd-ucb I love the framework that you proposed for cost-sharing! @yuvipanda, I volunteer to do some modeling work if we have the raw data categorized across the hubs. I don't know what the existing data looks like, so please ignore my request if it doesn't make sense.
I am all for not creating additional work, and descoping the dashboard makes sense. An excellent next step would be to estimate the cost for different hubs/major courses and log it in a document for future reference and analysis.
Next step is to figure out how to estimate these costs from the data we have. I'll work out the data sources we have, and see what we can do.
@yuvipanda What was the next step from your conversation with Eric Fraser? Can you update this thread when you are back?
Next Steps from March Sprint Planning Meeting:
Scope this issue alongside 2i2c's efforts to develop billing solutions at the cluster level! @yuvipanda to update on when would be a good time to synchronize efforts across both teams!
Our cloud costs for the last 12 months were around $91,300. To figure out per-hub costs, @yuvipanda ran this query in the GCP BigQuery explorer to calculate per-hub cloud costs over the past year (taking inspiration from this blog by Joe Hamman). You can check the results of this query in this spreadsheet here. I spent some time visualizing this data in R (Ref R Notebook) to get a sense of what the distribution across hubs looks like (see the snapshot below). The x-axis denotes the different namespaces and the y-axis denotes the cloud costs for the past 12 months.
Interestingly, the total costs across all hubs (identified by their namespace) are only around $20,000. So, a couple of hypotheses based on this observation:
I spoke with @yuvipanda to check if he has a rationale for this discrepancy. His point was that VMs are currently charged from the moment they are spun up until they are shut down. It is highly likely that these VMs are being charged even when they are idle or only partially used. There is a lot of room to optimize our infrastructure around when a VM gets launched, when it gets shut down, and how storage is allocated at the VM level. Specifically, there is a lot of technical scope for improving the autoscalers to increase our cloud savings.
I am planning to either close this issue or refocus it on infra-level optimizations, considering the above perspective AND John DeNero's view that cloud cost optimization does not offer a worthwhile return on investment. Let me know if any of you has strong opinions about the direction this conversation is moving in!
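To put a number on the discrepancy, here is a quick back-of-the-envelope calculation using the two approximate figures quoted in this thread (about $91,300 total and about $20,000 attributed to hub namespaces). The variable names are made up for illustration, and the result is only as accurate as those rounded inputs.

```python
# Approximate figures quoted in this thread, in USD over 12 months.
total_cloud_cost = 91_300   # overall cloud bill
per_hub_total = 20_000      # sum of costs attributed to hub namespaces

# Cost that the per-namespace query does not attribute to any hub.
unattributed = total_cloud_cost - per_hub_total
unattributed_share = unattributed / total_cloud_cost

print(f"Unattributed: ${unattributed:,} ({unattributed_share:.1%} of total)")
```

In other words, roughly three quarters of the bill is not captured by per-namespace attribution, which is consistent with the idle or partially used VM explanation above.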
Wow, thanks @balajialg ! Can you put your R program online somewhere for if/when this needs to be revisited?
@ryanlovett Here is the link. I also referenced it in the post above.
@balajialg Arg, sorry for missing that!
Summary
EECS 16A (Designing Information Devices and Systems I) has more than 1,000 students enrolled and uses the EECS Hub. Students in the EECS 16A lab currently struggle to process their large datasets on Datahub because of the CPU requirement, and are therefore running those commands on their local machines. A recent conversation with the course manager revealed that they are interested in identifying a cost-sharing model to move their labs to the EECS hub. In the words of the EECS 16A team:
The lab TA mentioned that the hurdle with Datahub seems to be with the CPU. They mentioned that they would need 2 CPUs per user, and are estimating ~70 simultaneous users every 3 hours during lab sessions. The APS labs run on weekdays for 2 weeks in November.
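As a rough sizing aid, the numbers above (2 CPUs per user, ~70 simultaneous users) translate into peak CPU demand as sketched below. The node size of 16 CPUs is purely a hypothetical example, not the machine type the hub actually uses, and the helper name is made up.

```python
import math

CPUS_PER_USER = 2   # from the lab TA's estimate
PEAK_USERS = 70     # approximate simultaneous users per lab session


def nodes_needed(cpus_per_user, users, cpus_per_node):
    """Return the number of nodes needed to serve all users at once."""
    return math.ceil(cpus_per_user * users / cpus_per_node)


peak_cpus = CPUS_PER_USER * PEAK_USERS
print("Peak CPUs needed:", peak_cpus)

# Hypothetical node size; real usable CPUs per node will be lower
# once system overhead is accounted for.
print("Nodes needed at 16 CPUs each:", nodes_needed(CPUS_PER_USER, PEAK_USERS, 16))
```

This ignores memory limits and any difference between CPU guarantees and CPU limits, so it is an upper-level estimate for scoping the cost conversation rather than a capacity plan.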
Created this issue as a nudge to start thinking about the cost-sharing agreement we would want to explore with the EECS folks. So let's scope this request either during our sprint meeting or as part of the strategy meeting to dig deep into this issue.
User Stories
Tasks to complete
[ ] Identify a cost-sharing model for the required compute!
[x] Yuvi to have a conversation with Eric Fraser to initiate this discussion