choldgraf commented 3 years ago

Background

Many in the research community may have allotments on XSEDE Jetstream, a US-based national infrastructure for research computing. In theory, there are ways to get a Kubernetes cluster running "quickly" on Jetstream via OpenStack Magnum, as well as using kubespray.

I wonder if it would be useful for us to explore whether 2i2c can deploy to k8s on Jetstream in a cost-effective manner. I imagine this to be similar to how CloudBank works - we ask others to give us the keys to their Jetstream allotments, and can deploy to their projects without worrying about any of the cloud costs ourselves.

However, as we have learned with many cloud providers, deploying K8S on one provider can be much more work than deploying on another. I suspect this will only be sustainable for us if there is minimal-to-zero difference in human cost to 2i2c in deploying to Jetstream, or if we can reliably generate more revenue from Jetstream customers vs. commercial cloud customers.

Questions to answer

[ ] How different is deploying JupyterHub on K8S on Jetstream compared with commercial cloud?
[ ] How much extra human labor would it take?
[ ] What extra benefit would we gain from being able to do this? (e.g., either $$$ or increased impact)

perhaps either @aculich or @zonca could provide some thoughts here?

zonca commented 3 years ago

I suggest waiting until Jetstream 2 is operational, after Summer 2021, and reconsidering it at that point.

I agree that deploying and maintaining a Kubernetes cluster on Jetstream is a significant amount of work compared to the commercial cloud. Moreover, there are not many users of OpenStack Magnum, so Magnum can break across updates of OpenStack versions. Jetstream was an experimental platform, more focused on simpler deployments. I think Jetstream 2, instead, is being designed more with users of the OpenStack API in mind, so the experience should be more polished.

So if a client specifically asks for it, you could deploy via Kubespray, though without a reliable autoscaler (I had to hack my own, which you don't want to use in a large-scale deployment). Otherwise, I would wait for Jetstream 2.

I'll be an early user of Jetstream 2 and will test deploying Kubernetes + JupyterHub.

yuvipanda commented 3 years ago

Going to close this, as we won't be doing anything with this just yet. Please let us know @zonca when Jetstream 2 comes online? Would love to get involved then :)

jmunroe commented 1 year ago

Jetstream 2 is now operational, and some potential grant opportunities indicate that using this resource should be considered.

[ ] https://github.com/2i2c-org/leads/issues/112

I am reopening this issue to encourage us to revisit whether the k8s support is sufficiently developed that we can reasonably expect to run our JupyterHub deployments there, and what level of effort is required.

jmunroe commented 1 year ago

Or I would reopen this issue -- but maybe that's not an option in GitHub? (Or is it a permissions issue for me?)

zonca commented 1 year ago

The Jetstream 2 team plans to deploy OpenStack Magnum in the coming months; at that point there will be an easier way to deploy Kubernetes.

Currently we need to use kubespray, and I maintain a tutorial about it: https://www.zonca.dev/posts/2022-03-30-jetstream2_kubernetes_kubespray

Once Kubernetes is deployed, deploying JupyterHub is straightforward: you basically just need to configure ingress and add the service that provides HTTPS:

https://www.zonca.dev/posts/2022-03-31-jetstream2_jupyterhub

The configuration files are:

https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/config_standard_storage.yaml and the one created by https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/create_secrets.sh
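As a rough illustration of the ingress and HTTPS settings described above, here is a minimal sketch that builds Helm values for the Zero-to-JupyterHub chart. It is not a copy of the linked configuration files. The hostname, TLS secret name, and issuer name are hypothetical, and the sketch assumes an NGINX ingress controller and cert-manager are already installed in the cluster.

```python
import json


def build_hub_values(hostname: str, tls_secret: str, issuer: str) -> dict:
    """Return a Helm values dict that exposes JupyterHub through an Ingress with TLS."""
    return {
        # Keep the proxy service internal; the ingress controller receives external traffic.
        "proxy": {"service": {"type": "ClusterIP"}},
        "ingress": {
            "enabled": True,
            "annotations": {
                # Assumption: an NGINX ingress controller is installed.
                "kubernetes.io/ingress.class": "nginx",
                # Assumption: cert-manager is installed with a ClusterIssuer of this name.
                "cert-manager.io/cluster-issuer": issuer,
            },
            "hosts": [hostname],
            "tls": [{"hosts": [hostname], "secretName": tls_secret}],
        },
    }


# Hypothetical values for illustration only.
values = build_hub_values("hub.example.org", "hub-tls", "letsencrypt")

# Helm also accepts JSON values files, so this output can be saved and passed with --values.
print(json.dumps(values, indent=2))
```

The output could be written to a file and passed to `helm upgrade --install` alongside the rest of the hub configuration (secrets, storage class, and so on).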

yuvipanda commented 1 year ago

Thanks for the update, @zonca!

I think waiting for Magnum makes sense for us!

zonca commented 1 year ago

The Jetstream 2 team has just made Octavia available for load balancing. They were thinking about deploying Magnum, but considering Magnum is often out of date, they are actually testing out using the Cluster API. Will keep this thread updated.

yuvipanda commented 11 months ago

@zonca any idea where they're at now?

zonca commented 10 months ago

They decided not to provide Magnum. Cluster API is on their roadmap but low priority; they will look into it in the Fall, but other higher-priority tasks could push it further out.

Currently the best way to deploy Kubernetes is via Kubespray, using the tutorial I developed for the official Jetstream docs: https://docs.jetstream-cloud.org/general/k8skubespray/ Given the long timescale, you may want to explore this route. The main missing feature is autoscaling; however, users can manually scale the deployment up and down.

I also have funding from Jetstream and can provide help in simplifying/improving the tutorial.

@julienchastang and @ana-v-espinoza have the most experience deploying this in production and can give some feedback.

aculich commented 10 months ago

@zonca thanks for the update on this long-lived issue. Still very interested in how this evolves.

yuvipanda commented 10 months ago

Thank you for the detailed response here, @zonca!

ana-v-espinoza commented 10 months ago

Hey all,

I would be glad to provide feedback or assistance. If there's anything in particular you'd like to know please ask away!

-- Ana V. E.

jmunroe commented 5 months ago

With the NSF GEO OSE Project Pythia grant, this task is being considered again. Some recent notes based on conversations last week with people involved in Jetstream2:

Julian Pistorius, 5:16 PM: Hi again! Our team is having a meeting about Kubernetes next week, and I'll keep your use case in mind. I just stumbled across this, and it looks like Cluster API has support for autoscaling - no need for Senlin: https://release-1-1.cluster-api.sigs.k8s.io/tasks/cluster-autoscaler.html Did you see my email from last week?

Email from Julian Pistorius

James is Community Success Manager for 2i2c. We met at US-RSE'23: https://2i2c.org/author/james-munroe/

Jeremy is a co-PI of Jetstream2, and Le Mai manages the team responsible for Jetstream2 support. Their roles and responsibilities map to 'Community Success' for the Jetstream2 community.

I reached out to James yesterday about managed [Jupyter|Binder]Hubs on OpenStack, and he mentioned that he is involved with a new grant that will use Jetstream2: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2324304 This capability is something that we've talked about for a while in Jetstream2 and has recently increased in priority. James, do you mind giving a bit more background on the grant, including specific deliverables which could be relevant?

@jmunroe's response

The relevant section on our grant is

3.2.2 Leveraging NSF-funded cyberinfrastructure to lower barriers to performant analysis

Dedicated funding for performant BinderHubs on commercial cloud services will allow Cookbooks to flourish as a teaching and learning resource. But what pathways might we offer to users who want to take exemplar workflows and scale them up to do new science? One answer is to leverage existing NSF investments in cloud infrastructure. Through the ACCESS program, individuals or research groups are able to obtain significant allocations on Jetstream2. We envision a seamless experience in which users clone a Cookbook and launch an identical compute environment under their own persistent allocation. This will be particularly beneficial for workflows that use fully open datasets with no egress fees, such as those on the OpenStorageNetwork. Pythia will help build this capability through our partnership with 2i2c, driving new uses of NSF cyberinfrastructure while broadening participation in open science.

Pythia has already obtained a Discover ACCESS allocation for Jetstream2 resources. Under this allocation (and more that we will likely seek if this project is selected for funding), we will begin to deploy highly scalable, Kubernetes-backed Pythia JupyterHub and BinderHub on Jetstream2. There are some technical challenges to be addressed. One key difference between Jetstream2 deployments (which runs the OpenStack platform) and commercial cloud providers is the lack of an ‘auto-scaling’ or dynamic node allocation feature. When deploying JupyterHub or BinderHub to Amazon, Google, or Microsoft’s cloud, 2i2c is able to use vendor-supplied solutions that dynamically scale the number of nodes based on the demand. This keeps the deployment cost low while allowing for bursts of computing when there is increased demand. While OpenStack might have support for a similar feature through its Senlin clustering service, this does not appear yet to be a supported feature for Jetstream2. We have budgeted for anticipated development time to address these challenges, which may result in a significant broader impact in the form of enhanced Jetstream2 capabilities.
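To make the autoscaling gap concrete, here is a minimal sketch of the arithmetic behind "dynamically scale the number of nodes based on the demand". It is a simplification under stated assumptions, not how Senlin, Cluster API, or any vendor autoscaler actually decides: it looks only at CPU requests, and the function name and parameters are hypothetical.

```python
import math


def desired_node_count(pod_cpu_requests, cpus_per_node, min_nodes, max_nodes):
    """Estimate how many nodes cover the requested CPU, clamped to [min_nodes, max_nodes]."""
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    # Round up: a partially used node still has to exist.
    needed = math.ceil(sum(pod_cpu_requests) / cpus_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Quiet period: 3 user pods requesting 0.5 CPU each on 8-CPU nodes.
print(desired_node_count([0.5] * 3, cpus_per_node=8, min_nodes=1, max_nodes=10))   # 1

# Workshop burst: 60 user pods requesting 0.5 CPU each.
print(desired_node_count([0.5] * 60, cpus_per_node=8, min_nodes=1, max_nodes=10))  # 4
```

On a fixed allocation without autoscaling, the node count has to be set by hand for the expected peak (4 nodes in this example), even while demand sits at the quiet-period level.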

We specifically propose two different deployments, to be launched ASAP upon project funding: (1) JupyterHub and BinderHub on Google Cloud (budgeted in a separate CloudBank request) for use by Pythia Cookbook users and developers, with autoscaling available to keep costs manageable. (2) JupyterHub and BinderHub under a fixed allocation on Jetstream2. If Jetstream2 gains the additional functionality to enable an auto-scaling clustering service such as Senlin, 2i2c will prototype a dynamically scalable BinderHub deployment during years 2 or 3 of the funded project. These activities will be led, managed, and maintained by 2i2c with input from the existing Pythia infrastructure team.

Project Pythia currently uses a kubespray-backed control plane with a BinderHub on a static number of virtual machines on Jetstream2, managed by a team at UAlbany. The specific grant deliverable is deploying JupyterHub and BinderHub on Jetstream2 using a Terraform-defined, dynamically scalable, k8s-backed control plane. 2i2c currently deploys JupyterHub infrastructure with Kubernetes on AWS, GCP, and Azure with the intention of being cloud-agnostic. We would love to add OpenStack in general (and Jetstream2 in particular) to the platforms where we can deploy JupyterHub/BinderHub. One of my meta-goals for this work, which I didn't spell out explicitly in the grant, is improving the general capacity for multi-cloud deployments of JupyterHub/BinderHub, so that science users can seamlessly transition from one cloud to another to take advantage of data proximity and/or compute availability, in an effort to lower costs across many NSF projects.

What I have been told by our 2i2c engineering team is that a self-managed Kubernetes control plane (such as one using kubespray) is not sustainable or scalable for our team. Our avoidance of managing the inner workings of a Kubernetes deployment is not due to a lack of experience with Kubernetes; on the contrary, I feel the 2i2c engineers are aware of just how much effort it is to keep a managed Kubernetes service up and running in the long term.

Julian has pointed me to OpenStack ClusterAPI as a possible solution for kubernetes on Jetstream2. That very well might be the missing piece of technology on OpenStack needed to make this work.

As you continue to plan next steps for Jetstream2, please let me know if the 2i2c team can be of assistance in the requirements gathering, development, or testing phases. If that would be helpful, I think we could consider it part of 2i2c's work under our larger NSF grant. If you'd like to meet with one of our engineers (or have them attend a Jetstream2-focused meeting), please let me know and I will try to coordinate.

jmunroe commented 4 months ago

I think we are ready to schedule a technical sync meeting with some folks from Jetstream2 and some folks from 2i2c to discuss issues related to deploying hubs on Jetstream2. My understanding is that the primary blocker, from the 2i2c side, is the need for a managed Kubernetes layer on Jetstream2.

Email from Julian Pistorius received today:

Thank you for the background information James. This is very exciting.

I think a meeting with engineers on your side would be good. What should we read/watch/try before such a meeting in order to make it as productive as possible?

@jmunroe's response:

When 2i2c refers to 'deploying JupyterHub on Kubernetes', it means a scalable and repeatable way of applying the workflow documented as Zero-to-JupyterHub (z2jh). There is significant overlap between the main contributors to z2jh and the team at 2i2c. So one approach would be to think about how a to-be-written section titled 'Kubernetes on Jetstream2' would appear in the Setup Kubernetes section of the z2jh guide.

Within 2i2c, we deploy many (100+) JupyterHubs in a cloud-agnostic way (currently supporting GCP, AWS, and Azure), building on top of the helm charts in the JupyterHub project. Our internal documentation on deploying a Kubernetes cluster is found in the Add Kubernetes Cluster section. So another way of discussing the Kubernetes requirements would be to imagine what a 'New kubernetes cluster on Jetstream2' section might contain. Everything 2i2c does is open (see https://github.com/2i2c-org/infrastructure for the configuration of all of the hubs we maintain), we are upstream-first in our contributions, and we actively try to live out what we call the Right to Replicate as a vision for open science infrastructure.

If 'watching' is preferable to reading, I'd point you to a talk that 2i2c engineer Sarah Gibson gave at JupyterCon 2023, "No Magic Added: Deploying Multiple JupyterHubs to Multiple Clouds from one Repository", for a behind-the-scenes view of what exactly 2i2c does and how that all sits on top of a managed Kubernetes layer.

Looking at various people's calendars, I wonder if we could try to arrange a 60 min sync session the first week of March. Here's a when2meet to try to identify a slot that works. (If the first week of March is not an option, please let me know and we can reschedule.) How does that sound?

@damianavila : I've added this to our cross-functional initiative board so we can get this discussion and a possible meeting appropriately prioritized.

jmunroe commented 4 months ago

A meeting including representation from Jetstream2 (Indiana University) and 2i2c occurred on 2024-03-04. The purpose of the meeting was to discuss the essential "must-haves" in a managed Kubernetes layer on Jetstream2 to allow deploying a JupyterHub/BinderHub in the sense of following the Zero-to-JupyterHub deployment guide.

Yuvi summarized that the essential components of a managed Kubernetes layer needed for JupyterHub are:

[ ] Control Plane abstraction (+HA?)
[ ] Dynamic Provisioning for PVC (hub db, prometheus, etc)
[ ] Control plane upgrades + Node Upgrades
[ ] Node Autoscaling abstraction
[ ] service LoadBalancer support (gets traffic into the cluster)
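One way to work with this checklist when comparing candidate Kubernetes setups is to encode it as data and report the gaps. The sketch below is only illustrative: the requirement keys paraphrase the list above, and the example capability values are hypothetical placeholders, not an assessment of Jetstream2 or of any real tool.

```python
# Requirement keys paraphrasing the checklist above.
REQUIREMENTS = [
    "control_plane_abstraction",
    "dynamic_pvc_provisioning",
    "control_plane_and_node_upgrades",
    "node_autoscaling",
    "service_load_balancer",
]


def missing_requirements(capabilities: dict) -> list:
    """Return the checklist items that a platform does not (yet) provide, in checklist order."""
    return [name for name in REQUIREMENTS if not capabilities.get(name, False)]


# Hypothetical platform description, for illustration only.
example_platform = {
    "control_plane_abstraction": False,
    "dynamic_pvc_provisioning": True,
    "control_plane_and_node_upgrades": True,
    "node_autoscaling": False,
    "service_load_balancer": True,
}

print(missing_requirements(example_platform))
```

An empty list would mean the platform covers every item Yuvi listed; any key left out of the dictionary is treated as missing.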

Currently, the best documentation available for deploying Kubernetes and JupyterHub on Jetstream2 is the set of blog posts by Andrea Zonca.

Julian Pistorius and Jeremy Fisher reported that improving Kubernetes support and having a recommended solution for JupyterHub on JS2 is on their roadmap for 2024.

The meeting ended with the suggestion that a new "blog post" is the right way to iterate on a solution and then have that solution upstreamed into Zero-to-JupyterHub.

I've created this issue as an upstream place to continue this discussion and work:

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3354

Additional references:

Video recording of 2024-03-04 meeting: https://drive.google.com/file/d/1CwAy1vMnagDwTy-yYS3fa2myc2x3iYUV/view?usp=drive_link

julianpistorius commented 4 months ago

Thank you @jmunroe!

Relevant Jetstream2 issues:

Prove and document use of Kubernetes Cluster API: https://gitlab.com/jetstream-cloud/project-mgt/-/issues/108
Explore Jupyter interface for Jetstream2: https://gitlab.com/jetstream-cloud/project-mgt/-/issues/100

yuvipanda commented 4 months ago

/cc @cboettig, who is also deeply invested in using Jetstream for this kinda stuff.

2i2c-org / infrastructure

Discuss deploying hub infrastructure to Jetstream #188
