Deploy and operate a BinderHub for Pangeo

choldgraf commented 2 years ago

Description / problem to solve

Problem description The Pangeo BinderHub has been down for about a month (due to crypto mining, but also because it did not have operational support to keep it going sustainably). The Pangeo community made heavy of use their Binder deployment, and it powered a lot of reproducible sharing (e.g., via gallery.pangeo.io.).

Proposed solution We should deploy a BinderHub on the 2i2c deployment infrastructure that can live in parallel to the JupyterHub we run for the Pangeo community. We'll need to make a few modifications to their setup (including using up-to-date binderhub versions and locking down auth more reliably).

What's the value and who would benefit This would allow the Pangeo community to re-gain the use of their BinderHub, which would benefit many people!

Implementation guide and constraints

There are a few things that we should consider here:

We'll need to update our configuration and CI/CD infrastructure to be able to deploy a BinderHub chart, since we're currently assuming JupyterHub.
We want to make sure we don't deploy a BinderHub and immediately run into the same crypto issues that Pangeo ran into before

Here's a GitHub issue where @scottyhq describes the environment that was available on the Pangeo BinderHub: https://github.com/pangeo-data/pangeo-binder/issues/195#issuecomment-989107771

Updates and ongoing work

Here are a few major issues that would need to be tackled as part of this effort:

[x] https://github.com/2i2c-org/infrastructure/issues/857 (so that we can generalize our config beyond just JupyterHub)
[x] https://github.com/2i2c-org/infrastructure/issues/879 (if we want to strengthen the CI/CD infrastructure in general before deploying new things, we might decide to reduce the scope a bit if need be)
[x] https://github.com/2i2c-org/infrastructure/issues/1280
[ ] https://github.com/2i2c-org/meta/issues/279
[x] https://github.com/2i2c-org/infrastructure/issues/328
[x] #1823
[x] #1824

Admin

Champion: @sgibson91
Supporter/Collaborator: @octocat
Project board: https://github.com/orgs/2i2c-org/projects/31

choldgraf commented 2 years ago

cc @rabernat and @sgibson91 - is there anything major here that I am missing? I believe that @sgibson91 is working with @consideRatio on https://github.com/2i2c-org/infrastructure/issues/857 right now, which is laying the foundation to letting us deploy BinderHubs from the Pangeo cluster CI/CD.

sgibson91 commented 2 years ago

@choldgraf I think this is a really nice outline of the work that needs to be done to get us into a position where we are ready to deploy a BinderHub. I'm happy with how this is and add to the list of tasks as an when they arise

sgibson91 commented 2 years ago

https://github.com/pangeo-data/pangeo-binder/issues/194

I'm also linking this issue as a future reminder to myself to ask about container registries for Pangeo Binder, but that is a ways down the road yet.

alxmrs commented 2 years ago

Here's a left-field suggestion: What if we don't implement a log-in system, because we never host a remote server for the jupyter notebook -- and the experience stays mostly the same?

Specifically, could we run the notebooks in the browser in a Wasm python environment via JupyterLite? Here, the demo notebooks could be hosted on a static webpage.

alxmrs commented 2 years ago

For context: The use case I had in mind was for distributing low-friction demos that don't require a log-in. This is related to the "whitelist vs blacklist" discussion around log ins in today's meeting.

choldgraf commented 2 years ago

@alxmrs if we could define a subset of workflows and/or datasets that were possible to use in JupyterLite, this would definitely be a faster way to onboard people into the Pangeo community. I think the trick will be figuring out the "hand-off" between JupyterLite and a situation where you need a fully-loaded environment, so that it doesn't confuse or frustrate people.

But at the very least, it shouldn't be too hard to try a demo out. For example, here's the repository that serves the JupyterLite instance linked from try.jupyter.org:

https://github.com/jupyter/try-jupyter

That shouldn't be too hard to replicate for Pangeo's use-case. I bet you could curate a few notebooks that showed off basic functionality to get people started (but it probably wouldn't work for the more advanced things like Dask Gateway, Zarr, etc).

sgibson91 commented 2 years ago

I'd like to start working on this in the next couple of weeks. @yuvipanda are there any strategy discussions we need to have?

Questions I have:

Is it sensible to reuse the existing pangeo-hubs cluster for this binderhub or is a new cluster needed?

damianavila commented 2 years ago

I presume this project might need a dedicated project board to collect all the associated issues.

rabernat commented 2 years ago

I am very happy to see this moving forward! 🤩

Is it sensible to reuse the existing pangeo-hubs cluster for this binderhub or is a new cluster needed?

This will be paid from the same grant that is covering the current GCP Pangeo Hub (EarthCube Pangeo Forge award). So they will go to the same billing account. If it is easier to put everything in one cluster, that's fine with me. From the "hub owner" perspective, it would still be useful to be able to segregate costs for the binder.

yuvipanda commented 2 years ago

So sorry for the delay, @sgibson91.

I think this should run in a different GCP project, and ideally a project that 2i2c bills for rather than one that columbia manages. I can't find the issue where we discussed this, but I remember @choldgraf mentioning that we can setup a project ourselves and bill columbia for it. Let's do that so we simplify our cloud access story? https://infrastructure.2i2c.org/en/latest/howto/cloud/new-gcp-project.html has info on setting up a new one. We can temporarily start out on the existing project to prototype if needed, but should switch out to new project.
Structurally, I'd imagine we'd make a 'binderhub' helm-chart, which has a dependency on both binderhub and dask-gateway. We can use condition to disable dask-gateway in future binderhubs that don't need this. A big problem here is the lack of composability in helm, and we will have to duplicate a bunch of things from our basehub and daskhub chart values.yaml files :( Is there any way to avoid this? We'd also need to make sure the z2jh version matches what we have in basehub, and I'm not sure how exactly to do that either.
For Auth, let's use CILogon. And during development, I think we can restrict login (in similar way to https://github.com/2i2c-org/infrastructure/pull/1218), but longer term we should open it up to everyone with CILogon access (many people do not have a .edu email address!). But this should protect us from miners while we figure that out.

choldgraf commented 2 years ago

Regarding 2i2c paying for cloud. I think that this would require a change to the contract that 2i2c has with Pangeo (which currently only covers personnel costs). Can we do two things:

@rabernat could you confirm that the approach @yuvipanda describes above is what you'd like to go with?
If "yes", then our next step on the admin side is to ask CS&S to request an amendment (or an addition?) to the current sub-award contract.

sgibson91 commented 2 years ago

While we wait for @rabernat to update us on the contracting question, I believe the below issue is at least actionable. I will open an issue to track it.

Structurally, I'd imagine we'd make a 'binderhub' helm-chart, which has a dependency on both binderhub and dask-gateway. We can use condition to disable dask-gateway in future binderhubs that don't need this. A big problem here is the lack of composability in helm, and we will have to duplicate a bunch of things from our basehub and daskhub chart values.yaml files :( Is there any way to avoid this? We'd also need to make sure the z2jh version matches what we have in basehub, and I'm not sure how exactly to do that either.

EDIT: Issue is here https://github.com/2i2c-org/infrastructure/issues/1280

rabernat commented 2 years ago

I think that this would require a change to the contract that 2i2c has with Pangeo

Let's get the relationships straight. Pangeo has no contract with anyone. Columbia has a contract with 2i2c. ACAICT there are in fact 3 separate contracts now supporting Pangeo-related things (NSF Earthcube @ Columbia, LEAP @ Columbia, M2LInES @ NYU).

I think this should run in a different GCP project, and ideally a project that 2i2c bills for rather than one that columbia manages. I can't find the issue where we discussed this, but I remember @choldgraf mentioning that we can setup a project ourselves and bill Columbia for it. Let's do that so we simplify our cloud access story?

This will be complicated to set up. We have only established such a contract already with NYU, not Columbia. It will require considerable administrative overhead. I would estimate 2 months to revise the existing contract. And there is always the possibility that Columbia may reject the proposal that 2i2c will bill us directly for cloud usage.

Because the cloud costs for this project are exempt from ICR, it is essential that the cloud bill be segregated from the "services" bill.

All that said, I'm fine with trying.

yuvipanda commented 2 years ago

@rabernat thanks for offering to try! I think it'll definitely simplify setup and longer term operations.

choldgraf commented 2 years ago

Thanks @rabernat for sharpening my language - I agree that we need to be clear what organizations are on each side of contracts!

For this case, it sounds like:

It would be easier for 2i2c and operations in the long-term if we have control over the cloud infrastructure.
However it might be complicated to set this up with Columbia.

So, how about I ask CS&S to investigate with the Columbia admin whether this would be complicated to set up. If it seems like it will be massively complicated, then we stick with the status quo and kick the can down the road. If it will not be complicated (say, will take ~ 1 month to set up) then we give this a shot.

If we do set this up, we'd also need the following constraints:

There is a separate invoice sent for cloud infrastructure (or @rabernat is it enough that it be a separate line item on a single invoice?)
This only applies to the BinderHub, not the JupyterHub we're already running
Anything else?

rabernat commented 2 years ago

There is a separate invoice sent for cloud infrastructure (or @rabernat is it enough that it be a separate line item on a single invoice?)

I think it would really be easiest if we got two separate invoices. Otherwise our admins will have to split the charge manually between two different accounts.

choldgraf commented 2 years ago

Hey all - I fleshed out some of the issues around the administrative / cloud payment challenges here, and added that to our list at the top. See some more conversation in that here:

https://github.com/2i2c-org/meta/issues/279

sgibson91 commented 2 years ago

We have a test Binder that is up and running on our pilot-hubs cluster! 🎉 All the infrastructure is there to make this repeatable, including auto-deployment through CI/CD. So the only thing blocking progress on reinstating the Pangeo Binder on GCP is the credits situation with Columbia.

choldgraf commented 2 years ago

Wanted to note that I heard recently from @cgentemann that there are several communities within the NASA ecosystem that would also benefit from having BinderHubs for their workshops and events. This isn't quite the Pangeo community, but it's a useful datapoint to know where people would find value in these Binder services.

The only catch is that all of their data lives in AWS, not in GCP. I don't know how difficult it would be to adapt our infrastructure to AWS as well but just wanted to note this.

sgibson91 commented 2 years ago

I don't know how difficult it would be to adapt our infrastructure to AWS as well but just wanted to note this

At the minute, it's very hacked together to specifically work with Google Artifact Registry for image storage. We should absolutely fix that, but I actually think we could use an eks cluster with a GAR since the cluster and registry are connected through a service account that is provided as a username/password in the hub config, rather than any k8s-level connection. It shouldn't be too much effort to get the BinderHub working on AWS, BinderHub is already cloud-agnostic, it's more about picking the right templates/config from basehub/daskhub to get the features the community need/want.

Generally, this BinderHub is sort of hacked together because we don't know how https://github.com/2i2c-org/infrastructure/issues/1382 will pan out and it didn't feel beneficial to get a full solution for BinderHub up-and-running when it could all be torn down and refactored in the not-too-distant future.

choldgraf commented 2 years ago

That's helpful context! So it sounds like:

If it would be really beneficial to NASA communities, then it might be worth adapting this to run on AWS as well and deploy it for a few of those communities (with the expectation that things might change as https://github.com/2i2c-org/infrastructure/issues/1382 is addressed)
So we should learn from @cgentemann how important this would be to do quickly

sgibson91 commented 2 years ago

Yeah, BUT I also don't want us to start running a whole bunch of hacked together BinderHubs, as that is just loading us up for a giant migration effort when #1382 takes shape/lands. We should maybe cap ourselves at 2-3 (or some other reasonable amount)?

rabernat commented 2 years ago

FWIW, we have another zombie binder running on AWS, https://hub.aws-uswest2-binder.pangeo.io/. It is being run by a skeleton crew of @scottyhq.

As long as we are looking at AWS, I would be very happy to see a path towards moving this binder into a more stable situation. Perhaps we can kill multiple birds here.

damianavila commented 2 years ago

I actually think we could use an eks cluster with a GAR since the cluster and registry are connected through a service account that is provided as a username/password in the hub config, rather than any k8s-level connection.

Even when that is possible, it maybe makes sense to also explore AWS ECR as well? I guess there will be some benefits to having everything in AWS land at the time to retrieve/fetch images...

yuvipanda commented 2 years ago

Specifying passwords as we have done is the only way binderhub can push to registries right now (I opened https://github.com/jupyterhub/binderhub/issues/1506), and that's also mostly ok in this context I think. I also don't think your GAR setup is too hacky, @sgibson91! It could be extended to AWS without too much difficulty I think.

If we have the money to run other binderhubs, I think we can now.

I agree #1382 is the way to go but I also worry that's a long way away, and as long as we don't make decisions here that bind possible ways to pursue #1382 i think it's ok to get some more binderhubs running.

yuvipanda commented 2 years ago

I also did a tiny amount of cleanup in our base binderhub helm chart here: https://github.com/2i2c-org/infrastructure/pull/1467

yuvipanda commented 2 years ago

I just spoke to @scottyhq and he's happy for us use the AWS account on which the current AWS binder is running on to experiment with running binder on AWS. I might give it a shot...

sgibson91 commented 2 years ago

During the "Connecting on Pangeo Binder" meeting held Friday, Sept 16th, 2022 (notes: https://docs.google.com/document/d/1P8VE6ptmPAEFfrSO9GRma7JHdZS4XBC-EN32J2Ka5hY/edit), we discussed what value add hosting an unauthenticated Pangeo Binder would actually serve and maybe we should setup a Binder on AWS with the Pangeo/Columbia grant and add this to the mybinder.org federation instead.

I discussed this idea with the JupyterHub/Binder team at the monthly team meeting (notes: https://github.com/jupyterhub/team-compass/pull/567) and they were keen to have an AWS federation member, since that is not currently represented. They wanted to know what differences there would be between this Binder and the others in the federation, and I think there's actually minimal differences, especially since Ryan no longer wanted to enable dask gateway on an open Binder instance. So we may just need to tweak CPU/RAM guarantees.

damianavila commented 2 years ago

Thanks for the update, @sgibson91!!

Given the above general +1 from the jupyterhub team, what would be the next operational steps here?

Reply to the jupyterhub team about the differences (if there are any)
Setup a Binder on AWS with the Pangeo/Columbia grant
Add this Binder to the mybinder.org federation

What pieces am I missing? Not sure about the complexity of step 3, btw.

sgibson91 commented 2 years ago

I think there's a step 1.5 which is Ryan setting up an AWS account and adding the 2i2c engineering team.

Steps 2 and 3 are kind of the same since we will put helm chart config in the jupyterhub/mybinder.org-deploy repo. It's kind of complex because the mybinder.org helm chart isn't well documented and has extra features on top of the usual binder stuff, such as the events archive. I've found it's a matter of deploy and see what breaks, try to fix it, and repeat. On the upside, we can use this as an opportunity to write some "technical guide to joining the mybinder.org federation" documentation.

damianavila commented 2 years ago

On the upside, we can use this as an opportunity to write some "technical guide to joining the mybinder.org federation" documentation.

That would be great!!

scottyhq commented 2 years ago

I just spoke to @scottyhq and he's happy for us use the AWS account on which the current AWS binder is running on to experiment with running binder on AWS. I might give it a shot...

We have $10k in AWS credits remaining from our original NASA ACCESS 2017 project and would love for 2i2c to use that account for prototyping and development of an binderhub on AWS. @yuvipanda still has access, we can connect on email or slack to coordinate access for others at 2i2c if that's helpful!

damianavila commented 2 years ago

@yuvipanda get access to @scottyhq AWS for 2i2c engs (will be assigned to @yuvipanda).
Reproducing Kube workflow in my binder-org repo (@sgibson91 to link hijacked issue)

sgibson91 commented 2 years ago

I've been discussing this idea on this issue in the team-compass repo:

https://github.com/jupyterhub/team-compass/issues/501

sgibson91 commented 2 years ago

Current status:

Yuvi granted me (and other 2i2c engineers) access to Scott's AWS account as mentioned in https://github.com/2i2c-org/infrastructure/issues/919#issuecomment-1261285423
I will begin actioning #1824, but will mostly likely track the work in a (yet to be opened) issue on the mybinder.org-deploy repo for visibility with the mybinder operating team (I will update with a link when I create it)

We have $10k in AWS credits remaining from our original NASA ACCESS 2017 project and would love for 2i2c to use that account for prototyping and development of an binderhub on AWS.

Given the above caveat of Scott's AWS account, how/where should we track the setup of the AWS account associated with the Columbia grant?

damianavila commented 2 years ago

Given the above caveat of Scott's AWS account, how/where should we track the setup of the AWS account associated with the Columbia grant?

I do not fully understand your question, @sgibson91, can you elaborate a little bit more, thanks!

sgibson91 commented 2 years ago

We are setting up the Binder to replace the AWS Binder the Pangeo community were originally operating and funding it out of the Columbia grant
There is only $10,000 USD on the account Scott gave us access to but it is not an active account. When that money is gone, it's gone. It's also not currently connected to the Columbia account.
We need a long-term account that will be paid for by the Columbia grant for the sustainability of the new AWS Binder. Who is/should be in charge of setting that up?

damianavila commented 2 years ago

Thanks for the additional context.

We need a long-term account that will be paid for by the Columbia grant for the sustainability of the new AWS Binder. Who is/should be in charge of setting that up?

That is a really good question I am not sure of the answer to it... Given the previous experience, I would say let's make 2i2c responsible to create the AWS account and then pass through the costs, but that also exposes us to some additional risk, I would say. Additionally, I am not sure what is possible from the Columbia grant side, actually...

sgibson91 commented 2 years ago

Right, and if we pass through costs like that I believe we actually have to change our contract with Columbia, as documented below regarding moving the GCP infrastructure to a 2i2c-managed project

https://github.com/2i2c-org/meta/issues/279#issuecomment-1285294965

damianavila commented 2 years ago

Adding @jmunroe into this conversation because there will be contract amendments involved/needed.

sgibson91 commented 2 years ago

I opened the following upstream issue to track the technical deployment of the infrastructure to mybinder.org-deploy

https://github.com/jupyterhub/mybinder.org-deploy/issues/2449

rabernat commented 2 years ago

Great to see progress on this. Let me know how I can help.

sgibson91 commented 2 years ago

@rabernat I think the biggest way you can help is with @jmunroe around the Columbia contract so that we can add cloud billing as a line item on invoicing. That will unblock us on two fronts:

We can setup a sustainable AWS account for this deployment (atm, the plan is to use the account that Scott graciously gave us access to, but those credits will run out eventually and then we need to start billing the Columbia grant)
We will be able to move the current GCP JupyterHub to a 2i2c-managed account and make that more sustainable too, ref: https://github.com/2i2c-org/meta/issues/279#issuecomment-1285294965

rabernat commented 2 years ago

With @yuvipanda we recently learned that Columbia AWS accounts have none of the restrictions of the GCP accounts. Anyone can get access. Does that change the calculation of the tradeoffs here?

sgibson91 commented 2 years ago

The contracting change still needs to happen for the GCP deployment. I think the fact that AWS restrictions are less is why we decided to go with this binder deployment first. But it would be nice to have a sustainable source of credits/money for it.

rabernat commented 2 years ago

What's the definition of "sustainable" here? We have about one year of funding on the Moore Foundation award left.

sgibson91 commented 2 years ago

I was just under the impression that this was supposed to be funded from that pot. If I can avoid having to do a migration between AWS projects in the future, I would prefer it.

rabernat commented 2 years ago

Sounds good 👍 . Just trying to weight the relative costs of various technical workarounds vs. the cost of amending the subaward. We have lost admin staff at Columbia recently, so our ability to execute complex budgeting actions is really degraded.

sgibson91 commented 2 years ago

I think setting up an AWS account attached to that pot of money is a quick win right now. However, when CUIT didn't respond to support our application to join the Incommon Federation, we ran out of other pathways around amending the subaward, in terms of the GCP deployment. I appreciate that it's going to take work, but 2i2c have also been trying to find a way to make working on that deployment less of a headache for a long time and have been repeatedly let down on the Columbia side of operations.

choldgraf commented 2 years ago

Hey all - I will put together a budget proposal and narrative that includes a line item for cloud costs, and see if we can get this arrangement settled quickly. If we can do this without many months or administrative slowness, then I think it would be worth it in order to reduce the stress of maintaining the infrastructure, and to give us more flexibility in access + configuration that will lead to a better service. I'll report back when we have an idea of how that process goes.

My plan will be for 2i2c to include a budget line item for expected cloud costs, this will be a conservative estimate, and we can include in our invoices the actual cloud costs as a direct pass-through.

I'll confirm with CS&S that they won't take any indirect costs on top of these cloud infrastructure costs.

2i2c-org / infrastructure