2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

New Hub: Geospatial workshop in Ghana #473

Closed choldgraf closed 3 years ago

choldgraf commented 3 years ago

Background

@paigem works with @rabernat, and is helping to lead/organize a workshop around geospatial analytics (e.g., the "Pangeo stack") in Ghana. In previous years, they have asked attendees to install things on their local machines, but she would love to have access to cloud infrastructure via 2i2c that supports this workshop.

The team behind this workshop currently does not have funding for infrastructure/services, so this would be a pro-bono case. In my opinion, it is well worth the time investment because it is a great cause, and a way to see how our infrastructure could serve those in countries outside North America and Europe.

@paigem could you help us answer some of the questions in the section below?

Setup Information

Important Information

Deploy To Do

paigem commented 3 years ago

Thanks @choldgraf for getting this underway! Very much appreciated!

I am not sure I understand a lot of what the "Setup Information" section is asking for (e.g. do I need to provide the hub type, url, etc.?), but I can fill in a bit of the "Important Information":

Important Information

For two reasons, we may want to extend the hub end date to next year or beyond:

Happy to provide more of the above information with a bit more guidance! Thank you!!

yuvipanda commented 3 years ago

This is great, we should definitely support this! Do you know which funding source we can use for this, @choldgraf?

choldgraf commented 3 years ago

@yuvipanda that's a good question, here are a few options I can think of:

choldgraf commented 3 years ago

I think we should start off using the JROST funds, and then try to find credits elsewhere

choldgraf commented 3 years ago

I wonder if @scottyhq, @consideRatio, @jhamman, or @rabernat could comment on what kind of cost we might expect for this workshop. If we have ~30-100 users doing "pangeo-style" environment analysis for 2 weeks, what kind of cost could we expect to incur in cloud infrastructure? This feels like it may be similar to the GeoHackWeeks.

choldgraf commented 3 years ago

I spoke with @rabernat who mentioned that we could use the Columbia Pangeo credits for this one. I believe that those are on GCP as well. @sgibson91 @yuvipanda is there any technical challenge to using these credits for this hub? (assuming that it will be a different hub from the "main" Pangeo hubs)

consideRatio commented 3 years ago

Hmmm hmmm @choldgraf I'm not feeling confident about cost estimation, as it depends so heavily on how much work users generate through their ability to request compute via Dask clusters, but I'll try to estimate things anyhow.

The base cost could be like any other hub for 2 weeks I guess, but then the Dask worker nodes add to that. They will be configured as spot/preemptible instances that cost ~30% of the original instances, so if you have for example a 32-core instance it's like 300 USD / month (150 USD / 2 weeks). I'll go ahead and guesstimate the cost won't go over 1000 USD for Dask worker nodes if ~50 users play around with Dask workers and we force machines to be limited to 32 CPU cores and limit autoscaling to ~10 nodes (320 cores).
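
The arithmetic behind this guesstimate can be written out as a rough sketch. The 1000 USD/month on-demand price is an assumption back-derived from the "300 USD / month at ~30%" figure above, not a quoted GCP price, and a month is treated as 4 weeks so that 2 weeks is half a month, as in the estimate. The `utilization` parameter is hypothetical: it stands for the fraction of the workshop during which worker nodes are actually running.

```python
def spot_node_cost(on_demand_monthly_usd, spot_fraction, weeks, weeks_per_month=4.0):
    """Cost in USD of one spot/preemptible node running continuously for `weeks`."""
    return on_demand_monthly_usd * spot_fraction * weeks / weeks_per_month


def dask_worker_cost(nodes, utilization, **node_kwargs):
    """Cost of `nodes` worker nodes that are up for a fraction `utilization` of the time."""
    return nodes * utilization * spot_node_cost(**node_kwargs)


# Assumed inputs: ~1000 USD/month on-demand for a 32-core node (back-derived),
# spot price at ~30% of on-demand, a 2-week workshop.
node = dict(on_demand_monthly_usd=1000, spot_fraction=0.30, weeks=2)

per_node = spot_node_cost(**node)                               # one node, 2 weeks
ceiling = dask_worker_cost(nodes=10, utilization=1.0, **node)   # all 10 nodes, always on
two_thirds = dask_worker_cost(nodes=10, utilization=2 / 3, **node)

print(f"per node: {per_node:.0f} USD")
print(f"10 nodes always on: {ceiling:.0f} USD")
print(f"10 nodes up two thirds of the time: {two_thirds:.0f} USD")
```

Under these assumptions, ten nodes running around the clock would exceed the 1000 USD guess, so the guess implicitly relies on the autoscaler keeping worker nodes up only part of the time.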

choldgraf commented 3 years ago

that's a really helpful analysis @consideRatio , thanks very much :-)

paigem commented 3 years ago

Thanks @choldgraf @consideRatio for your efforts here! With my limited understanding of all of this, I think what @consideRatio lays out here sounds very reasonable. I don't anticipate having too many high Dask-usage workloads during the school, since for many participants of the school this will be their first time using Dask or accessing large climate datasets. And especially with so many new Dask users, those CPU and scaling limits mentioned by @consideRatio will be very important.

sgibson91 commented 3 years ago

@consideRatio gave a really nice costing estimate above! 🙌🏻 From a technical stand-point, I think we run into the same issue as https://github.com/2i2c-org/team-compass/issues/136 and we don't have billing control of that project.

paigem commented 3 years ago

Just checking on an update here! This year's workshop is coming up very soon, and I just want to know if it's likely a Hub can be set up and fully functional by July 19th at the latest, or if I should make alternate plans instead (which would be doable, as long as I know soon). Thanks!

choldgraf commented 3 years ago

@sgibson91 I believe that we can deploy this hub on Pangeo infrastructure as well, so could the temporary fix for https://github.com/2i2c-org/team-compass/issues/136 also be applied to this hub?

sgibson91 commented 3 years ago

@choldgraf we now have a bigger blocker on that project and I've resorted to testing on the GCP project that is currently hosting Pangeo infrastructure (that I can access with my 2i2c account, not Columbia)

choldgraf commented 3 years ago

@sgibson91 I think it's fine if we use whatever GCP account we have access to in order to serve the Ghana Hub. If worst comes to worst, we'll use our $5,000 JROST grant to pay for the cloud infrastructure.

sgibson91 commented 3 years ago

Ok, well there's a fresh cluster on the pangeo-181919 project as of today that I believe myself, @yuvipanda and @damianavila have access to. I can put my focus on this from Monday unless either of them get there before me?

choldgraf commented 3 years ago

that'd be super awesome :-)

sgibson91 commented 3 years ago

I am not sure I understand a lot of what the "Setup Information" section is asking for (e.g. do I need to provide the hub type, url, etc.?)

Hi @paigem! I think the most important questions here to get going are:

  1. What method would you like workshop attendees to log into the hub with? Such as GitHub, or Google accounts?
  2. Do you need parallel-processing capabilities provided by Dask, or will a more "vanilla" setup (such as 1 CPU) suffice? (This helps us answer the hub type question)

We can generate a URL that will be something like foo.bar.2i2c.cloud, but if the workshop has a URL you might like to have the hub be a subdomain of that. We could add a CNAME that is something like hub.workshop-url to our records.

sgibson91 commented 3 years ago

There's a WIP PR open to deploy a Hub in https://github.com/2i2c-org/pilot-hubs/pull/508 :)

paigem commented 3 years ago

Hi @sgibson91! Thanks so much for putting this hub into action!

  1. Good question. Very few of the attendees will have a GitHub account, but I think that most would have a Google account so perhaps we should go with that. A couple questions about how this will work: (1) Will I need a list of workshop attendees and their emails beforehand to give them access, or would any attendee be able to login with their Google account? (2) Would there be an alternative (e.g. adding someone's email manually) if an attendee does not have a Google account?
  2. Ideally there would be the capacity to spin up Dask clusters, as part of the goal of this hub is to access large climate models stored on the cloud. That being said, many (probably the majority) of the tutorials for the workshop will not require Dask, so many attendees wouldn't need it and Dask would probably be used only intermittently. As an example, I would like to run some of the Pangeo tutorials (such as this one that uses 5 compute nodes or this one that can scale up to 20) that make use of Dask to do some larger computations. If we are able to use Dask clusters in this hub, then I would definitely want some user limits if that's possible.

As for the URL, we have a website powered by Wordpress (coessing.org). If we could do something like hub.coessing.org that would be nice, but I don't think it's a high priority. I.e. if it's easier to create the URL with a 2i2c address, that's perfectly fine!

Question: will I (or any admin of the hub) be able to populate everyone's hubs with tutorial notebooks? If so, would I be able to update them during the week?

I hope this helps to answer the questions remaining - please let me know if I wasn't clear or if you have any more questions!

sgibson91 commented 3 years ago

I would really appreciate @2i2c-org/tech-team stepping in here if I've misunderstood something when giving my response! 😄

(1) Will I need a list of workshop attendees and their emails beforehand to give them access, or would any attendee be able to login with their Google account? (2) Would there be an alternative (e.g. adding someone's email manually) if an attendee does not have a Google account?

  1. If we just enable authentication, then yes anyone with a Google account would be able to log in. We can restrict that to a list of attendees + admins, in which case we would need the list
  2. I believe this is possible via the Hub Admin page - would love some confirmation on that!

If we are able to use Dask clusters in this hub, then I would definitely want some user limits if that's possible.

I'm not (yet!) as familiar with the Dask chart as I am with the Zero-to-JupyterHub one, but I know computational limits per user are possible in z2jh and I'd be surprised if this wasn't also configurable in the Dask chart. What kind of limits do you think would be appropriate?

As for the URL, we have a website powered by Wordpress (coessing.org). If we could do something like hub.coessing.org that would be nice, but I don't think it's a high priority. I.e. if it's easier to create the URL with a 2i2c address, that's perfectly fine!

We would definitely create the 2i2c address - it'd then be an extra step to add hub.coessing.org to that record (not a difficult/impossible one though!)

Question: will I (or any admin of the hub) be able to populate everyone's hubs with tutorial notebooks? If so, would I be able to update them during the week?

I think this is possible through either a shared folder or nbgitpuller. Using nbgitpuller would provide the benefit of reloading the content every time the link is launched, so keeping your notebooks up-to-date would be as simple as re-clicking the link. Would love to hear the team's experience setting this up!

sgibson91 commented 3 years ago

in which case we would need the list

I no longer think we, as in 2i2c engineers, need this list (I can't see any config for this in the repo). I think this is something the hub admins manage. See these docs on managing users

And here are docs on distributing content with nbgitpuller
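
As a rough illustration of the nbgitpuller approach mentioned above: an nbgitpuller link is an ordinary hub URL with the repository passed as query parameters, so it can be built with the standard library. This is only a sketch. The hub address reuses the `foo.bar.2i2c.cloud` placeholder from earlier in the thread, and the repository URL and the `nbgitpuller_link` helper are hypothetical.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Build a link that pulls `repo_url` into the user's home directory on the hub."""
    params = {"repo": repo_url, "branch": branch}
    if urlpath:
        # Optional: where to land after the pull, e.g. a folder in JupyterLab.
        params["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


link = nbgitpuller_link(
    "https://foo.bar.2i2c.cloud",                         # placeholder hub URL
    "https://github.com/example-org/workshop-notebooks",  # hypothetical repository
    urlpath="lab/tree/workshop-notebooks/",
)
print(link)
```

Each time an attendee clicks the link, nbgitpuller fetches the latest commits from the chosen branch, which is what makes mid-week updates to the notebooks possible.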

choldgraf commented 3 years ago

Yep @sgibson91 - we need the list of the initial Hub Admin team, and then they can add users as they wish (or other admin users, even)

paigem commented 3 years ago

Thanks @sgibson91 for answering all of my questions! This all sounds great, and thanks for the links to the documentation. I'll be reading that in the next couple days to bring myself up to speed on how this all works!

What kind of limits do you think would be appropriate?

I will think on this a bit more. I will use some of the tutorials from Pangeo Gallery and so will probably set Dask limits based on those tutorials as a benchmark. Will Dask Gateway be used in this hub? I believe that on Pangeo Cloud, limits are done by the total size of a cluster, but users are able to tweak the cluster to have more workers with less memory per worker, or fewer workers with more memory each. In general, I would be in favor of putting many Dask limits in place to guard against costly mistakes. For instance, if a Dask cluster is inactive for a certain amount of time, it should be shut down, etc.

Not sure how much control I will have over these aspects of the hub, and I know this particular hub is a bit of a test run, so it's no problem if these details aren't ironed out. However, it would be nice to know if I, as the hub admin, am able to view each user's usage of the hub, so that if there is a problem (e.g. someone keeps spinning up giant clusters for no apparent reason) I am able to somehow stop a user from continuing (e.g. by ending their session). Most of this is motivated by wanting to keep costs to a minimum and not have any accidental surges in cost.

we need the list of the initial Hub Admin team

For now, I think that's just me on the workshop side. I may add a couple others, but I'll be the main person running all of the Python content at the workshop.

Thank you again for your efforts here! I'm excited to be learning about 2i2c and how these hubs work!
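
To make the "limit by total cluster size" idea above concrete, here is a small standalone sketch. It is not Dask Gateway's API or the hub's actual configuration; the `ClusterLimits` class, the helper functions, and all the numbers are hypothetical and only illustrate how a total-size cap lets users trade worker count against memory per worker, plus the idle-shutdown rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLimits:
    """Hypothetical per-user caps on the total size of a Dask cluster."""
    max_cores: float
    max_memory_gib: float
    idle_timeout_seconds: float


def fits_within(limits, n_workers, cores_per_worker, memory_gib_per_worker):
    """True if the requested cluster stays under both total caps."""
    total_cores = n_workers * cores_per_worker
    total_memory = n_workers * memory_gib_per_worker
    return total_cores <= limits.max_cores and total_memory <= limits.max_memory_gib


def should_shut_down(limits, idle_seconds):
    """True once a cluster has been inactive for at least the idle timeout."""
    return idle_seconds >= limits.idle_timeout_seconds


# Illustrative numbers only.
limits = ClusterLimits(max_cores=20, max_memory_gib=80, idle_timeout_seconds=1800)

many_small = fits_within(limits, n_workers=20, cores_per_worker=1, memory_gib_per_worker=4)
few_large = fits_within(limits, n_workers=5, cores_per_worker=4, memory_gib_per_worker=16)
too_big = fits_within(limits, n_workers=40, cores_per_worker=1, memory_gib_per_worker=4)

print(many_small, few_large, too_big)
```

Both the "many small workers" and the "few large workers" shapes pass because they add up to the same totals, while doubling the worker count is rejected.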

sgibson91 commented 3 years ago

What kind of limits do you think would be appropriate?

I will think on this a bit more. [...]

Ok, I'll investigate what limits Pangeo have as a starting point and we can revise from there. My initial poking around of the chart docs and default settings 2i2c provide gives me confidence that this is configurable though.

I, as the hub admin, am able to view each user's usage of the hub

We could also try deploying our support chart which would provide you with grafana visualisations of real-time hub usage? https://github.com/2i2c-org/pilot-hubs/pull/456

I am able to somehow stop a user from continuing (e.g. by ending their session).

Yes, you will have a very satisfying, big, red "Stop server" button for every user from the hub admin panel :)

For now, I think that's just me on the workshop side.

That's great, could you send me a preferred Google email so I can add you please?

sgibson91 commented 3 years ago

We could also try deploying our support chart which would provide you with grafana visualisations of real-time hub usage? #456

Hah, I already provided a grafana deployment :D https://github.com/2i2c-org/pilot-hubs/blob/7bf90fba90771150a030e5df2bf3febb26b7f651/config/hubs/pangeo-hubs.cluster.yaml#L9-L18

paigem commented 3 years ago

This is all sounding great @sgibson91!! Thanks for your help!

I think using the Pangeo Dask limits is the way to go, if that is doable.

Yes, you will have a very satisfying, big, red "Stop server" button for every user from the hub admin panel :)

Very good to know! :)

Sounds like progress is being made very quickly. Is there an estimated day that the server will be up and running enough for me to start poking around?

sgibson91 commented 3 years ago

Sounds like progress is being made very quickly. Is there an estimated day that the server will be up and running enough for me to start poking around?

I think the last thing I will need to do is set up a URL for the hub - so how about I try having the hub up for you to test sometime early next week, mid-week at latest?

paigem commented 3 years ago

@sgibson91 next week is good - the earlier the better! The school starts one week from Monday, and I would like to have some time for myself and other instructors to try it out before then to make sure we've got all the packages we need. :)

Speaking of which, are we planning to use the same Python environment (and packages) as Pangeo? It would be great if we could use the same as Pangeo, with at least one extra package (ecco_v4_py). Let me know if you need me to make an environment file of some sort.

sgibson91 commented 3 years ago

Hi @paigem - I'm working on deploying now! I've set the image to be pangeo-notebook, so we should have the same env as Pangeo

paigem commented 3 years ago

Oh wow!! That was quick! Thank you!

And sounds great to have the Pangeo env!

sgibson91 commented 3 years ago

Hey @paigem! The Hub is available at https://coessing.pangeo.2i2c.cloud for testing!

paigem commented 3 years ago

Amazing @sgibson91!! Thank you!! Will test it out tomorrow!

paigem commented 3 years ago

@sgibson91 the Hub is working great! So excited to be using it for our summer school!

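A couple questions:

  • Should I continue asking questions specific to my Hub in this thread, or should I start a new issue in 2i2c/pilot as mentioned in the documentation?

  • The Pangeo base environment is great. But would it be possible to add the package ecco_v4_py to all users' environments? I was able to conda install it just fine, but it would be great if all users don't have to do that.

  • It appears that I cannot access datasets stored on Pangeo Cloud. For instance, I tried to load the ECCO dataset as I do in Pangeo Cloud:

import intake
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ecco_monthly_ds = cat.ECCOv4r3.to_dask()

But I get the following error: OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/pangeo-ecco-eccov4r3/o/eccov4r3%2F.zmetadata?alt=media Caller does not have serviceusage.services.use access to the Google Cloud project. Is there any way we can have access to the Pangeo Cloud datasets found here?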

rabernat commented 3 years ago

But I get the following error: OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/pangeo-ecco-eccov4r3/o/eccov4r3%2F.zmetadata?alt=media Caller does not have serviceusage.services.use access to the Google Cloud project.

Just confirming, based on my experience, that this permission is needed to access requester-pays datasets.
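
When debugging an error like this, it can help to confirm which bucket and object the failing request targeted. The URL in the message follows the Cloud Storage JSON API media-download layout (`/b/<bucket>/o/<object>`), so it can be taken apart with the standard library. This is a minimal sketch; `parse_gcs_media_url` is a hypothetical helper, not part of gcsfs or intake.

```python
from urllib.parse import unquote, urlsplit


def parse_gcs_media_url(url):
    """Return (bucket, object_name) from a GCS JSON API download URL."""
    path = urlsplit(url).path
    # Expected shape: /download/storage/v1/b/<bucket>/o/<percent-encoded object>
    _, _, rest = path.partition("/b/")
    bucket, separator, encoded_object = rest.partition("/o/")
    if not bucket or not separator:
        raise ValueError(f"not a GCS media download URL: {url}")
    return bucket, unquote(encoded_object)


failing_url = (
    "https://storage.googleapis.com/download/storage/v1/b/"
    "pangeo-ecco-eccov4r3/o/eccov4r3%2F.zmetadata?alt=media"
)
bucket, object_name = parse_gcs_media_url(failing_url)
print(bucket)
print(object_name)
```

Here the request is for the consolidated Zarr metadata file at the root of the ECCO store, which is the first object read when the dataset is opened, so the failure happens before any data chunks are touched.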

sgibson91 commented 3 years ago

  • Should I continue asking questions specific to my Hub in this thread, or should I start a new issue in 2i2c/pilot as mentioned in the documentation?

I think here is fine for now as making sure you're happy with things is an action on this issue and the PR hasn't been merged yet.

  • The Pangeo base environment is great. But would it be possible to add the package ecco_v4_py to all users' environments? I was able to conda install it just fine, but it would be great if all users don't have to do that.

Ok, I will look into bootstrapping the pangeo image so this package is available for you.

  • It appears that I cannot access datasets stored on Pangeo Cloud. For instance, I tried to load the ECCO dataset as I do in Pangeo Cloud:

import intake
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ecco_monthly_ds = cat.ECCOv4r3.to_dask()

But I get the following error: OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/pangeo-ecco-eccov4r3/o/eccov4r3%2F.zmetadata?alt=media Caller does not have serviceusage.services.use access to the Google Cloud project. Is there any way we can have access to the Pangeo Cloud datasets found here?

Ah, this looks more complicated 😬 Does that permission need to be turned on for every google account that has logged into the hub? I don't know if these requests are being made "as the user" or "as the JupyterHub". Does the @2i2c-org/tech-team have any experience with this sort of thing?

paigem commented 3 years ago

Ah, this looks more complicated 😬 Does that permission need to be turned on for every google account that has logged into the hub? I don't know if these requests are being made "as the user" or "as the JupyterHub". Does the @2i2c-org/tech-team have any experience with this sort of thing?

Yes, ideally every user would have access to these datasets. This ECCO dataset will be used for a couple labs, and so not everyone will end up needing access, but many will and I won't know ahead of time who that will be. Part of the appeal of this Hub is the ability for West Africans to be able to access and analyze large climate datasets that are stored through Pangeo Cloud, so it would be great if this is possible! Sorry if I didn't communicate that clearly before.

Another question:

paigem commented 3 years ago

Just saw your comment @rabernat - thanks for chiming in.

If these requests are being made "as the user" and will require setting each Hub user up with requester-pays access individually, then I could probably make that work if I have permissions to do that setup myself. But, if they could be done "as the JupyterHub" that would probably be preferable from my end.

damianavila commented 3 years ago

@sgibson91, maybe this one could help: https://discourse.pangeo.io/t/serviceusage-error/706

sgibson91 commented 3 years ago

@sgibson91, maybe this one could help: https://discourse.pangeo.io/t/serviceusage-error/706

Thanks @damianavila! I've run that now - @paigem could you try accessing the data again please?

paigem commented 3 years ago

@sgibson91 I just tried logging back in to the Hub and am still getting the same error message when I try to access the Pangeo cloud data.

rabernat commented 3 years ago

Our existing hubs (e.g. https://us-central1-b.gcp.pangeo.io/; https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/gcp-uscentral1b) are all configured with this permission enabled by default for all users. I think @tomaugspurger configured this, but I might be mistaken.

Edit: I think the permissions are enabled at the jupyter pod level and are part of the default credentials that everyone has when they access GCS.

rabernat commented 3 years ago

I just tried logging back in to the Hub

You probably need to explicitly shut down your notebook server and restart it to change the permissions. Just logging out won't necessarily accomplish this.

sgibson91 commented 3 years ago

Thank you for your help @rabernat! If a restart of the server doesn't help, I will poke around those hubs a little more (this hub is running in the same project for now so hopefully it should transfer over pretty easily)

paigem commented 3 years ago

I have tried explicitly shutting down the notebook kernels and restarting, and creating a new notebook, all with no luck.

sgibson91 commented 3 years ago

@paigem can you go to https://coessing.pangeo.2i2c.cloud/hub/admin, click "stop server" next to your name and try again please? (This is different to killing the kernels, it's more like rebooting your machine)
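
For reference, the admin panel's "stop server" button corresponds to a call in JupyterHub's REST API (`DELETE /hub/api/users/<name>/server`), so the same action can be scripted. The sketch below only builds the request and does not send it. The username and token are placeholders, and it assumes an API token belonging to a hub admin.

```python
from urllib.parse import quote
from urllib.request import Request


def stop_server_request(hub_url, username, api_token):
    """Build (but do not send) the request that stops a user's default server."""
    # Usernames from Google login are email addresses, so escape them for the path.
    user = quote(username, safe="")
    url = f"{hub_url.rstrip('/')}/hub/api/users/{user}/server"
    return Request(url, method="DELETE", headers={"Authorization": f"token {api_token}"})


req = stop_server_request(
    "https://coessing.pangeo.2i2c.cloud",
    "attendee@example.org",       # placeholder username
    "REPLACE_WITH_ADMIN_TOKEN",   # placeholder API token
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

As with the button, this stops the whole user server rather than a single kernel, so the user's next login starts a fresh pod.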

paigem commented 3 years ago

Thanks for specifying how to shut down my server @sgibson91. I have shut down my server and logged in again, and I am still getting the same error.

sgibson91 commented 3 years ago

Ok, thanks for bearing with me there. I will see what I can learn from the Pangeo hub deployments regarding this.

paigem commented 3 years ago

No problem at all! Thanks for helping figure this out!

TomAugspurger commented 3 years ago

This sounds a bit like https://github.com/pangeo-data/pangeo-cloud-federation/issues/615. I have a few comments in that thread with various commands I ran to grant GCP permissions to Google service accounts and link those Google service accounts with Kubernetes Service Accounts (which are used by the hub).

I never did confirm this, but I think there is a potential risk that a user makes requester-pays calls to non-pangeo buckets, which would end up costing money. I never found out if there's a finer-grained way to grant this permission on just certain buckets.

sgibson91 commented 3 years ago

Ok, some good news! I have bootstrapped the pangeo-notebook image so ecco_v4_py is now available! 🎉 In the future, if you need any more packages adding, you can self-serve these in the coessing-image repository by following the instructions in the README