2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

[Request deployment] New Hub: Climatematch Academy #2524

Closed colliand closed 1 year ago

colliand commented 1 year ago

The GitHub handle of the community representative

@abodner

Hub important dates

Target start date: 2023-06-01 Target end date: 2023-08-31

Heavy usage will take place during the course. The course will run July 17-28 2023.

Hub Authentication Type

GitHub (e.g., @mygithubhandle)

First Hub Administrators

[GitHub Auth only] How would you like to manage your users?

Allowing members of specific GitHub team(s)

[GitHub Teams Auth only] Profile restriction based on team membership

pending

Abigail, can you please point to the GitHub team that Climatematch will use to manage user access to the hub?

Hub logo image URL

https://lh6.googleusercontent.com/pK1Zrf_NmWJ5KqhFB___4p8HPTf4D6u2om5UQkJbVQcwGjDSwlELPibkFfqW809chxybGrQwgiln8v0fRC00fYGzrsb6vIfFtsbh6PetpJKrk_UPoUb-4-RAH6ibtpXyxQ=w1280

Hub logo website URL

https://academy.climatematch.io/

Hub user image GitHub repository

pending, likely best to use latest pangeo image

Hub user image tag and name

pending; likely latest pangeo image

Extra features you would like to enable

(Optional) Preferred cloud provider

AWS

(Optional) Billing and Cloud account

None

Other relevant information to the features above

Climatematch Academy will train a cohort of ~1000 students in computational methods for climate science. The academy is partly inspired by Pangeo and builds on a similar virtual school in Neuroscience created and operated by Neuromatch.

Tasks to deploy the hub

damianavila commented 1 year ago

I have re-assigned this deployment to @yuvipanda. Yuvi, if you have any doubts about the details of this deployment, please ping @colliand for further details.

WesleyTheGeolien commented 1 year ago

Hey,

Nice to meet you all finally (and virtually). I have seen your work in the past and I am a big fan; you have done great things! Congrats!

The GitHub team to use is: https://github.com/orgs/ClimateMatchAcademy/teams/2023students. Is it possible to use two teams with different rights? We trust students less and teaching assistants more: https://github.com/orgs/ClimateMatchAcademy/teams/2023teachingassistants. I am unsure whether we could or would allow different quotas based on the team a user is in. If this cannot be done, then please just use the students team.

As for the Docker image: I will be building a new image based on the pangeo image, since we have a few extra packages and pip installs of custom packages that need adding. I will get this done by Monday next week and will post the image here. I will have the Dockerfile, and I am thinking of pushing to Docker Hub (I know there is a quota for free accounts / orgs: 100 or 200 pulls, from memory). Is there a different service you would need set up? I believe pangeo uses Quay.io (http://quay.io/). I am not quite sure what you need, but I am sure I can provide either the image or the Dockerfile.

Best regards,

Wesley Banfield


yuvipanda commented 1 year ago

Glad to work with you, @WesleyTheGeolien! Yes, we prefer you use quay.io rather than dockerhub! Let us know once the repo + image are setup :)

WesleyTheGeolien commented 1 year ago

Great,

Will do, @yuvipanda. Quick question: will it always pull the latest? E.g., I post an image, then realise I need some extra dependency, so I build and push a new image (potentially with some tag, but the same tag I give you). Will that auto-update the hub (bearing in mind some time to propagate)?

In previous projects I have used watchtower; I don't know if your setup uses something similar?

yuvipanda commented 1 year ago

@WesleyTheGeolien if your hub uses only one image, you will be able to self-configure it as an admin to pull whatever tag you want. We prefer to not use the 'latest' tag, but have the admins change tags when necessary via UI.
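To illustrate the tag policy above, here is a minimal sketch that checks whether an image reference is pinned to something other than 'latest'. The helper names and the `quay.io/example-org/...` reference are hypothetical, and the parsing is simplified; it is not how the hub's admin UI validates tags.

```python
from typing import Tuple


def split_image_ref(ref: str) -> Tuple[str, str]:
    """Split an image reference into (name, tag); the tag defaults to 'latest'."""
    # Drop a digest suffix such as '@sha256:...' if present.
    name_part = ref.split("@", 1)[0]
    # A tag separator is a ':' after the last '/', so that a registry port
    # (e.g. 'localhost:5000/img') is not mistaken for a tag.
    last_slash = name_part.rfind("/")
    colon = name_part.rfind(":")
    if colon > last_slash:
        return name_part[:colon], name_part[colon + 1:]
    return name_part, "latest"


def is_pinned(ref: str) -> bool:
    """Return True when the reference has a digest or a tag other than 'latest'."""
    if "@" in ref:
        return True
    _, tag = split_image_ref(ref)
    return tag != "latest"


# Hypothetical image name and tag, used only for illustration.
print(is_pinned("quay.io/example-org/climatematch-notebook:2023.06.01"))
print(is_pinned("wesleyban/climatematch-notebook"))
```

Pushing a new tag for each rebuild and then switching the hub to that tag keeps it clear which image users are actually running.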

yuvipanda commented 1 year ago

And re: teams, let's just start with allowing access to the students team and see if that is enough?
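For reference, a small sketch that turns a GitHub team URL like the ones posted above into an `org:team` string. It assumes the hub's authenticator takes teams in that form (as OAuthenticator's GitHub `allowed_organizations` option does, to the best of my knowledge). The function name is hypothetical, and whether this deployment uses that exact option is an assumption.

```python
from urllib.parse import urlparse


def team_url_to_spec(url: str) -> str:
    """Convert https://github.com/orgs/<org>/teams/<team> into '<org>:<team>'."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: ['orgs', '<org>', 'teams', '<team-slug>']
    if len(parts) != 4 or parts[0] != "orgs" or parts[2] != "teams":
        raise ValueError(f"not a GitHub team URL: {url!r}")
    return f"{parts[1]}:{parts[3]}"


print(team_url_to_spec(
    "https://github.com/orgs/ClimateMatchAcademy/teams/2023students"
))
```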

WesleyTheGeolien commented 1 year ago

@yuvipanda ok, sounds good.

WesleyTheGeolien commented 1 year ago

Hi @yuvipanda

So I have set up our CI to build the Docker image, and it currently pushes to my personal Docker Hub: https://hub.docker.com/r/wesleyban/climatematch-notebook

We are looking at changing this to quay.io and associating it with climatematch, so it is subject to change in the coming days or weeks. Sorry for the hassle.

If needed, the Dockerfile can be found here: https://github.com/ClimateMatchAcademy/course-content/blob/docker/Dockerfile (currently on the docker branch, but it will be merged into main)

yuvipanda commented 1 year ago

@WesleyTheGeolien thanks! I realize the GCP vs AWS question hasn't been resolved. What kinda data would you be using this with? My inclination is to put this on GCP as that is where our existing shared cluster lives. Any objections?

WesleyTheGeolien commented 1 year ago

@yuvipanda I don't know if you are authorized to say but it would be the "same" or similar datasets to Pangeo, I am not sure where they host?

I guess the main issue is around data access to Climate data sets in the cloud and not having to pay network egress fees.

Otherwise, I have uploaded "small" datasets to OSF -> Climatematch; I am not sure how that would integrate.

There is also the question of whether AWS / GCP allow connections from all countries. We have a substantial amount of students in Iran and China for example would this cause a problem on either of the platforms? If so I guess we choose the other platform!

I have canvassed my team members and will get back with the list of cloud hosted resources we are using.

yuvipanda commented 1 year ago

> similar datasets to Pangeo

Unfortunately this is too broad :( All the current pangeo related hubs (including m2lines) are hosted on GCP, so maybe if that works, this is fine?

> I guess the main issue is around data access to Climate data sets in the cloud and not having to pay network egress fees.

Note that network egress fees aren't paid by you, but by the agency hosting the data.

> I have canvassed my team members and will get back with the list of cloud hosted resources we are using.

This would very much help!

WesleyTheGeolien commented 1 year ago

In that case if all Pangeo is hosted on GCP I think that is fine, please confirm @abodner.

Ahh I thought the egress charges were paid by the hub, that is somewhat a win then!

Here is a list of current datasets being used:

yuvipanda commented 1 year ago

@WesleyTheGeolien picking this back up,

> We have a substantial amount of students in Iran and China for example would this cause a problem on either of the platforms? If so I guess we choose the other platform!

Unfortunately this is totally out of our control, and afaik both cloud platforms are the same here (blocked in Iran, accessible in China).

yuvipanda commented 1 year ago

@WesleyTheGeolien and just to confirm (because you mention use with m2lines), you are not planning on using dask-gateway with this hub?

abodner commented 1 year ago

Correct @yuvipanda, we are not planning to use dask!

yuvipanda commented 1 year ago

@WesleyTheGeolien @abodner check out https://climatematch.2i2c.cloud!

Test it out and lmk how it goes?

colliand commented 1 year ago

Thanks @yuvipanda. FYI @abodner, the ClimateMatch Academy hub is available for testing here: https://climatematch.2i2c.cloud/

yuvipanda commented 1 year ago

@abodner @WesleyTheGeolien if you'd like this to be at hub.climatematch.io, please add a CNAME record pointing hub.climatematch.io to climatematch.2i2c.cloud. I'd like us to keep the staging domain under 2i2c.cloud if that's ok though.
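As a rough sketch of the record described above, the snippet below formats a zone-file style CNAME line. The helper name and the 3600-second TTL are assumptions, and the exact syntax your DNS provider expects may differ.

```python
def cname_record(alias: str, target: str, ttl: int = 3600) -> str:
    """Format a zone-file style CNAME line; names are made fully qualified."""
    def fqdn(name: str) -> str:
        # Zone files use a trailing dot to mark a fully qualified name.
        return name if name.endswith(".") else name + "."

    return f"{fqdn(alias)} {ttl} IN CNAME {fqdn(target)}"


# The TTL is an assumed value; the two host names come from the comment above.
print(cname_record("hub.climatematch.io", "climatematch.2i2c.cloud"))
```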

abodner commented 1 year ago

All sounds good. This is very exciting! Thanks all for being so quick!

abodner commented 1 year ago

@yuvipanda the logo is not ours. I have shared ours in the past but can provide another file.

It would be great if students did not have to go through the additional GitHub grant-access step. I am happy to give you admin rights if that spares students the step.

yuvipanda commented 1 year ago

@abodner ah yes please do provide a URL to a logo I can use! The logo link in this GitHub issue doesn't work :(

And yes, the 'grant' step only needs to happen the very first time. Please grant me admin access, I'll do it and then we can remove my access.

abodner commented 1 year ago

Thanks @yuvipanda you should have admin access now. Let me know when you are finished please, I'd like to limit the number of admins on our side.

abodner commented 1 year ago

@yuvipanda here is a new link to our logo: https://drive.google.com/file/d/1ASKF7CwfkLYWsGjkMrgkvyRNbEdMCrMN/view?usp=sharing

yuvipanda commented 1 year ago

@abodner you can remove my access now, all good now. You should try to get someone with just student team access to login to make sure it works, but it should.

I don't think we can link directly to the google drive link :( Is it already on your website or somewhere we can directly include as an <img> tag maybe?

abodner commented 1 year ago

Thanks @yuvipanda. Can I send you the png for now? We use google sites and I am not sure the logo is stored in a very clever way. CMA_logo_text_transparent

yuvipanda commented 1 year ago

@abodner hmm I'll poke around with it tomorrow if that's ok!

Do test out the memory available to see if that works or we need to increase it!

abodner commented 1 year ago

Sounds great, thanks @yuvipanda ! Are all datasets @WesleyTheGeolien provided available already?

yuvipanda commented 1 year ago

Ah, I haven't done anything related to those. I thought those are all externally provided (by NOAA or GCP or similar) and don't need anything done on our end. Can you verify that, @WesleyTheGeolien?

yuvipanda commented 1 year ago

@abodner I've fixed the logo, check it out.

I'll wait to hear from @WesleyTheGeolien about datasets.

WesleyTheGeolien commented 1 year ago

Ahh sorry everyone somehow missed these notifications.

Hey @yuvipanda, so we had some questions around data on the hub. We use publicly hosted cloud datasets; from my understanding these are fine to interact with, without egress charges (with the potential caveat of needing to be in the same region as they are hosted). However, we also have some other data sets: roughly 20 GB hosted on OSF, as well as a 50-ish GB data set we are still unsure what to do with.

Having every student on the hub pull this data seems a bit redundant. Is there a way to cache data / add data to the hub? I saw some S3 connectivity in the JupyterLab interface. I am just wondering what the best practices are for getting data up there. (I assume baking it into the Docker image is a bad idea: we don't really want 100 GB images.)

yuvipanda commented 1 year ago

@WesleyTheGeolien there is a 'shared-readwrite' directory available that admins can put datasets in, and it is available in a readonly fashion under the 'shared/' directory for everyone else. Think that can work out?
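A minimal sketch of how an admin might stage a dataset in the shared directory described above. It assumes the two directories appear under the user's home directory as `~/shared-readwrite` and `~/shared`; the `datasets` subfolder, the file name, and the helper name are all hypothetical.

```python
import shutil
from pathlib import Path


def publish_dataset(src: Path, shared_rw: Path, subdir: str = "datasets") -> Path:
    """Copy one file into the admin-writable shared directory and return its path."""
    dest_dir = shared_rw / subdir
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    # copy2 also preserves timestamps, which helps when comparing versions later.
    shutil.copy2(src, dest)
    return dest


if __name__ == "__main__":
    # Assumed locations; adjust to wherever the directories are mounted on the hub.
    shared_rw = Path.home() / "shared-readwrite"
    source = Path("example_dataset.nc")  # hypothetical file name
    if shared_rw.is_dir() and source.is_file():
        print(publish_dataset(source, shared_rw))
```

Students would then read the same file from the read-only side, e.g. `~/shared/datasets/example_dataset.nc` under these assumptions.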

WesleyTheGeolien commented 1 year ago

Thanks @yuvipanda that should work out.

Another quick question: I have someone testing the hub. From my understanding each user has a provision of ~12 GB of RAM, but at the bottom (near the left) of the screen it says 2 GB, and they are complaining that loading an 800 MB file into memory is crashing the hub. Is this expected?

cheers

image

colliand commented 1 year ago

The climatematch logo is not rendering as the splash image on the login page: https://climatematch.2i2c.cloud/hub/login. FYI @yuvipanda.

yuvipanda commented 1 year ago

@WesleyTheGeolien as I mentioned in https://github.com/2i2c-org/infrastructure/issues/2524#issuecomment-1572520388, I have actually provided only 2 GB of RAM right now. The m2lines 'small' profile is about 7 GB; want me to bump that up?
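For anyone wanting to check the limit from inside a notebook, here is a hedged sketch that reads the container memory limit from the standard Linux cgroup files and applies a rough "will this file fit" check. The 3x expansion factor is an assumed rule of thumb and not a measured value. The function names are hypothetical, and whether these cgroup files are visible on this hub is an assumption.

```python
from pathlib import Path
from typing import Iterable, Optional

CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def read_memory_limit(paths: Iterable[str] = CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the first numeric memory limit found, in bytes, or None."""
    for p in paths:
        try:
            raw = Path(p).read_text().strip()
        except OSError:
            continue
        # cgroup v2 reports the literal string "max" when no limit is set.
        if raw.isdigit():
            return int(raw)
    return None


def likely_fits(file_bytes: int, limit_bytes: int, expansion: float = 3.0) -> bool:
    """Rough check: assume the loaded data needs `expansion` times the file size."""
    return file_bytes * expansion <= limit_bytes


limit = read_memory_limit()
if limit is not None:
    print(f"memory limit: {limit / 1024**3:.1f} GiB")
    print("800 MiB file likely fits:", likely_fits(800 * 1024**2, limit))
```

Under that assumed factor, an 800 MiB file would not fit in a 2 GiB limit but would fit in 7 GiB.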

WesleyTheGeolien commented 1 year ago

Ahh thanks @yuvipanda, I didn't see that. Yep, we are getting crashes when running our tutorials, so bumping to 7 GB would be great. Out of interest, are these arbitrary values or set steps?

yuvipanda commented 1 year ago

@WesleyTheGeolien alright, bumped now in https://github.com/2i2c-org/infrastructure/pull/2665!

yuvipanda commented 1 year ago

@WesleyTheGeolien @abodner I'm going to close this issue now, as the hub is up and running. Please email support@2i2c.org if you have any more issues! And definitely let us know at least 2 weeks before any major events with information on how many people you expect, so we can size up your nodes accordingly.