2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

New Hub: OceanHackWeek 2021 #549

Closed: abkfenris closed this issue 3 years ago

abkfenris commented 3 years ago

Hub Description

OceanHackWeek (OHW) is a 4-day collaborative learning experience aimed at exploring, creating and promoting effective computation and analysis workflows for large and complex oceanographic data. It includes tutorials, data exploration, software development, collaborative projects and community networking.

We will be using the hub to teach tutorials and develop projects with both in-person (EST) and worldwide participants.

Community Representative

@ocefpaf

Important dates

Target start date

2021-07-28

Preferred Cloud Provider

No preference (default)

Do you have your own billing account?

Hub Authentication Type

GitHub Authentication (e.g., @mygithubhandle)

Hub logo

No response

Hub logo URL

No response

Hub image service

hub.docker.com

Hub image

uwhackweeks/oceanhackweek:28d1c7b

Extra features you'd like to enable

Hub Engineer information

The Hub Engineer should fill in the metadata below when it is available. The Community Representative shouldn't worry about this section, but may be asked to help answer some questions.

Deployment information

Hub ID: ohw

Hub Cluster: pilot

Hub URL: ohw.pilot.2i2c.cloud

Hub Template: daskhub

Actions to deploy

ocefpaf commented 3 years ago

@choldgraf please let us know when/how we can test it. (Folks are getting anxious to pre-test their tutorials on the hub.)

choldgraf commented 3 years ago

Sounds good - will try and deploy the hub tomorrow. (We are all on a European time zone currently)

choldgraf commented 3 years ago

(also just to clarify, the target start date listed was the 28th, do you need the hub earlier than this?)

ocefpaf commented 3 years ago

(also just to clarify, the target start date listed was the 28th, do you need the hub earlier than this?)

If we can get it on the 27th (tomorrow), that would be nice so we can have the instructors test their notebooks against it. The 28th would be tight, but it works too.

choldgraf commented 3 years ago

Not quite ready to close this yet! We need confirmation from @ocefpaf that all seems well :-)

@ocefpaf see the hub URL above (https://ohw.pilot.2i2c.cloud/) and confirm you can log in etc!

ocefpaf commented 3 years ago

Awesome! I was able to log in (super fast) and I'll play with it ASAP. I'll probably return with tons of questions. I'll try to read the docs first ;-p

ocefpaf commented 3 years ago

@choldgraf first question, and a simple one: how can I add/authorize people to log in?


Edit: Sorry, read the docs and doing it now.

choldgraf commented 3 years ago

I hope the lack of extra questions means you figured out how to do stuff as an admin, and not that things have gone down in flaming glory 😬🔥

also I added @GeorgianaElena on this one to track who is working on this hub deploy!

ocefpaf commented 3 years ago

I hope the lack of extra questions means you figured out how to do stuff as an admin, and not that things have gone down in flaming glory

Yes. Surprisingly easy to manage so far. Great work! I'll experiment with adding packages today.

also I added @GeorgianaElena on this one to track who is working on this hub deploy!

Good to know. Thanks @GeorgianaElena!

choldgraf commented 3 years ago

@ocefpaf is this hub now ready to go from your end? we'd like to close out this issue if all looks OK

ocefpaf commented 3 years ago

Yes. Please close it. I cannot because @abkfenris created it. Any comments/feedback @abkfenris ?

abkfenris commented 3 years ago

We have some late-breaking issues with Dask, but that may just be a package we need to add to our image.

yuvipanda commented 3 years ago

You can experimentally change the image deployed to your hub at https://ohw.pilot.2i2c.cloud/services/configurator/. After building and pushing your image, try the new image tag there? Some preliminary docs at https://pilot.2i2c.org/en/latest/admin/howto/configurator.html

abkfenris commented 3 years ago

Ya, we've been playing with adjusting the image in the configurator as we get requests for new packages. I think we were missing dask-gateway and distributed, which I'm building an image for now. Anything else that we may be missing from our environment?

It would be sweet if there were a webhook endpoint for the configurator that we could use to adjust the image, or if we could do gitops-ish things against https://github.com/2i2c-org/pilot-hubs/blob/cc71cbd47bf79c90e96a86d2983bfaed51ba3703/config/hubs/2i2c.cluster.yaml#L108-L110

abkfenris commented 3 years ago

After

from dask_gateway import GatewayCluster
cluster = GatewayCluster()
cluster.scale(4)

it can take about 5 min to scale up since we basically conda install * in our image.
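
For reference, a minimal way to time the spin-up from a notebook (a sketch, assuming the hub's default dask-gateway setup):

import time
from dask_gateway import GatewayCluster

cluster = GatewayCluster()      # connects to the gateway preconfigured on the hub
client = cluster.get_client()

start = time.time()
cluster.scale(4)                # ask for 4 workers
client.wait_for_workers(4)      # blocks until all 4 workers have registered
print(f"scale-up took {time.time() - start:.0f}s")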

I've done some work trying to slim down the image (it's 5.5 GB now), but it's mainly the variety of conda packages that our tutorials or dask users may need.

The other way to speed things up would be to have images closer to the hub. From poking around the repo, it looks like the cluster is in zone us-central1-b, right?

Does 2i2c have a Google Artifact/Container Registry that we could push images to? I'm also inquiring about whether we have access to a Google Cloud project where we could run one ourselves.

ocefpaf commented 3 years ago

@GeorgianaElena I'm getting a dead kernel when I try to load the dataset in the last line of this notebook:

https://nbviewer.jupyter.org/gist/ocefpaf/d9253a4dcd74ee651bf55598044d9cf1

Everything works OK in a fresh pull of our image locally.

yuvipanda commented 3 years ago

@ocefpaf I'm guessing that's because you don't have enough RAM. Do you have a sense of how much RAM your notebooks might need? I think the default is pretty small (1G) and that might be it?

I'm bumping it up to 4G for the duration of the workshop - turn your server on / off and give it a shot?
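
If it helps to verify the new limit from inside a running server, here is a quick check (a sketch; it reads the cgroup v1 path, which is an assumption about the node image and differs under cgroup v2):

# Read the pod's memory limit from the cgroup (cgroup v1 path, an assumption).
with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
    limit_bytes = int(f.read())

print(f"memory limit: {limit_bytes / 2**30:.1f} GiB")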

ocefpaf commented 3 years ago

I'm bumping it up to 4G for the duration of the workshop - turn your server on / off and give it a shot?

4G sounds reasonable. I'm testing it and I'll get back to you.

ocefpaf commented 3 years ago

I'm testing it and I'll get back to you.

That did fix the issue with r-keras dataset loading. Thanks! However, @jbusecke may require more. He is testing and will comment here ASAP.

jbusecke commented 3 years ago

Hey everyone, if 8GB would be possible that would be amazing. I just tried my notebook on the hub and it crashes. Works fine on the pangeo staging deployment with 8GB.

abkfenris commented 3 years ago

We used an 8 GB limit and a 7 GB guarantee last year.

https://github.com/oceanhackweek/jupyterhub/blob/d6eef3fc131ca04fff1eb6cf22ba8a1263415bc6/deployments/ohw-hub/config/common.yaml#L20-L22

yuvipanda commented 3 years ago

@jbusecke @abkfenris I bumped it to 7G/8G. Try it now?

jbusecke commented 3 years ago

Yes, that works like a charm. Many thanks.

yuvipanda commented 3 years ago

@abkfenris for dask-gateway performance, I am going to do the following:

  1. Provision a new node pool for notebook and dask pods, specifically for ohw
  2. Enable the image pre-puller for both the node pools. This requires us to specify the image in the repo, and not just via the configurator. So give me a ping here once the image tag stabilizes?
  3. Use a node placeholder to keep a minimum of a few nodes of headroom for the dask pods and the notebook pods

This should help with bringing up new dask clusters much faster.

abkfenris commented 3 years ago

Awesome, thank you.

Hopefully our Dask workers won't need to evolve too much from: ghcr.io/oceanhackweek/jupyer-image:0be4cbe

yuvipanda commented 3 years ago

@abkfenris can you tell me how much RAM you usually specify for your dask workers? Will help size the node too.

abkfenris commented 3 years ago

Last year I believe we used the Pangeo binder to teach the Dask section, but I would probably size them the same way.

It probably makes more sense to people if each worker is the same size as their current server; then it's 'if I can do X amount of work in Y time on one machine, then with Z workers I can do X * Z work in Y time'.
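
For what it's worth, if the gateway exposes worker resource options (the option names below are hypothetical; the actual fields are defined by the hub's dask-gateway config), a user could pick a worker size explicitly:

from dask_gateway import Gateway

gateway = Gateway()                   # uses the hub's preconfigured gateway address
options = gateway.cluster_options()   # available fields come from the gateway config
options.worker_memory = 8             # hypothetical option name, in GiB
options.worker_cores = 2              # hypothetical option name
cluster = gateway.new_cluster(options)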

yuvipanda commented 3 years ago

I have to head off to bed now, and will complete the following tasks tomorrow:

  1. Set the dask worker memory request / limit to match what we provide the notebook pods
  2. Update the image tag in this repo to what you provided
  3. Enable the continuous pre-puller for both the dask and notebook worker nodes
  4. Enable a node placeholder, so we keep a few 'warm' nodes around for the workshop
  5. Possibly resize the nodes to make scale-up faster

Do you have a sense of how many users you are expecting?

abkfenris commented 3 years ago

I believe we have around 75 participants and about 25 instructors/helpers. There is some spread around the world, so some folks may not be working concurrently, but the largest number of folks will be signed on from 2-5 PM EST for the synchronous tutorials.

Last year we tried to badger folks to shut down after the tutorials, even if they were planning on working on projects right afterwards, to give the cluster a chance to scale down and re-bin-pack more efficiently.

yuvipanda commented 3 years ago

Awesome. I'll size things appropriately.

I did a quota check and noticed that we might not have enough IP address or SSD disk quota to spin up more than 8 nodes! I've requested a bump to 64 nodes for now; hopefully that gets approved quickly. If not, we can just spin up big nodes! This will also be helped in the near term by #538

yuvipanda commented 3 years ago

I also checked the size of our NFS volume:

$ gcloud compute ssh nfs-server-01 --zone=us-central1-b
$ df -h
/dev/sdb        100G   71G   30G  71% /export/home-01

I'll probably expand it to 200G

abkfenris commented 3 years ago

Sweet, thank you.

If only Filestore didn't start with a minimum size of 1 TB.

yuvipanda commented 3 years ago

The quota increases have been approved! \o/ I tested scaling up the node pool though, and found we'd need to increase CPU quota if we wanted more than 30 or so nodes. That should be enough for your workshop, but I asked for an increase anyway!

yuvipanda commented 3 years ago

It was immediately approved, so yay

yuvipanda commented 3 years ago

I increased the size of the NFS volume by:

  1. Manually increasing the size of the volume via the console - https://console.cloud.google.com/compute/disksDetail/zones/us-central1-b/disks/low-touch-hubs-home-01?project=two-eye-two-see
  2. Growing the XFS filesystem holding the home directories

    $ gcloud compute ssh nfs-server-01 --zone=us-central1-b
    $ sudo xfs_growfs /export/home-01/

This has grown the volume to 300G, which is good enough for now.

ocefpaf commented 3 years ago

Hopefully our Dask workers won't need to evolve too much from: ghcr.io/oceanhackweek/jupyer-image:0be4cbe

@jbusecke just made a change to accommodate the latest dev install of cmip6_preprocessing. So let's go with 9efd4fb instead. Hopefully that was the last change :sweat_smile:

yuvipanda commented 3 years ago

Unfortunately I won't be able to set up the node placeholders until later today. The quotas and stuff are set up tho, and I tested that we can scale up to at least 50 nodes

abkfenris commented 3 years ago

A slow startup on the first day will help drive home the point that folks should log in early.

I think getting crazy with Dask doesn't happen until the visualization session tomorrow, but we haven't structured our schedule around which exact packages/resources are getting used by what tutorial.

ocefpaf commented 3 years ago

Folks, we are hitting an odd issue. There is a data access protocol, very common for oceanographic data, named OPeNDAP. It works locally on the same Docker image, with exactly the same packages, but it fails on the JupyterHub. The steps to reproduce are:

from netCDF4 import Dataset
url = "http://goosbrasil.org:8080/pirata/B19s34w.nc"  # any OPeNDAP URL will fail with an odd curl error.
nc = Dataset(url)

Any advice on how we can even debug this?

abkfenris commented 3 years ago

Hmm, if I try r = requests.get("http://goosbrasil.org:8080/pirata/B19s34w.nc"), I get "Max retries exceeded" and TimeoutErrors.

abkfenris commented 3 years ago

If I try from our OPeNDAP server (which I have never actually used in anger before), that works for me:

import xarray as xr
from netCDF4 import Dataset

ds = xr.open_dataset("http://www.neracoos.org/opendap/A0143/A0143.met.realtime.nc")
ds
nc = Dataset("http://www.neracoos.org/opendap/A0143/A0143.met.realtime.nc")
nc

yuvipanda commented 3 years ago

@abkfenris it's possible that port 8080 outbound is turned off, let me investigate
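
A quick way to check this from a hub notebook (a sketch; plain TCP connectivity, independent of netCDF/curl):

import socket

# If this times out on the hub but works locally, outbound 8080 is being blocked.
try:
    sock = socket.create_connection(("goosbrasil.org", 8080), timeout=10)
    print("port 8080 is reachable")
    sock.close()
except OSError as e:
    print("connection failed:", e)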

yuvipanda commented 3 years ago

@abkfenris @ocefpaf there was an outbound port restriction. I opened ports 8080 and 22 (https://github.com/2i2c-org/pilot-hubs/pull/576), and this seems to work now.

yuvipanda commented 3 years ago

OK, so I've set up node placeholders (PR coming soon) to keep 2 spare notebook nodes and 3 spare dask worker nodes, with the images pre-pulled. Can you test out dask spinup time now?

ocefpaf commented 3 years ago

Thanks so much Yuvi!

yuvipanda commented 3 years ago

@ocefpaf yw! How was the dask-gateway spinup time?

ocefpaf commented 3 years ago

How was the dask-gateway spinup time?

I did not test it myself but the projects will start today and folks will report how it goes. I'll be sure to get back to you as soon as we know.

PS: Quick question. What is the best practice to allow folks to create conda environments in the hub? Giving them permission to write at /srv/conda does not sound like a good idea :-/

yuvipanda commented 3 years ago

Giving them permission to write at /srv/conda does not sound like a good idea :-/

This is actually my preferred method - repo2docker does this too. Putting it in $HOME is probably just going to be super slow thanks to NFS. If their container goes wonky, they can simply restart the server. It won't persist past restarts though :(
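
A quick sanity check from a notebook (a sketch; the /srv/conda layout is an assumption about the image) to confirm the active environment is writable before running conda install / conda create:

import os
import sys

print("active prefix:", sys.prefix)              # e.g. an env under /srv/conda
print("writable:", os.access(sys.prefix, os.W_OK))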

ocefpaf commented 3 years ago

Good to know that not all my ideas are bad :smile:

(I tried it and it worked. Thanks!)

BTW, we have two OPeNDAP servers in our demos that use port 8080. One works, but the other one ("http://goosbrasil.org:8080/pirata/B19s34w.nc") still times out. Not sure if that is a problem with the server or the hub. It does work locally for me. However, this is not a pressing issue, so do not worry too much about it unless it is an easy fix.

yuvipanda commented 3 years ago

@ocefpaf I can't access http://goosbrasil.org:8080/pirata/B19s34w.nc from my local computer either - it just hangs and times out. Maybe it's restricted to specific networks, if it works for you?