@choldgraf please let us know when/how we can test it. (Folks are getting anxious to pre-test their tutorials on the hub.)
Sounds good - will try and deploy the hub tomorrow. (We are all on a European time zone currently)
(also just to clarify, the target start date listed was the 28th, do you need the hub earlier than this?)
> (also just to clarify, the target start date listed was the 28th, do you need the hub earlier than this?)
If we can get it tomorrow, the 27th, that would be nice so we can have the instructors test their notebooks against it. The 28th would be tight, but it works too.
Not quite ready to close this yet! We need confirmation from @ocefpaf that all seems well :-)
@ocefpaf see the hub URL above (https://ohw.pilot.2i2c.cloud/) and confirm you can log in etc!
Awesome! I was able to log in (super fast) and I'll play with it ASAP. I'll probably return with tons of questions. I'll try to read the docs first ;-p
@choldgraf first question, and a simple one: how can I add/authorize people to log in?
Edit: Sorry, I read the docs and I'm doing it now.
I hope the lack of extra questions means you figured out how to do stuff as an admin, and not that things have gone down in flaming glory 😬🔥
also I added @GeorgianaElena on this one to track who is working on this hub deploy!
> I hope the lack of extra questions means you figured out how to do stuff as an admin, and not that things have gone down in flaming glory
Yes. Surprisingly easy to manage so far. Great work! I'll experiment with adding packages today.
> also I added @GeorgianaElena on this one to track who is working on this hub deploy!
Good to know. Thanks @GeorgianaElena!
@ocefpaf is this hub now ready to go from your end? we'd like to close out this issue if all looks OK
Yes. Please close it. I cannot because @abkfenris created it. Any comments/feedback @abkfenris ?
We have some late breaking issues with Dask, but that may be a package we need in our image.
You can experimentally change the image deployed to your hub at https://ohw.pilot.2i2c.cloud/services/configurator/. After building and pushing your image, try the new image tag there? Some preliminary docs at https://pilot.2i2c.org/en/latest/admin/howto/configurator.html
Ya, we've been playing with adjusting the image in the configurator as we get requests for new packages. I think we were missing `dask-gateway` and `distributed`, which I'm building an image for now. Anything else that we may be missing from our environment?
It would be sweet if there was a webhook endpoint for the configurator we could use to adjust the image, or if we could do gitops-ish things against https://github.com/2i2c-org/pilot-hubs/blob/cc71cbd47bf79c90e96a86d2983bfaed51ba3703/config/hubs/2i2c.cluster.yaml#L108-L110
After

```python
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
cluster.scale(4)
```

it can take about 5 min to scale up, since we basically `conda install *` in our image.
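For reference, a minimal way to time the scale-up from a notebook might look like the sketch below (not exactly what we ran; it just blocks until the requested workers register with the scheduler):

```python
import time

from dask_gateway import GatewayCluster

cluster = GatewayCluster()
cluster.scale(4)
client = cluster.get_client()

start = time.perf_counter()
client.wait_for_workers(4)  # blocks until all four workers have registered with the scheduler
print(f"workers ready after {time.perf_counter() - start:.0f}s")
```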
I've done some work trying to slim down the image (it's 5.5 GB now), but it's mainly the variety of conda packages that our tutorials or dask users may need.
The other way to speed things up would be to have images closer to the hub. From poking around the repo, it looks like the hub is in zone `us-central1-b`, right?
Does 2i2c have a Google Artifact/Container Registry that we could push images to? I'm also checking whether we have access to a Google Cloud project where we could run one ourselves.
@GeorgianaElena I'm getting a dead kernel when I try to load the dataset in the last line of this notebook:
https://nbviewer.jupyter.org/gist/ocefpaf/d9253a4dcd74ee651bf55598044d9cf1
Everything works OK in a fresh pull of our image locally.
@ocefpaf I'm guessing that's because you don't have enough RAM. Do you have a sense of how much RAM your notebooks might need? I think the default is pretty small (1G) and that might be it?
I'm bumping it up to 4G for the duration of the workshop - turn your server on / off and give it a shot?
> I'm bumping it up to 4G for the duration of the workshop - turn your server on / off and give it a shot?
4G sounds reasonable. I'm testing it and I'll get back to you.
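In case it helps anyone else verify, a quick way to check the memory limit from inside a running server is to read the container's cgroup limit (a sketch; it assumes the cgroup v1 path, with a fallback for v2):

```python
from pathlib import Path

# Container memory limit: cgroup v1 path first, falling back to cgroup v2
v1 = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
v2 = Path("/sys/fs/cgroup/memory.max")
raw = (v1.read_text() if v1.exists() else v2.read_text()).strip()
limit_bytes = None if raw == "max" else int(raw)
print(f"memory limit: {limit_bytes / 2**30:.1f} GiB" if limit_bytes else "no memory limit set")
```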
> I'm testing it and I'll get back to you.
That did fix the issue with r-keras dataset loading. Thanks! However, @jbusecke may require more. He is testing and will comment here ASAP.
Hey everyone, if 8GB would be possible that would be amazing. I just tried my notebook on the hub and it crashes. Works fine on the pangeo staging deployment with 8GB.
We used an 8 GB limit and a 7 GB guarantee last year.
@jbusecke @abkfenris I bumped it to 7G/8G. Try it now?
Yes, that works like a charm. Many thanks.
@abkfenris for dask-gateway performance, I am going to do the following:
This should help with bringing up new dask clusters much faster.
Awesome, thank you.
Hopefully our Dask workers won't need to evolve too much from: ghcr.io/oceanhackweek/jupyer-image:0be4cbe
@abkfenris can you tell me how much RAM you usually specify for your dask workers? Will help size the node too.
Last year I believe we used the Pangeo binder to teach the Dask section, but I would probably size them the same way.
It probably makes more sense to people if each worker is the same size as their current server; then it's "oh, if I can do X work in Y time on one machine, and I ask for Z workers, I can do X * Z work in Y time."
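If the gateway ends up exposing per-cluster options, participants could even pick worker sizes themselves. A rough sketch is below; the `worker_cores` / `worker_memory` option names are assumptions that depend on how the gateway backend is configured:

```python
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()   # whichever options the gateway admin has exposed
options.worker_cores = 2              # assumed option name; only valid if the backend defines it
options.worker_memory = 4             # assumed option name; value in GiB
cluster = gateway.new_cluster(options)
cluster.scale(4)
```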
I have to head off to bed now, and will complete the following tasks tomorrow:
Do you have a sense of how many users you are expecting?
I believe we have around 75 participants and about 25 instructors/helpers. There is some spread around the world, so some folks may not be working concurrently, but from 2-5 PM EST the largest number of folks will be signed on for the synchronous tutorials.
Last year we tried to badger folks to shut down after the tutorials, even if they were planning on working on projects right afterwards, to give the cluster a chance to scale down and bin-pack more efficiently.
Awesome. I'll size things appropriately.
I did a quota check and noticed that we might not have enough IP address quota or SSD disk quota to spin up more than 8 nodes! I've requested a bump to 64 nodes for now; hopefully that gets approved quickly. If not, we can just spin up big nodes! This will also be helped in the near term by #538
I also checked the size of our NFS volume:

```
$ gcloud compute ssh nfs-server-01 --zone=us-central1-b
$ df -h
/dev/sdb        100G   71G   30G  71% /export/home-01
```
I'll probably expand it to 200G
Sweet, thank you.
If only Filestore didn't start with a min size of a TB.
The quota increases have been approved! \o/ I tested scaling up the node pool though, and found we'd need to increase the CPU quota if we wanted more than 30 or so nodes. That should be enough for your workshop, but I asked for an increase anyway!
It was immediately approved, so yay
I increased the size of the NFS volume by growing the XFS volume holding the home directories:

```
$ gcloud compute ssh nfs-server-01 --zone=us-central1-b
$ sudo xfs_growfs /export/home-01/
```

This has grown the volume to 300G, which is good enough for now.
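For anyone who wants to double-check from inside their server, free space on the home mount can be read with `shutil.disk_usage` (a minimal sketch that assumes home directories are mounted at the usual `/home/jovyan`):

```python
import shutil

# Free space on the home-directory mount; /home/jovyan is the usual JupyterHub mount point
usage = shutil.disk_usage("/home/jovyan")
print(f"total {usage.total / 2**30:.0f}G  used {usage.used / 2**30:.0f}G  free {usage.free / 2**30:.0f}G")
```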
> Hopefully our Dask workers won't need to evolve too much from: ghcr.io/oceanhackweek/jupyer-image:0be4cbe
@jbusecke just made a change to accommodate the latest dev install of cmip6_preprocessing, so let's go with `9efd4fb` instead. Hopefully that was the last change :sweat_smile:
Unfortunately I won't be able to set up the node placeholders until later today. The quotas and stuff are set up tho, and I tested that we can scale up to at least 50 nodes
A slow startup on the first day will help drive home the point that folks should log in early.
I think getting crazy with Dask doesn't happen until the visualization session tomorrow, but we haven't structured our schedule around which exact packages/resources are getting used by what tutorial.
Folks, we are hitting an odd issue. There is a data access protocol, very common for oceanographic data, named OPeNDAP. It works locally on the same Docker image, with exactly the same packages, but it fails on the JupyterHub. The steps to reproduce are:

```python
from netCDF4 import Dataset

url = "http://goosbrasil.org:8080/pirata/B19s34w.nc"  # any OPeNDAP URL will fail with an odd curl error
nc = Dataset(url)
```

Any advice on how we can even debug this?
Hmm, if I try `r = requests.get("http://goosbrasil.org:8080/pirata/B19s34w.nc")`, I get max retries exceeded and `TimeoutError`s.
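To separate basic network reachability from the netCDF/curl layer, a check like this might help narrow things down (a sketch; it only tries to open a raw TCP connection to the host on port 8080):

```python
import socket

# Can the pod open a plain TCP connection to the OPeNDAP host on port 8080?
host, port = "goosbrasil.org", 8080
try:
    with socket.create_connection((host, port), timeout=10):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    print(f"TCP connection to {host}:{port} failed: {exc}")
```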
If I try from our OPeNDAP server (which I have never actually used in anger before), that works for me:

```python
import xarray as xr
from netCDF4 import Dataset

ds = xr.open_dataset("http://www.neracoos.org/opendap/A0143/A0143.met.realtime.nc")
ds

nc = Dataset("http://www.neracoos.org/opendap/A0143/A0143.met.realtime.nc")
nc
```
@abkfenris it's possible that port 8080 outbound is turned off, let me investigate
@abkfenris @ocefpaf there was an outbound port restriction. I opened port 8080 and 22 (https://github.com/2i2c-org/pilot-hubs/pull/576), and this seems to work now.
ok so I've set up node placeholders (PR coming soon) to have 2 spare notebook nodes and 3 spare dask worker nodes, with the images pre-pulled. Can you test out dask spinup time now?
Thanks so much Yuvi!
@ocefpaf yw! How was the dask-gateway spinup time?
> How was the dask-gateway spinup time?
I did not test it myself but the projects will start today and folks will report how it goes. I'll be sure to get back to you as soon as we know.
PS: Quick question: what is the best practice to allow folks to create conda environments on the hub? Giving them permission to write to `/srv/conda` does not sound like a good idea :-/
> Giving them permission to write to `/srv/conda` does not sound like a good idea :-/
This is actually my preferred method - repo2docker does this too. Putting it in $HOME is probably just going to be super slow thanks to NFS. If their container goes wonky, they can simply restart the server. It won't persist past restarts though :(
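If you do point folks at it, a quick sanity check from a notebook that the prefix is actually writable might look like this (a minimal sketch, assuming the default `/srv/conda` prefix):

```python
import os

# Sanity check: is the image's conda prefix writable by the notebook user?
prefix = "/srv/conda"
envs = os.path.join(prefix, "envs")
print("prefix writable:", os.access(prefix, os.W_OK))
print("envs dir exists:", os.path.isdir(envs), "| writable:", os.access(envs, os.W_OK))
```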
Good to know that not all my ideas are bad :smile:
(I tried it and it worked. Thanks!)
BTW, we have two OPeNDAP servers in our demo that use port 8080. One worked; the other one (http://goosbrasil.org:8080/pirata/B19s34w.nc) still times out. Not sure if that is a problem with the server or the hub. It does work locally for me. However, this is not a pressing issue, so don't worry too much about it unless it is an easy fix.
@ocefpaf I can't access http://goosbrasil.org:8080/pirata/B19s34w.nc from my local computer either - it just hangs and times out. Maybe it's restricted to specific networks, if it works for you?
Hub Description
OceanHackWeek (OHW) is a 4-day collaborative learning experience aimed at exploring, creating and promoting effective computation and analysis workflows for large and complex oceanographic data. It includes tutorials, data exploration, software development, collaborative projects and community networking.
We will be using the hub to teach tutorials and develop projects with both in-person (EST) and worldwide participants.
Community Representative
@ocefpaf
Important dates
Target start date
2021-07-28
Preferred Cloud Provider
No preference (default)
Do you have your own billing account?
Hub Authentication Type
GitHub Authentication (e.g., @mygithubhandle)
Hub logo
No response
Hub logo URL
No response
Hub image service
hub.docker.com
Hub image
uwhackweeks/oceanhackweek:28d1c7b
Extra features you'd like to enable
Hub Engineer information
The Hub Engineer should fill in the metadata below when it is available. The Community Representative shouldn't worry about this section, but may be asked to help answer some questions.
Deployment information

Hub ID: ohw
Hub Cluster: pilot
Hub URL: ohw.pilot.2i2c.cloud
Hub Template: daskhub

Actions to deploy