Hi @damianavila. I am not sure if we've deployed Dask and GPU services for a research hub in the shared-cluster scenario. Please let us know if there are any obstacles here.
Based on a discussion with John Clyne at JupyterCon, the NCAR-CISL hub community may benefit from a "best practices with Dask gateway" on-boarding session in early June. There are perhaps 40 NCAR researchers who want to use Dask but only a few are doing so effectively. It would be great if 2i2c (perhaps in collaboration with Dask power users at NCAR) could share some guidance.
I have re-assigned this deployment to @consideRatio. Erik, if you have any doubts about the details of this deployment, please ping @colliand for further details.
@colliand @damianavila I think a daskhub deployment in a shared cluster is an obstacle for three reasons:
1. I'm not sure how we are passing through cloud costs to communities in shared clusters. If dask-gateway is enabled, are we able to track cloud costs to individual communities in the shared cluster well enough? An individual user could incur huge cloud costs. @pnasrat do you have insights about tracking cloud costs towards communities in shared clusters?
2. daskhub deployments let individual users in a community strain cluster-wide infrastructure, and at this time I don't think we are deploying dask-gateway robustly enough to ensure that other communities in a shared dask-gateway cluster aren't badly affected. We have seen prometheus and ingress-nginx go down in the past, and while we have learned from that, I think we should ensure we have a more robust service before trying to provide daskhubs in shared clusters.
3. Requests for specific hardware by communities in a shared cluster are tricky to maintain in practice.
This suggested hardware setup assumes dask-gateway is known to be relevant; see https://github.com/2i2c-org/infrastructure/issues/2545 about me suggesting that we ensure communities only opt for dask-gateway if they expect workloads to need more than an individual powerful machine. The suggestion is (see the config sketch after this list):

- r5.xlarge machines with 4 CPU / 32 GB memory - the default - shared by users requesting a share of 1 / 2 / 4 / 8 / 16 / 32 GB of that memory, and defaulting to a share of for example 1 GB of memory initially.
- r5.4xlarge machines with 16 CPU / 128 GB memory, shared by users requesting a share of 4 / 8 / 16 / 32 / 64 / 128 GB of that memory, and defaulting to a share of for example 4 GB of memory initially.
- A g4dn.2xlarge option providing 8 CPU, 32 GB memory, and a 16 GB NVIDIA T4 Tensor Core GPU - not for use with node sharing.
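To make the node-sharing idea above concrete, here is a minimal sketch of how such a setup is typically expressed as a KubeSpawner profile list with profile options. The display names, memory choices, and node labels below are illustrative placeholders, not the final deployed configuration.

```python
# Illustrative sketch only: a node-shared profile on r5.xlarge machines where
# users pick how large a share of the node's 32 GB memory they get.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Shared r5.xlarge node (4 CPU / 32 GB)",
        "default": True,
        "kubespawner_override": {
            # Schedule user pods onto the r5.xlarge node pool.
            "node_selector": {"node.kubernetes.io/instance-type": "r5.xlarge"},
        },
        "profile_options": {
            "requests": {
                "display_name": "Resource allocation",
                "choices": {
                    "mem_1": {
                        "display_name": "1 GB RAM (default)",
                        "default": True,
                        "kubespawner_override": {"mem_guarantee": "1G", "mem_limit": "1G"},
                    },
                    "mem_2": {
                        "display_name": "2 GB RAM",
                        "kubespawner_override": {"mem_guarantee": "2G", "mem_limit": "2G"},
                    },
                    # ...choices for 4 / 8 / 16 / 32 GB would follow the same pattern.
                },
            },
        },
    },
    # A similar profile would cover r5.4xlarge, plus a non-shared g4dn.2xlarge GPU profile.
]
```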
The requested hardware profile(s) and machine types were: Small: t3.large, Medium: t3.xlarge, Large: t3.2xlarge, Large w/GPU: t2.2xlarge + GPU.
There are a few challenges with this request:
t3 instances and burstable operation
It seems the t3 machines are "burstable" and have a CPU credit system associated with them. We have never operated these before, and I think it could make billing even more complicated; node sharing on them is probably extremely complicated.
More details in https://aws.amazon.com/ec2/instance-types/
t2 instances with GPU
I don't think the t2 instances can have GPUs attached.
Node sharing
The pattern of 1:1 user:node, where nodes are never shared between users, is costly to run in the cloud overall, implies on average longer startup times for users, and is also harder to scale because of cloud quotas.
@colliand @nwehrheim the requested deployment is novel and will likely require more discussion than we can manage before the target start date of June 1.
One path to move forward more quickly is to go with what I feel confident we can deliver without internal discussion:
If that is acceptable, I can start working on such a deployment right away.
If we are concerned about isolation failures and don't have time to fully explore the workloads, I'd suggest in this case creating a dedicated node pool for this hub's Dask clusters in the shared cluster, rather than a shared Dask node pool or running on the same node pools as the shared nodes. I think that should be doable with minimal additional complexity.
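As a rough illustration of that isolation idea, a dask-gateway Kubernetes backend can pin this hub's scheduler and worker pods to a dedicated node pool. The label and taint names below are hypothetical placeholders, not an agreed-on convention.

```python
# Sketch only: pin dask-gateway scheduler/worker pods to a dedicated node pool.
# The "2i2c.org/node-purpose=dask-ncar" label and matching taint are hypothetical
# placeholders for whatever the dedicated node pool would actually use.
dedicated_pool_pod_config = {
    "nodeSelector": {"2i2c.org/node-purpose": "dask-ncar"},
    "tolerations": [
        {
            "key": "2i2c.org/node-purpose",
            "operator": "Equal",
            "value": "dask-ncar",
            "effect": "NoSchedule",
        }
    ],
}

# dask-gateway's Kubernetes backend config (e.g. applied via the chart's extraConfig).
c.KubeClusterConfig.scheduler_extra_pod_config = dedicated_pool_pod_config
c.KubeClusterConfig.worker_extra_pod_config = dedicated_pool_pod_config
```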
We track usage by namespace, so billing, while a little manual, should be fine in the shared cluster - cf. dask-staging.
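For reference, a minimal sketch of the kind of manual per-namespace accounting this implies, assuming a reachable Prometheus with kube-state-metrics; the endpoint URL is a hypothetical placeholder and this is not our actual billing tooling.

```python
# Sketch: pull per-namespace memory requests from Prometheus to attribute
# usage to communities manually. The endpoint below is a placeholder.
import requests

PROMETHEUS_URL = "https://prometheus.example.2i2c.cloud"  # hypothetical

# kube-state-metrics exposes resource requests per container; sum them by namespace.
query = 'sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    namespace = result["metric"].get("namespace", "<none>")
    mem_gb = float(result["value"][1]) / 1e9
    print(f"{namespace}: {mem_gb:.1f} GB of memory requested")
```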
I'm wondering why you are recommending dedicated AWS over dedicated GCP (which we also have). Is there a specific cloud-dependent issue you envisage?
Longer term we should validate any isolation concerns we have against k8s, but that can wait given the deadline for this request.
Currently researchdelight is on its own cluster in us-west-2 and incurs a cost because it is not sharing infrastructure.
If NCAR wants a shared hub, I invite it to share the same cluster that researchdelight is on.
This will allow us to experiment with shared research clusters and reduce the operating costs of researchdelight.
@jmunroe what is the underlying reason behind the shared hub? Is this something NCAR actually cares about, or just something that has been suggested to them?
NCAR chose a shared over dedicated cluster based on price. Proximity to NASA data drove the choice of AWS us-west-2.
Thanks @colliand - reading the running notes and the comment https://github.com/2i2c-org/infrastructure/issues/2523#issuecomment-1546660449 above, it seems that the initial usage of Dask might be low. It probably makes sense to consider placing some guard rails on the supported number of Dask worker nodes in the shared hub for testing. Is that something that the community would be open to?
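A minimal sketch of what such guard rails could look like, assuming they are applied through dask-gateway's server-side cluster limits (for example via the chart's extraConfig); the numbers are placeholders for whatever cap the community is comfortable with.

```python
# Sketch only: cap the size of any single user's dask cluster at the gateway.
# The exact numbers are placeholders, not an agreed-on limit.
c.ClusterConfig.cluster_max_workers = 20       # at most 20 workers per cluster
c.ClusterConfig.cluster_max_cores = 80         # and/or cap total cores
c.ClusterConfig.cluster_max_memory = "640 G"   # and/or cap total memory
```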
I believe that's a good idea. Any objections from @nwehrheim?
The NCAR team just talked about this together.
It sounds like since usage is tracked by namespace, billing won't be an issue for the shared cluster.
Use of node sharing and the instance setup @consideRatio suggested here: https://github.com/2i2c-org/infrastructure/issues/2523#issuecomment-1551183029 seems like the way to go and provides a learning experience.
No objections to putting guardrails on the supported number of Dask worker nodes.
Thanks for the followup @nwehrheim!
Nick, the JupyterHub installation is now up and available, but there are a few points I'd like your help with and feedback on.

1. Could you press the Grant (or Request) button for the NCAR GitHub organization, similar to the grant button next to jupyterhub seen in this image? What we request here is permission to inspect the members of various teams in the GitHub organization NCAR (read:org). Specifically, JupyterHub will check whether the user is a member of the GitHub org NCAR's team 2i2c-cloud-users to decide if the GitHub user should be authorized access (see the config sketch after these questions). Note that if you can't or don't want to press Grant, the login isn't expected to succeed unless your GitHub user's membership of the GitHub team has explicitly been made public ahead of time.
2. Is ncar-cisl.2i2c.cloud acceptable among the choices of <anything>.2i2c.cloud? Note that we can also make use of a domain not managed by 2i2c as an alternative to this.
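For context on point 1, a minimal sketch (not necessarily our exact deployment config) of the team-based authorization described above, using OAuthenticator's GitHub support:

```python
# Sketch of the team-based GitHub authorization described above; the config
# in the deployed hub may differ in detail.
c.JupyterHub.authenticator_class = "github"

# read:org lets the hub check non-public organization/team memberships.
c.GitHubOAuthenticator.scope = ["read:org"]

# Only members of the NCAR org's 2i2c-cloud-users team are authorized.
c.GitHubOAuthenticator.allowed_organizations = {"NCAR:2i2c-cloud-users"}
```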
Thanks @consideRatio for setting this hub up! I look forward to hearing from @nwehrheim and others on the NCAR-CISL team. After we confirm that the hub is operational, we may wish to get together with @jmunroe for an onboarding conversation.
Hi @consideRatio! I authorized 2i2c-org. We are all fine with the ncar-cisl.2i2c.cloud domain name. Thanks for the suggestion. We are also fine with the deviation and will keep you posted if we need to tweak for any reason. Thanks again!
Oh, I forgot to mention that yes, we can see Server options after the grant.
Thank you for the followup @nwehrheim!!
The GitHub handle of the community representative: nwehrheim
Hub important dates:
Hub Authentication Type: GitHub (e.g., @mygithubhandle)
First Hub Administrators:
[GitHub Auth only] How would you like to manage your users? Allowing members of specific GitHub team(s)
[GitHub Teams Auth only] Profile restriction based on team membership: NCAR/2i2c-cloud-users (learned this in the Running Notes doc comment exchange with @nicholascote).
Hub logo image URL: https://www.vmcdn.ca/f/files/longmontleader/import/2017_06_ncar_highres_transparent.png
Hub logo website URL: https://www2.cisl.ucar.edu/
Hub user image GitHub repository: pending
Hub user image tag and name: pending
Extra features you would like to enable:
(Optional) Preferred cloud provider: AWS
(Optional) Billing and Cloud account: None
Other relevant information to the features above: Proposed hub URL: ncar.2i2c.cloud with associated staging.ncar.2i2c.cloud hub.
Tasks to deploy the hub