Hi @damianavila. I am not sure if we've deployed Dask and GPU services for a research hub in the shared-cluster scenario. Please let us know if there are any obstacles here.
Based on a discussion with John Clyne at JupyterCon, the NCAR-CISL hub community may benefit from a "best practices with Dask gateway" on-boarding session in early June. There are perhaps 40 NCAR researchers who want to use Dask but only a few are doing so effectively. It would be great if 2i2c (perhaps in collaboration with Dask power users at NCAR) could share some guidance.
I have re-assigned this deployment to @consideRatio. Erik, if you have any doubts about the details of this deployment, please ping @colliand for further details.
@colliand @damianavila I think a daskhub deployment in a shared cluster is an obstacle for three reasons:
1. I'm not sure how we are passing through cloud costs to communities in shared clusters. If dask-gateway is enabled, are we able to track cloud costs to individual communities in the shared cluster well enough? An individual user could incur huge cloud costs. @pnasrat do you have insights about tracking cloud costs towards communities in shared clusters?
2. daskhub deployments let individual users in a community strain cluster-wide infrastructure, and at this time I don't think we are deploying dask-gateway robustly enough to ensure that other communities in a shared dask-gateway cluster aren't badly affected. We have seen prometheus and ingress-nginx go down in the past, and while we have learned from that, I think we should ensure we have a more robust service before trying to provide daskhubs in shared clusters.
3. Requests for specific hardware by communities in a shared cluster are tricky to maintain in practice.
This suggested hardware setup assumes dask-gateway is known to be relevant; see https://github.com/2i2c-org/infrastructure/issues/2545 about me suggesting that we ensure communities only opt for dask-gateway if they expect workloads to need more than an individual powerful machine. The suggestion is (see the config sketch after this list):

- r5.xlarge machines with 4 CPU / 32 GB memory - the default - shared by users requesting a share of 1 / 2 / 4 / 8 / 16 / 32 GB of that memory, and defaulting to a share of for example 1 GB of memory initially.
- r5.4xlarge machines with 16 CPU / 128 GB memory, shared by users requesting a share of 4 / 8 / 16 / 32 / 64 / 128 GB of that memory, and defaulting to a share of for example 4 GB of memory initially.
- A g4dn.2xlarge option providing 8 CPU, 32 GB memory, and a 16 GB NVIDIA T4 Tensor Core GPU - not for use with node sharing.
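To make the node-sharing idea above concrete, here is a minimal sketch of how such a setup is typically expressed as a KubeSpawner profile list with profile options. The display names, memory choices, and node labels below are illustrative placeholders, not the final deployed configuration.

```python
# Illustrative sketch only: a node-shared profile on r5.xlarge machines where
# users pick how large a share of the node's 32 GB memory they get.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Shared r5.xlarge node (4 CPU / 32 GB)",
        "default": True,
        "kubespawner_override": {
            # Schedule user pods onto the r5.xlarge node pool.
            "node_selector": {"node.kubernetes.io/instance-type": "r5.xlarge"},
        },
        "profile_options": {
            "requests": {
                "display_name": "Resource allocation",
                "choices": {
                    "mem_1": {
                        "display_name": "1 GB RAM (default)",
                        "default": True,
                        "kubespawner_override": {"mem_guarantee": "1G", "mem_limit": "1G"},
                    },
                    "mem_2": {
                        "display_name": "2 GB RAM",
                        "kubespawner_override": {"mem_guarantee": "2G", "mem_limit": "2G"},
                    },
                    # ...choices for 4 / 8 / 16 / 32 GB would follow the same pattern.
                },
            },
        },
    },
    # A similar profile would cover r5.4xlarge, plus a non-shared g4dn.2xlarge GPU profile.
]
```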
The requested hardware profile(s) and machine types were: Small: t3.large, Medium: t3.xlarge, Large: t3.2xlarge, Large w/GPU: t2.2xlarge + GPU.
There are a few challenges with this request:
t3 instances and burstable operation
It seems the t3 machines are "burstable" and have a CPU credit system associated with them. We have never operated these before, and I think it could make billing even more complicated; node sharing on them is probably extremely complicated.
More details in https://aws.amazon.com/ec2/instance-types/
t2 instances with GPU
I don't think the t2 instances can have GPUs attached.
Node sharing
The pattern of 1:1 user:node, where nodes are never shared between users, is costly to run in the cloud overall, implies on average longer startup times for users, and is also harder to scale because of cloud quotas.
@colliand @nwehrheim the requested deployment is novel and will likely require more discussion than we can manage before the target start date of June 1.
One path to move forward more quickly is to go with what I feel confident we can deliver without internal discussion:
If that is acceptable, I can start working on such a deployment right away.
If we are concerned about isolation failures and don't have time to fully explore the workloads, I'd suggest in this case creating a dedicated node pool for this hub's Dask clusters in the shared cluster, rather than a shared Dask node pool or running on the same node pools as the shared nodes. I think that should be doable with minimal additional complexity.
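As a rough illustration of that isolation idea, a dask-gateway Kubernetes backend can pin this hub's scheduler and worker pods to a dedicated node pool. The label and taint names below are hypothetical placeholders, not an agreed-on convention.

```python
# Sketch only: pin dask-gateway scheduler/worker pods to a dedicated node pool.
# The "2i2c.org/node-purpose=dask-ncar" label and matching taint are hypothetical
# placeholders for whatever the dedicated node pool would actually use.
dedicated_pool_pod_config = {
    "nodeSelector": {"2i2c.org/node-purpose": "dask-ncar"},
    "tolerations": [
        {
            "key": "2i2c.org/node-purpose",
            "operator": "Equal",
            "value": "dask-ncar",
            "effect": "NoSchedule",
        }
    ],
}

# dask-gateway's Kubernetes backend config (e.g. applied via the chart's extraConfig).
c.KubeClusterConfig.scheduler_extra_pod_config = dedicated_pool_pod_config
c.KubeClusterConfig.worker_extra_pod_config = dedicated_pool_pod_config
```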
We track usage by namespace, so billing, while a little manual, should be fine in the shared cluster - cf. dask-staging.
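For reference, a minimal sketch of the kind of manual per-namespace accounting this implies, assuming a reachable Prometheus with kube-state-metrics; the endpoint URL is a hypothetical placeholder and this is not our actual billing tooling.

```python
# Sketch: pull per-namespace memory requests from Prometheus to attribute
# usage to communities manually. The endpoint below is a placeholder.
import requests

PROMETHEUS_URL = "https://prometheus.example.2i2c.cloud"  # hypothetical

# kube-state-metrics exposes resource requests per container; sum them by namespace.
query = 'sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    namespace = result["metric"].get("namespace", "<none>")
    mem_gb = float(result["value"][1]) / 1e9
    print(f"{namespace}: {mem_gb:.1f} GB of memory requested")
```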
I'm wondering why you are recommending dedicated AWS over dedicated GCP (which we also have). Is there a specific cloud-dependent issue you envisage?
Longer term we should validate any isolation concerns we have against k8s, but that can wait given the deadline for this request.
Currently researchdelight is on its own cluster in us-west-2 and incurs a cost because it is not sharing infrastructure.
If NCAR wants a shared hub, I invite it to share the same cluster that researchdelight is on.
This will allow us to experiment with shared research clusters and reduce the operating costs of researchdelight.
@jmunroe what is the underlying reason behind the shared hub? Is this something NCAR actually cares about, or just something that has been suggested to them?
NCAR chose a shared over dedicated cluster based on price. Proximity to NASA data drove the choice of AWS us-west-2.
Thanks @colliand - reading the running notes and the comment https://github.com/2i2c-org/infrastructure/issues/2523#issuecomment-1546660449 above, it seems that the initial usage of Dask might be low. It probably makes sense to consider placing some guard rails on the supported number of Dask worker nodes in the shared hub for testing. Is that something that the community would be open to?
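A minimal sketch of what such guard rails could look like, assuming they are applied through dask-gateway's server-side cluster limits (for example via the chart's extraConfig); the numbers are placeholders for whatever cap the community is comfortable with.

```python
# Sketch only: cap the size of any single user's dask cluster at the gateway.
# The exact numbers are placeholders, not an agreed-on limit.
c.ClusterConfig.cluster_max_workers = 20       # at most 20 workers per cluster
c.ClusterConfig.cluster_max_cores = 80         # and/or cap total cores
c.ClusterConfig.cluster_max_memory = "640 G"   # and/or cap total memory
```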
I believe that's a good idea. Any objections from @nwehrheim?
The NCAR team just talked about this together.
It sounds like since usage is tracked by namespace, billing won't be an issue for the shared cluster.
Use of node sharing and the instance setup @consideRatio suggested here: https://github.com/2i2c-org/infrastructure/issues/2523#issuecomment-1551183029 seems like the way to go and provides a learning experience.
No objections to putting guardrails on the supported number of Dask worker nodes.
Thanks for the followup @nwehrheim!
Nick, the JupyterHub installation is now up and available, but there are a few points I'd like your help with and feedback on.

1. Could you press the Grant (or Request) button for the NCAR GitHub organization, similar to the grant button next to jupyterhub seen in this image? What we request here is permission to inspect the members of various teams in the GitHub organization NCAR (read:org). Specifically, JupyterHub will check whether the user is a member of the GitHub org NCAR's team 2i2c-cloud-users to decide if the GitHub user should be authorized access (see the config sketch after these questions). Note that if you can't or don't want to press Grant, the login isn't expected to succeed unless your GitHub user's membership of the GitHub team has explicitly been made public ahead of time.
2. Is ncar-cisl.2i2c.cloud acceptable among the choices of <anything>.2i2c.cloud? Note that we can also make use of a domain not managed by 2i2c as an alternative to this.
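For context on point 1, a minimal sketch (not necessarily our exact deployment config) of the team-based authorization described above, using OAuthenticator's GitHub support:

```python
# Sketch of the team-based GitHub authorization described above; the config
# in the deployed hub may differ in detail.
c.JupyterHub.authenticator_class = "github"

# read:org lets the hub check non-public organization/team memberships.
c.GitHubOAuthenticator.scope = ["read:org"]

# Only members of the NCAR org's 2i2c-cloud-users team are authorized.
c.GitHubOAuthenticator.allowed_organizations = {"NCAR:2i2c-cloud-users"}
```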
Thanks @consideRatio for setting this hub up! I look forward to hearing from @nwehrheim and others on the NCAR-CISL team. After we confirm that the hub is operational, we may wish to get together with @jmunroe for an onboarding conversation.
Hi @consideRatio! I authorized 2i2c-org. We are all fine with the ncar-cisl.2i2c.cloud domain name. Thanks for the suggestion. We are also fine with the deviation and will keep you posted if we need to tweak for any reason. Thanks again!
Oh, I forgot to mention that yes, we can see Server options after the grant.
Thank you for the followup @nwehrheim!!
The GitHub handle of the community representative: nwehrheim
Hub important dates:
Hub Authentication Type: GitHub (e.g., @mygithubhandle)
First Hub Administrators:
[GitHub Auth only] How would you like to manage your users? Allowing members of specific GitHub team(s)
[GitHub Teams Auth only] Profile restriction based on team membership: NCAR/2i2c-cloud-users (learned this in the Running Notes doc comment exchange with @nicholascote).
Hub logo image URL: https://www.vmcdn.ca/f/files/longmontleader/import/2017_06_ncar_highres_transparent.png
Hub logo website URL: https://www2.cisl.ucar.edu/
Hub user image GitHub repository: pending
Hub user image tag and name: pending
Extra features you would like to enable:
(Optional) Preferred cloud provider: AWS
(Optional) Billing and Cloud account: None
Other relevant information to the features above: Proposed hub URL: ncar.2i2c.cloud with associated staging.ncar.2i2c.cloud hub.
Tasks to deploy the hub