2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[New Hub] LINC (MIT Brain) #3828

Closed: yuvipanda closed this issue 5 months ago

yuvipanda commented 8 months ago

Copied over from https://github.com/2i2c-org/meta/issues/913

Process Note

I'm using this as a way to try to rejig our new hub request process. See https://github.com/2i2c-org/meta/issues/897 (particularly https://github.com/2i2c-org/meta/issues/897#issuecomment-2010984904) for more information.

The Miro board at https://miro.com/app/board/uXjVNjUP3iQ=/ describes the various 'phases' of new hub turn-up. Each phase will be marked "READY" once all information needed for it is available, and "NOT READY" otherwise. Each section should also link to an appropriate runbook.

There will be customizations after this is all set up, but this is a pathway towards a standardized hub turn-up.

Phase 1: Account setup (READY)

This is applicable for cases where this is a dedicated cluster. The following table lists the information needed before this phase can start.

| Question | Answer |
| --- | --- |
| Cloud Provider | AWS |
| Will 2i2c pay for cloud costs? | Yes |
| Name of cloud account | linc |

Appropriate runbook: https://infrastructure.2i2c.org/hub-deployment-guide/cloud-accounts/new-aws-account/

Phase 2: Cluster setup (READY)

This assumes all engineers have access to this new account, and will be able to set up the cluster + support, without any new hubs being set up.

| Question | Answer |
| --- | --- |
| Region / Zone of the cluster | us-east-1 |
| Name of cluster | linc |
| Is GPU required? | yes |

Appropriate runbooks:
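For orientation, here is a minimal sketch of the kind of eksctl cluster config this phase results in. It is hypothetical: 2i2c generates the real config from a jsonnet template in this repo, and the node group names and sizes below are illustrative.

```yaml
# Hypothetical sketch only; the real config is generated from a
# jsonnet template. Node group names and sizes are illustrative.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: linc
  region: us-east-1
nodeGroups:
  # Core nodes for hub infrastructure
  - name: core-a
    instanceType: r5.xlarge
    minSize: 1
    maxSize: 6
  # User nodes, scaled to zero when idle
  - name: nb-r5-xlarge
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 100
  # GPU nodes (GPU is required per the table above)
  - name: gpu-g4dn-xlarge
    instanceType: g4dn.xlarge
    minSize: 0
    maxSize: 20
```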

Phase 3 : Hub setup (READY)

There are going to be a number of hubs; this section starts specifying them.

Hub 1: Staging

Phase 3.1: Initial setup

| Question | Answer | Notes |
| --- | --- | --- |
| Name of the hub | staging | |
| Dask gateway? | no | |
| Splash Image | https://github.com/lincbrain/linc-artwork/blob/main/linc.logo.color+white.notext.png?raw=true | |
| URL | https://lincbrain.org/ | |
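For reference, a sketch of how this hub might appear as an entry in the cluster's cluster.yaml. The display_name and domain are my assumptions (see the domain discussion further down), and `helm_chart: basehub` follows from "Dask gateway? no":

```yaml
# Sketch only: display_name and domain are assumed, not specified above.
hubs:
  - name: staging
    display_name: "LINC (staging)"
    domain: staging.linc.2i2c.cloud
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml
      - staging.values.yaml
```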

Phase 3.2: Authentication

| Question | Answer |
| --- | --- |
| Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
| Org based access? | No |
| Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra |
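A minimal sketch of the matching z2jh config, assuming the standard GitHubOAuthenticator setup; the callback URL is an assumption based on the staging domain discussed below:

```yaml
jupyterhub:
  hub:
    config:
      JupyterHub:
        authenticator_class: github
      GitHubOAuthenticator:
        # Assumed domain; see the domain naming discussion further down
        oauth_callback_url: https://staging.linc.2i2c.cloud/hub/oauth_callback
      Authenticator:
        admin_users:
          - kabilar
          - aaronkanzer
          - asmacdo
          - satra
```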

Phase 3.3: Object storage access

| Question | Answer | Notes |
| --- | --- | --- |
| Scratch bucket enabled? | Yes | |
| Persistent bucket enabled? | no | |
| Requestor pays requests to external buckets allowed? | no | |
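A sketch of the usual wiring for the scratch bucket, assuming 2i2c's convention of exposing it to user servers through a SCRATCH_BUCKET environment variable; the bucket name is illustrative, and the bucket itself is provisioned separately via terraform:

```yaml
jupyterhub:
  singleuser:
    extraEnv:
      # Bucket name is illustrative; each user writes under a prefix
      # keyed on their JupyterHub username.
      SCRATCH_BUCKET: s3://linc-scratch-staging/$(JUPYTERHUB_USER)
```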

Phase 3.4: Profile List

This was derived from looking at https://github.com/dandi/dandi-hub/blob/dandi/config.yaml.j2#L138-L210 and adapting it to match our standards.

Environments
| Display Name | Description | Overrides | Resource Allocation Choices |
| --- | --- | --- | --- |
| DANDI (CPU) | Default DANDI image with JupyterLab | `image: dandiarchive/dandihub:latest`<br>`image_pull_policy: Always` | CPU (see below) |
| DANDI Matlab (CPU) | DANDI image with MATLAB. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-matlab`<br>`image_pull_policy: Always` | CPU |
| DANDI (GPU) | DANDI image with JupyterLab and GPU support | `image: dandiarchive/dandihub:latest-gpu`<br>`image_pull_policy: Always`<br>`extra_resource_limits: {nvidia.com/gpu: 1}` | GPU |
| DANDI Matlab (GPU) | DANDI Matlab image with GPU support. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-gpu-matlab`<br>`image_pull_policy: Always`<br>`extra_resource_limits: {nvidia.com/gpu: 1}` | GPU |
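As a sketch, one row of the table above maps onto z2jh's profileList roughly like this; the resource_allocation choices get filled in from the generated lists in the next section:

```yaml
# Sketch of one profileList entry; the choices dict is filled with
# the generated mem_* allocations listed below.
jupyterhub:
  singleuser:
    profileList:
      - display_name: DANDI (CPU)
        description: Default DANDI image with JupyterLab
        kubespawner_override:
          image: dandiarchive/dandihub:latest
          image_pull_policy: Always
        profile_options:
          resource_allocation:
            display_name: Resource Allocation
            choices: {}  # the mem_* entries from the CPU list below
```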
Resource Allocations
CPU

Generated by `deployer generate resource-allocation choices r5.xlarge --num-allocations 4`
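For reference, the byte values below appear to be derived from the node's allocatable memory: for example, 3982682624 bytes / 1024³ ≈ 3.71 GiB, which is where the "3.7 GB RAM" display name comes from.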

```yaml
mem_3_7:
  display_name: 3.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 3982682624
    mem_limit: 3982682624
    cpu_guarantee: 0.46875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
  default: true
mem_7_4:
  display_name: 7.4 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 7965365248
    mem_limit: 7965365248
    cpu_guarantee: 0.9375
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_14_8:
  display_name: 14.8 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 15930730496
    mem_limit: 15930730496
    cpu_guarantee: 1.875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_29_7:
  display_name: 29.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 31861460992
    mem_limit: 31861460992
    cpu_guarantee: 3.75
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_60_6:
  display_name: 60.6 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 65094813696
    mem_limit: 65094813696
    cpu_guarantee: 7.86
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_121_2:
  display_name: 121.2 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 130189627392
    mem_limit: 130189627392
    cpu_guarantee: 15.72
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_244_9:
  display_name: 244.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 263005526016
    mem_limit: 263005526016
    cpu_guarantee: 31.8
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_489_9:
  display_name: 489.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 526011052032
    mem_limit: 526011052032
    cpu_guarantee: 63.6
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
```
GPU

Manually set up for now, but this should eventually be autogenerated too.

```yaml
gpu_1:
  display_name: 1 T4 GPU, ~4 CPUs, ~16GB of RAM
  kubespawner_override:
    mem_guarantee: 14G
    mem_limit: 16G
    cpu_guarantee: 3
    cpu_limit: 4
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
  default: true
gpu_2:
  display_name: 1 T4 GPU, ~8 CPUs, ~32GB of RAM
  kubespawner_override:
    mem_guarantee: 29G
    mem_limit: 32G
    cpu_guarantee: 6
    cpu_limit: 8
    node_selector:
      node.kubernetes.io/instance-type: g4dn.2xlarge
```

Hub 2: LINC hub

The same as staging, just with a different name (linc).

yuvipanda commented 8 months ago

@consideRatio I believe this is now complete and you should be able to set up the hub now. I notice in https://github.com/2i2c-org/infrastructure/pull/3854 it's set up as a daskhub - note that it should instead be set up as a base hub.

As you go through these, if you find you're having to make choices that make use of information not present in this issue, please point it out so I can make sure to incorporate that into the process.

Thanks.

yuvipanda commented 8 months ago

Because they want more CPUs in their GPU nodes, we need to set up g4dn.2xlarge nodes as well in eksctl.
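A hypothetical sketch of the extra nodegroup; the real change goes through the cluster's eksctl config, and the name and sizes here are illustrative:

```yaml
nodeGroups:
  # Illustrative only: an additional GPU flavour with more CPU per GPU
  - name: gpu-g4dn-2xlarge
    instanceType: g4dn.2xlarge
    minSize: 0
    maxSize: 20
```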

consideRatio commented 8 months ago

> As you go through these, if you find you're having to make choices that make use of information not present in this issue, please point it out so I can make sure to incorporate that into the process.

> Hub 2: LINC hub
>
> The same as staging, just with a different name (linc).

This made me think you were requesting that, instead of naming the hub prod in our config, we name it linc, which raised questions like: should the domain name be linc.2i2c.cloud or linc.linc.2i2c.cloud?

Looking at how you set things up for bican, I'm assuming the config name should be prod and the domain name should be linc.2i2c.cloud (without staging), allowing specialized hubs to be named <something>.linc.2i2c.cloud.

consideRatio commented 8 months ago

@yuvipanda when filling in funded_by, I don't know what to write. On the lincbrain website I see this, but that doesn't necessarily mean this 2i2c hub should be considered funded by them as well.

(screenshot: funding acknowledgement from the lincbrain website)

For now, leaving it blank:

```yaml
          funded_by:
            name: ""
            url: ""
```
consideRatio commented 8 months ago

> Phase 3.2: Authentication
>
> | Question | Answer |
> | --- | --- |
> | Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
> | Org based access? | No |
> | Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra |

I'll set this up to provide only the admin users access for now, not enabling `allow_all: true` or similar.

consideRatio commented 8 months ago

The staging and prod hubs' `display_name:` in cluster.yaml is not explicitly specified in this issue.

Existing clusters show three different conventions for combining LINC / DANDI / BICAN with MIT, and for including or omitting "(prod)".

consideRatio commented 8 months ago

I also observed a discrepancy in how we configure `jupyterhub.custom.homepage.templateVars`: sometimes it's set once in common values without adjustment in staging/prod, and sometimes staging/prod adjust `org.name`. I'll go with the version without customization for this dedicated hub, as staging remains in the domain name and that may be sufficient distinction.

(screenshot: the two templateVars configuration variants)
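For concreteness, a sketch of the "set once in common values" variant I went with; the values are assembled from this issue and should be treated as illustrative:

```yaml
jupyterhub:
  custom:
    homepage:
      templateVars:
        org:
          name: LINC
          url: https://lincbrain.org/
          logo_url: https://github.com/lincbrain/linc-artwork/blob/main/linc.logo.color+white.notext.png?raw=true
        designed_by:
          name: 2i2c
          url: https://2i2c.org
        operated_by:
          name: 2i2c
          url: https://2i2c.org
        funded_by:
          name: ""
          url: ""
```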

consideRatio commented 8 months ago

Defaulting users to /lab and setting `allowNamedServers: true` were assumed to be wanted based on the dandi/bican config, but neither was explicit in the specification.
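In z2jh values, that assumption corresponds to roughly this (a sketch, not the exact diff):

```yaml
jupyterhub:
  hub:
    # Let users run multiple named servers alongside their default one
    allowNamedServers: true
  singleuser:
    # Land users in JupyterLab by default
    defaultUrl: /lab
```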

consideRatio commented 8 months ago

I tested the biggest resource allocation option for each machine type, and only the GPU options spawned; the CPU options did not. I'll look into fixing this for dandi/bican/linc.

(screenshot: scale-up failure for the max resource allocation on CPU nodes)

EDIT: Fixed for bican/dandi/linc in a PR

consideRatio commented 8 months ago

Starting up a GPU server (I don't remember which image) took somewhere between 9 and 10 minutes, and the startup timeout is 10 minutes. I've increased the timeout to 15 minutes for bican/dandi/linc for now, to provide some margin of error.

EDIT: Fixed in PR
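In z2jh values this corresponds to roughly the following (a sketch; the actual change landed in the PR mentioned above):

```yaml
jupyterhub:
  singleuser:
    # 15 minutes, up from the 10 that slow GPU image pulls were hitting
    startTimeout: 900
```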

yuvipanda commented 8 months ago

Thanks for the feedback, @consideRatio. I'll incorporate them into the process.

consideRatio commented 5 months ago

I think this was completed, and then we also decommissioned it - closing.