2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[New Hub] LINC (MIT Brain) #3828

Closed: yuvipanda closed this issue 5 months ago

yuvipanda commented 8 months ago

Copied over from https://github.com/2i2c-org/meta/issues/913

Process Note

I'm using this as a way to try to rejig our new hub request process. See https://github.com/2i2c-org/meta/issues/897 (particularly https://github.com/2i2c-org/meta/issues/897#issuecomment-2010984904) for more information.

The Miro board at https://miro.com/app/board/uXjVNjUP3iQ=/ describes the various 'phases' of new hub turn-up. Each phase will be marked "READY" once all information needed for it is available, and "NOT READY" otherwise. Each section should also link to an appropriate runbook.

There will be customizations after this is all set up, but this is a pathway towards a standardized hub turn-up.

Phase 1: Account setup (READY)

This is applicable for cases where this is a dedicated cluster. The following table lists the information needed before this phase can start.

| Question | Answer |
| --- | --- |
| Cloud Provider | AWS |
| Will 2i2c pay for cloud costs? | Yes |
| Name of cloud account | linc |

Appropriate runbook: https://infrastructure.2i2c.org/hub-deployment-guide/cloud-accounts/new-aws-account/

Phase 2: Cluster setup (READY)

This assumes all engineers have access to this new account, and will be able to set up the cluster + support, without any new hubs being set up.

| Question | Answer |
| --- | --- |
| Region / Zone of the cluster | us-east-1 |
| Name of cluster | linc |
| Is GPU required? | yes |

Appropriate runbooks:
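For orientation, here is a minimal sketch of the kind of eksctl cluster config this phase results in. It is hypothetical: 2i2c generates the real config from a jsonnet template in this repo, and the node group names and sizes below are illustrative.

```yaml
# Hypothetical sketch only; the real config is generated from a
# jsonnet template. Node group names and sizes are illustrative.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: linc
  region: us-east-1
nodeGroups:
  # Core nodes for hub infrastructure
  - name: core-a
    instanceType: r5.xlarge
    minSize: 1
    maxSize: 6
  # User nodes, scaled to zero when idle
  - name: nb-r5-xlarge
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 100
  # GPU nodes (GPU is required per the table above)
  - name: gpu-g4dn-xlarge
    instanceType: g4dn.xlarge
    minSize: 0
    maxSize: 20
```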

Phase 3 : Hub setup (READY)

There are going to be a number of hubs; this section starts specifying them.

Hub 1: Staging

Phase 3.1: Initial setup

| Question | Answer | Notes |
| --- | --- | --- |
| Name of the hub | staging | |
| Dask gateway? | no | |
| Splash Image | https://github.com/lincbrain/linc-artwork/blob/main/linc.logo.color+white.notext.png?raw=true | |
| URL | https://lincbrain.org/ | |
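For reference, a sketch of how this hub might appear as an entry in the cluster's cluster.yaml. The display_name and domain are my assumptions (see the domain discussion further down), and `helm_chart: basehub` follows from "Dask gateway? no":

```yaml
# Sketch only: display_name and domain are assumed, not specified above.
hubs:
  - name: staging
    display_name: "LINC (staging)"
    domain: staging.linc.2i2c.cloud
    helm_chart: basehub
    helm_chart_values_files:
      - common.values.yaml
      - staging.values.yaml
```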

Phase 3.2: Authentication

| Question | Answer |
| --- | --- |
| Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
| Org based access? | No |
| Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra |
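A minimal sketch of the matching z2jh config, assuming the standard GitHubOAuthenticator setup; the callback URL is an assumption based on the staging domain discussed below:

```yaml
jupyterhub:
  hub:
    config:
      JupyterHub:
        authenticator_class: github
      GitHubOAuthenticator:
        # Assumed domain; see the domain naming discussion further down
        oauth_callback_url: https://staging.linc.2i2c.cloud/hub/oauth_callback
      Authenticator:
        admin_users:
          - kabilar
          - aaronkanzer
          - asmacdo
          - satra
```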

Phase 3.3: Object storage access

| Question | Answer | Notes |
| --- | --- | --- |
| Scratch bucket enabled? | Yes | |
| Persistent bucket enabled? | no | |
| Requestor pays requests to external buckets allowed? | no | |
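A sketch of the usual wiring for the scratch bucket, assuming 2i2c's convention of exposing it to user servers through a SCRATCH_BUCKET environment variable; the bucket name is illustrative, and the bucket itself is provisioned separately via terraform:

```yaml
jupyterhub:
  singleuser:
    extraEnv:
      # Bucket name is illustrative; each user writes under a prefix
      # keyed on their JupyterHub username.
      SCRATCH_BUCKET: s3://linc-scratch-staging/$(JUPYTERHUB_USER)
```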

Phase 3.4: Profile List

This was derived from looking at https://github.com/dandi/dandi-hub/blob/dandi/config.yaml.j2#L138-L210 and adapting it to match our standards.

Environments
| Display Name | Description | Overrides | Resource Allocation Choices |
| --- | --- | --- | --- |
| DANDI (CPU) | Default DANDI image with JupyterLab | `image: dandiarchive/dandihub:latest`<br>`image_pull_policy: Always` | CPU (see below) |
| DANDI Matlab (CPU) | DANDI image with MATLAB. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-matlab`<br>`image_pull_policy: Always` | CPU |
| DANDI (GPU) | DANDI image with JupyterLab and GPU support | `image: dandiarchive/dandihub:latest-gpu`<br>`image_pull_policy: Always`<br>`extra_resource_limits: {nvidia.com/gpu: 1}` | GPU |
| DANDI Matlab (GPU) | DANDI Matlab image with GPU support. Requires you to bring your own license. | `image: dandiarchive/dandihub:latest-gpu-matlab`<br>`image_pull_policy: Always`<br>`extra_resource_limits: {nvidia.com/gpu: 1}` | GPU |
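As a sketch, one row of the table above maps onto z2jh's profileList roughly like this; the resource_allocation choices get filled in from the generated lists in the next section:

```yaml
# Sketch of one profileList entry; the choices dict is filled with
# the generated mem_* allocations listed below.
jupyterhub:
  singleuser:
    profileList:
      - display_name: DANDI (CPU)
        description: Default DANDI image with JupyterLab
        kubespawner_override:
          image: dandiarchive/dandihub:latest
          image_pull_policy: Always
        profile_options:
          resource_allocation:
            display_name: Resource Allocation
            choices: {}  # the mem_* entries from the CPU list below
```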
Resource Allocations
CPU

Generated by `deployer generate resource-allocation choices r5.xlarge --num-allocations 4`
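For reference, the byte values below appear to be derived from the node's allocatable memory: for example, 3982682624 bytes / 1024³ ≈ 3.71 GiB, which is where the "3.7 GB RAM" display name comes from.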

```yaml
mem_3_7:
  display_name: 3.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 3982682624
    mem_limit: 3982682624
    cpu_guarantee: 0.46875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
  default: true
mem_7_4:
  display_name: 7.4 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 7965365248
    mem_limit: 7965365248
    cpu_guarantee: 0.9375
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_14_8:
  display_name: 14.8 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 15930730496
    mem_limit: 15930730496
    cpu_guarantee: 1.875
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_29_7:
  display_name: 29.7 GB RAM, upto 3.7 CPUs
  kubespawner_override:
    mem_guarantee: 31861460992
    mem_limit: 31861460992
    cpu_guarantee: 3.75
    cpu_limit: 3.75
    node_selector:
      node.kubernetes.io/instance-type: r5.xlarge
mem_60_6:
  display_name: 60.6 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 65094813696
    mem_limit: 65094813696
    cpu_guarantee: 7.86
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_121_2:
  display_name: 121.2 GB RAM, upto 15.7 CPUs
  kubespawner_override:
    mem_guarantee: 130189627392
    mem_limit: 130189627392
    cpu_guarantee: 15.72
    cpu_limit: 15.72
    node_selector:
      node.kubernetes.io/instance-type: r5.4xlarge
mem_244_9:
  display_name: 244.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 263005526016
    mem_limit: 263005526016
    cpu_guarantee: 31.8
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
mem_489_9:
  display_name: 489.9 GB RAM, upto 63.6 CPUs
  kubespawner_override:
    mem_guarantee: 526011052032
    mem_limit: 526011052032
    cpu_guarantee: 63.6
    cpu_limit: 63.6
    node_selector:
      node.kubernetes.io/instance-type: r5.16xlarge
```
GPU

Manually set up for now, but this should eventually be autogenerated too.

```yaml
gpu_1:
  display_name: 1 T4 GPU, ~4 CPUs, ~16GB of RAM
  kubespawner_override:
    mem_guarantee: 14G
    mem_limit: 16G
    cpu_guarantee: 3
    cpu_limit: 4
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
  default: true
gpu_2:
  display_name: 1 T4 GPU, ~8 CPUs, ~32GB of RAM
  kubespawner_override:
    mem_guarantee: 29G
    mem_limit: 32G
    cpu_guarantee: 6
    cpu_limit: 8
    node_selector:
      node.kubernetes.io/instance-type: g4dn.2xlarge
```

Hub 2: LINC hub

The same as staging, just with a different name (linc).

yuvipanda commented 8 months ago

@consideRatio I believe this is now complete and you should be able to set up the hub now. I notice in https://github.com/2i2c-org/infrastructure/pull/3854 it's set up as a daskhub - note that it should instead be set up as a base hub.

As you go through these, if you find you're having to make choices that make use of information not present in this issue, please point it out so I can make sure to incorporate that into the process.

Thanks.

yuvipanda commented 8 months ago

Because they want more CPUs in their GPU nodes, we need to set up g4dn.2xlarge nodes as well in eksctl.
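A hypothetical sketch of the extra nodegroup; the real change goes through the cluster's eksctl config, and the name and sizes here are illustrative:

```yaml
nodeGroups:
  # Illustrative only: an additional GPU flavour with more CPU per GPU
  - name: gpu-g4dn-2xlarge
    instanceType: g4dn.2xlarge
    minSize: 0
    maxSize: 20
```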

consideRatio commented 8 months ago

> As you go through these, if you find you're having to make choices that make use of information not present in this issue, please point it out so I can make sure to incorporate that into the process.

> Hub 2: LINC hub
>
> The same as staging, just with a different name (linc).

This made me think you were requesting that, instead of naming the hub prod in our config, we name it linc, which raised questions like: should the domain name be linc.2i2c.cloud or linc.linc.2i2c.cloud?

Looking at how you set things up for bican, I'm assuming the config name should be prod and the domain name should be linc.2i2c.cloud (without staging), allowing specialized hubs to be named <something>.linc.2i2c.cloud.

consideRatio commented 8 months ago

@yuvipanda when filling in funded_by, I don't know what to write. On the lincbrain website I see this, but that doesn't necessarily mean this 2i2c hub should be considered funded by them as well.

(screenshot: funding acknowledgement from the lincbrain website)

For now, leaving it blank:

```yaml
          funded_by:
            name: ""
            url: ""
```
consideRatio commented 8 months ago

> Phase 3.2: Authentication
>
> | Question | Answer |
> | --- | --- |
> | Authentication Mechanism | GitHub (via GitHubOAuthenticator) |
> | Org based access? | No |
> | Admin Users | @kabilar, @aaronkanzer, @asmacdo, @satra |

I'll set this up to provide only the admin users access for now, not enabling `allow_all: true` or similar.

consideRatio commented 8 months ago

The staging and prod hubs' `display_name:` in cluster.yaml is not explicitly specified in this issue.

Existing clusters show three different conventions for combining LINC / DANDI / BICAN with MIT, and for including or omitting "(prod)".

consideRatio commented 8 months ago

I also observed a discrepancy in how we configure `jupyterhub.custom.homepage.templateVars`: sometimes it's set once in common values without adjustment in staging/prod, and sometimes staging/prod adjust `org.name`. I'll go with the version without customization for this dedicated hub, as staging remains in the domain name and that may be sufficient distinction.

(screenshot: the two templateVars configuration variants)
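For concreteness, a sketch of the "set once in common values" variant I went with; the values are assembled from this issue and should be treated as illustrative:

```yaml
jupyterhub:
  custom:
    homepage:
      templateVars:
        org:
          name: LINC
          url: https://lincbrain.org/
          logo_url: https://github.com/lincbrain/linc-artwork/blob/main/linc.logo.color+white.notext.png?raw=true
        designed_by:
          name: 2i2c
          url: https://2i2c.org
        operated_by:
          name: 2i2c
          url: https://2i2c.org
        funded_by:
          name: ""
          url: ""
```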

consideRatio commented 8 months ago

Defaulting users to /lab and setting `allowNamedServers: true` were assumed to be wanted based on the dandi/bican config, but neither was explicit in the specification.
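In z2jh values, that assumption corresponds to roughly this (a sketch, not the exact diff):

```yaml
jupyterhub:
  hub:
    # Let users run multiple named servers alongside their default one
    allowNamedServers: true
  singleuser:
    # Land users in JupyterLab by default
    defaultUrl: /lab
```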

consideRatio commented 8 months ago

I tested the biggest resource allocation option for each machine type, and only the GPU options spawned; the CPU options did not. I'll look into fixing this for dandi/bican/linc.

(screenshot: scale-up failure for the max resource allocation on CPU nodes)

EDIT: Fixed for bican/dandi/linc in a PR

consideRatio commented 8 months ago

Starting up a GPU server (I don't remember which image) took somewhere between 9 and 10 minutes, and the startup timeout is 10 minutes. I've increased the timeout to 15 minutes for bican/dandi/linc for now, to provide some margin of error.

EDIT: Fixed in PR
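In z2jh values this corresponds to roughly the following (a sketch; the actual change landed in the PR mentioned above):

```yaml
jupyterhub:
  singleuser:
    # 15 minutes, up from the 10 that slow GPU image pulls were hitting
    startTimeout: 900
```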

yuvipanda commented 8 months ago

Thanks for the feedback, @consideRatio. I'll incorporate them into the process.

consideRatio commented 5 months ago

I think this was completed, and then we also decommissioned it - closing.