coreweave / ml-containers

MIT License
21 stars 3 forks

feat(cuda-ssh): Add CUDA SSH server container #20

Closed Eta0 closed 1 year ago

Eta0 commented 1 year ago

CUDA SSH Server

This adds a container for a simple SSH server in a pod corresponding to this documentation example.

Used in: coreweave/kubernetes-cloud/cuda-ssh Adapted from: docker.io/atlanticcrypto/cuda-ssh-server

Its base image is ghcr.io/coreweave/ml-containers/torch, from this repository. At the moment it uses the torch:base edition, because torch:nccl is unsuitable for SSH sessions: the environment variables it requires don't propagate into SSH sessions. This may be switched to use torch:nccl as a base pending an upstream update in the coreweave/nccl-tests repo and its images to fix that issue.
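For context, sshd starts each session with a fresh, minimal environment, which is why image-level variables get lost. A minimal sketch of one common workaround, snapshotting selected variables into /etc/environment (which pam_env reads at login) from an entrypoint; the function name and variable list are illustrative, not the actual fix planned upstream:

```shell
# persist_env FILE VAR...: append each set variable as NAME=value to FILE.
# sshd gives sessions a minimal environment, but pam_env reads
# /etc/environment at login, so values written there survive into SSH
# sessions. All names here are illustrative.
persist_env() {
  env_file="$1"; shift
  for name in "$@"; do
    # POSIX-safe indirect lookup of the variable named $name
    value=$(eval "printf '%s' \"\${$name:-}\"")
    if [ -n "$value" ]; then
      printf '%s=%s\n' "$name" "$value" >> "$env_file"
    fi
  done
}

# In an entrypoint this would typically be something like:
# persist_env /etc/environment PATH LD_LIBRARY_PATH
```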

arsenetar commented 1 year ago

Is the plan to update the documentation & example files to use this image for the deployment as well as to initialize the image following a similar manner to what had been done previously? Using key based authentication seems like a safer default, although documentation for the example does follow password auth.

Eta0 commented 1 year ago

Is the plan to update the documentation & example files to use this image for the deployment as well as to initialize the image following a similar manner to what had been done previously? Using key based authentication seems like a safer default, although documentation for the example does follow password auth.

@arsenetar Yes, I was going to suggest a new series of steps for the docs in the corresponding PR for kubernetes-cloud, but I couldn't make that PR earlier because GitHub went down. The new steps would be:

# Create persistent storage volumes
kubectl apply -f sshd-root-pvc.yaml -f sshd-data-pvc.yaml

# Create the SSH service and public IP
kubectl apply -f sshd-service.yaml

# Launch the SSH server
kubectl apply -f sshd-deployment.yaml

# Wait for the SSH server to come up
kubectl rollout status deployment/sshd

# Add your SSH public key
kubectl exec -i deployment/sshd -c sshd -- /bin/tee --append /root/.ssh/authorized_keys < ~/.ssh/id_rsa.pub

# Print the public IP address
kubectl get service/sshd -o jsonpath='service/sshd public IP: {.status.loadBalancer.ingress[0].ip}{"\n"}'

# Print the host key fingerprints
kubectl exec deployment/sshd -c sshd -- find /etc/ssh -type f -name 'ssh_host_*_key.pub' -exec ssh-keygen -lf {} ';'

# Connect
ssh root@<public IP>

(No preference on keeping the old kubectl get service versus the specialized one that prints just the IP.) I think the kubectl exec line to add a public key ought to be pretty easy to follow.
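Since that kubectl exec step appends whatever it reads on stdin, a tiny guard against accidentally pasting a private key or other garbage could look like this (purely illustrative, not part of the image):

```shell
# is_pubkey LINE: crude check that a line looks like an OpenSSH public
# key before it gets appended to authorized_keys. Illustrative only;
# the key-type list is not exhaustive.
is_pubkey() {
  case "$1" in
    ssh-ed25519\ *|ssh-rsa\ *|ecdsa-sha2-*\ *) return 0 ;;
    *) return 1 ;;
  esac
}

# e.g.:
# is_pubkey "$(cat ~/.ssh/id_rsa.pub)" && \
#   kubectl exec -i deployment/sshd -c sshd -- \
#     /bin/tee --append /root/.ssh/authorized_keys < ~/.ssh/id_rsa.pub
```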

The initialization is trimmed down to:

if [ ! -f /target/initialized ]; then
  # Regenerate the SSH host keys, copy the root filesystem onto the
  # persistent volume, and leave a sentinel so restarts skip this step
  dpkg-reconfigure openssh-server && \
  cp -ax / /target && \
  echo 'Initialization complete' && \
  touch /target/initialized;
fi
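The sentinel-file pattern in that snippet generalizes to a small helper that runs an expensive step at most once per volume; a sketch (run_once is a made-up name, not part of the image):

```shell
# run_once SENTINEL CMD...: run CMD only if SENTINEL does not exist,
# and create SENTINEL on success so later restarts skip the work.
run_once() {
  sentinel="$1"; shift
  if [ ! -f "$sentinel" ]; then
    "$@" && touch "$sentinel"
  fi
}

# e.g. the copy step above would be:
# run_once /target/initialized cp -ax / /target
```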
arsenetar commented 1 year ago

@wbrown Yeah, just wanted to make sure we were going to update the docs to work with this.

rtalaricw commented 1 year ago

@Eta0 We can merge this.

Eta0 commented 1 year ago

@Eta0 We can merge this.

@rtalaricw

Nice — I did also want to raise another question before merging. @salanki had suggested using nccl-tests as a base for this container, and now that nccl-tests's PR #16 has been merged in, we could actually use torch:nccl instead of torch:base for these.

Pros of that would be:

  • torch:nccl is based on the nvidia/cuda devel containers featuring cuDNN, and contains a lot of CUDA development tools by default, like nvcc.

    • If these SSH containers are intended mainly for development, that could make them more usable.

Cons would be:

  • torch:nccl includes significantly different PyTorch configuration options and supplemental libraries (e.g. HPC-X & newer NCCL versions), so what works or doesn't work on that type of container may not correspond 1:1 to how a finished app would work running off of torch:base or bare nvidia/cuda images.
  • torch:nccl images are much larger.

What would y'all think fits better between those two options? We could also add in a matrix build for this with both editions (but it would be sort of complicated to fit a choice between those into the docs example).

rtalaricw commented 1 year ago

@Eta0 We can merge this.

@rtalaricw

Nice — I did also want to raise another question before merging. @salanki had suggested using nccl-tests as a base for this container, and now that nccl-tests's PR #16 has been merged in, we could actually use torch:nccl instead of torch:base for these.

Pros of that would be:

  • torch:nccl is based on the nvidia/cuda devel containers featuring cuDNN, and contains a lot of CUDA development tools by default, like nvcc.

    • If these SSH containers are intended mainly for development, that could make them more usable.

Cons would be:

  • torch:nccl includes significantly different PyTorch configuration options and supplemental libraries (e.g. HPC-X & newer NCCL versions), so what works or doesn't work on that type of container may not correspond 1:1 to how a finished app would work running off of torch:base or bare nvidia/cuda images.
  • torch:nccl images are much larger.

What would y'all think fits better between those two options? We could also add in a matrix build for this with both editions (but it would be sort of complicated to fit a choice between those into the docs example).

Let's do a matrix build! I think keeping it minimal is ideal, but providing options is better.
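For reference, the matrix boils down to one image per torch edition; roughly this (the image names and BASE_IMAGE build arg are assumptions inferred from this thread, and the real pipeline is a GitHub Actions matrix, not a loop):

```shell
# matrix_builds: emit one docker build command per torch base edition.
# Tag scheme and build args are illustrative, not the actual workflow.
matrix_builds() {
  for edition in base nccl; do
    printf 'docker build --build-arg BASE_IMAGE=%s -t %s .\n' \
      "ghcr.io/coreweave/ml-containers/torch:${edition}" \
      "cuda-ssh:${edition}"
  done
}
```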

wbrown commented 1 year ago

@Eta0 Read through your comments. There are reasons to stick with our torch build, including architecture impact and cuBLAS bugs. Image size really does impact things, and we should not underestimate this.

Eta0 commented 1 year ago

Oops, closed this by mistake. Working on the matrix build now.

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5226159354 Image: ghcr.io/coreweave/ml-containers/cuda-ssh:es-cuda-ssh-8b9c879-torch-ceeb8c2-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5226159354 Image: ghcr.io/coreweave/ml-containers/cuda-ssh:es-cuda-ssh-8b9c879-torch-ceeb8c2-base-cuda11.8.0-torch2.0.1-vision0.15.2-audio2.0.2

Eta0 commented 1 year ago

Matrix build added, and the corresponding coreweave/kubernetes-cloud PR is now up as well. It currently uses the version of this container built from torch:nccl as its base.


There are reasons to stick with our torch build, including architecture impact and cuBLAS bugs. Image size really does impact things, and we should not underestimate this.

@wbrown either option would use our torch builds; the choice is between torch:base and torch:nccl. Since torch:nccl is based on the devel edition of nvidia/cuda, it would probably make for a better development container, but it carries everything else that comes with torch:nccl instead of torch:base. Which one do you think works better, in this case?

wbrown commented 1 year ago

@Eta0 torch-nccl, quite reluctantly despite the increase in size. This would let the image work on more machines and architectures.