coiled / feedback

A place to provide Coiled feedback

Launching GPU with nvidia runtime #284

aimran-adroll commented 2 weeks ago

I would like to be able to launch notebooks using containers with the NVIDIA runtime.

It'd be good to know whether it's supported before I spend time preparing an image with the additional Dask requirements.

mrocklin commented 2 weeks ago

Hey @aimran-adroll , I suspect that the answer is "yes", although you might also be interested in recent GPU developments in Coiled over the last couple of months (package sync works, better GPU metrics, etc.). If you're game, it might be good to have you talk to @jrbourbeau, who did a bunch of this work. I'll bet that he could point you in some fruitful directions. If that's interesting, send me a note offline and we'll set something up.

cc'ing @ntabris to give the definitive "yes that's fine" to your stated question though

ntabris commented 2 weeks ago

Yes, that's fine. The VMs have the NVIDIA Container Toolkit, and you can use containers that see and use the GPU with the NVIDIA driver + CUDA.
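
For a quick local sanity check (a sketch, assuming Docker and the NVIDIA Container Toolkit are installed on your own machine; the CUDA image tag is only an example), you can confirm that a container sees the GPU:

# --gpus all exposes the host GPU to the container through the NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi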

aimran-adroll commented 2 weeks ago

Thanks both @ntabris and @mrocklin

I will give it a go. I suspect my first attempt failed since it did not have the obvious Dask/Jupyter-related packages 🤦🏽‍♂️

Super exciting to be able to launch GPU notebooks

ntabris commented 2 weeks ago

FYI this doc says what our docker run command needs, so you can validate the container locally if you want.

aimran-adroll commented 1 week ago

This little Dockerfile did not work:

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:nightly

WORKDIR /src

# Upgrade pip, then add the Dask/Jupyter packages the Coiled notebook expects
RUN pip install -U pip
RUN pip install dask coiled ipykernel ipython dask-labextension jupyterlab matplotlib

Locally, it passed the check that @ntabris mentioned:

❯ docker run --rm nvidia-merlin python -m distributed.cli.dask_spec \
        --spec '{"cls":"dask.distributed.Scheduler", "opts":{}}'

Command to launch the notebook:

coiled notebook start --vm-type g5.xlarge --container redacted.dkr.ecr.us-west-2.amazonaws.com/aitest/nv-merlin:latest --region us-west-2 --name ai-tf

Gist of the error:

coiled.errors.ClusterCreationError: Cluster status is error (reason: Scheduler Stopped -> Software environment exited with error code 1.) (cluster_id: 494802)

ntabris commented 1 week ago

Ah, sorry, this isn't easy to spot, but I think the problem is a mismatch between the image architecture and the VM architecture. When I dig into the (not super easy to find) logs, I see this:

dask The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64/v3) and no specific platform was requested
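
If that's the culprit, one fix (a sketch, reusing the image tag from the command above) is to build explicitly for the VM's architecture, e.g. with buildx on an arm64 (Apple Silicon) machine:

# Force an amd64 image even when building on an arm64 host
docker buildx build --platform linux/amd64 \
    -t redacted.dkr.ecr.us-west-2.amazonaws.com/aitest/nv-merlin:latest .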

aimran-adroll commented 1 week ago

Thanks for the quick debugging. 🚀

aside: We need a cloud startup that lets you modify/build/push Docker images in the cloud on just the right machine 😄

Once you are done pushing out a 7 GB image over a residential network, you have forgotten what you wanted to do in the first place

mrocklin commented 1 week ago

I'd be curious to learn more about why you want to use Docker in the first place. My guess is that either there is a piece of software that you're trying to distribute that isn't in a convenient conda repository, or that Docker is just very culturally entrenched. If neither is the reason, I'd probably want to question the choice of Docker and see if there is some other approach we could facilitate.

aimran-adroll commented 1 week ago

Great question.

It's a fairly typical workflow for us/me. I want to try a new ML (or whatever) package. I have no idea what the dependencies are (especially when CUDA and a magical mix of different packages are involved). The exact source recipe is not always easy to track down. I also have to weigh the upfront time investment.

In these scenarios, a Docker container is a perfect answer to my conundrum -- quick and easy to evaluate something new.

mrocklin commented 1 week ago

So, for common ML packages (PyTorch, TensorFlow, XGBoost, ...) we've been teaching package sync how to translate between CPU and GPU versions. So if your package mostly depends on those (say you want to use some Hugging Face transformers package), then the answer is that you just conda install it on your local machine and then have Coiled spin up a cluster with GPUs attached. Coiled notices the change in architecture, swaps out the relevant packages, and has the conda solver fill in any gaps.

It's pretty magical.

If there were some other baseline GPU package that you needed (say, JAX) that didn't already have this treatment, then we could add it. The main reason not to use package sync in this case is if there is some GPU package for which there is no CPU equivalent and that you couldn't install on a non-GPU machine.
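
Concretely (a sketch that just drops the --container flag from the command earlier in this thread, assuming the local environment already has the needed packages installed), the package sync route would look something like:

# No --container flag: package sync replicates the local environment on the VM,
# swapping CPU packages for their GPU builds where needed
coiled notebook start --vm-type g5.xlarge --region us-west-2 --name ai-tf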

aimran-adroll commented 1 week ago

wow. that does sound magical

🏃🏽‍♂️ trying it now

mrocklin commented 1 week ago

See https://docs.coiled.io/user_guide/gpu-job.html#example-train-a-gpu-accelerated-pytorch-model