allenai / beaker-cli

A collaborative platform for rapid and reproducible research.
https://beaker.org
Apache License 2.0

Unable to access GPUs without `--gpus all`, locally and on Beaker #326

Closed · mbforbes closed this 3 years ago

mbforbes commented 3 years ago

I reproduced this issue in https://github.com/mbforbes/beaker-docker, where I use the Dockerfile from https://github.com/beaker/docs/blob/main/docs/start/run.md. The issue also happens to me when I start from an nvidia/cuda image (nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu20.04).

Copying the output from there:

$ docker --version
Docker version 20.10.8, build 3967b7d

$ docker build -t my-experiment .
# ...

$ docker run --rm -it my-experiment nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:380: starting container process caused: exec:
"nvidia-smi": executable file not found in $PATH: unknown.

$ docker run --rm -it --gpus all my-experiment nvidia-smi
Mon Aug 23 17:35:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When trying to run on Beaker, I see the same error:

StartError: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown

This StackOverflow answer says:

Since Docker 19.03, you need to install nvidia-container-toolkit package and then use the --gpus all flag.
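
For reference, the host-side setup that answer describes looks roughly like this (a sketch assuming an Ubuntu host with the NVIDIA container toolkit apt repository already configured; the image tag is just the one mentioned above):

# install the toolkit that lets Docker hand GPUs to containers
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# restart the Docker daemon so it picks up the NVIDIA runtime
sudo systemctl restart docker
# sanity check: the --gpus flag should now work with any CUDA-capable image
docker run --rm --gpus all nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu20.04 nvidia-smi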

What is confusing is that @ckarenz was able to run nvidia-smi successfully in another experiment on the same cluster running directly from the base image.

So, does this mean there is a difference in how the images are built that is changing the requirement for --gpus all?

+cc @schmmd

schmmd commented 3 years ago

@mbforbes I'm not able to reproduce this. When I built the Dockerfile you linked to (note I had to mkdir scripts), nvidia-smi worked without --gpus all on allennlp-server4.

michaels@allennlp-server4.corp ~/sandbox/beaker-docker $ docker run eed781ed3e9e nvidia-smi
Mon Aug 23 19:20:29 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:1A:00.0 Off |                  Off |
| 33%   58C    P2   197W / 260W |  17630MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:1B:00.0 Off |                  Off |
| 38%   63C    P2   201W / 260W |  18256MiB / 48601MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     Off  | 00000000:60:00.0 Off |                  Off |
| 39%   64C    P2   204W / 260W |  17634MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     Off  | 00000000:61:00.0 Off |                  Off |
| 40%   65C    P2   214W / 260W |  19008MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 8000     Off  | 00000000:B1:00.0 Off |                  Off |
| 33%   34C    P8   121W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 8000     Off  | 00000000:B2:00.0 Off |                  Off |
| 33%   34C    P8   122W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 8000     Off  | 00000000:DA:00.0 Off |                  Off |
| 33%   29C    P8   120W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 8000     Off  | 00000000:DB:00.0 Off |                  Off |
| 33%   34C    P8   120W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
schmmd commented 3 years ago

If I run on my laptop (where there is no GPU), I do get an nvidia-smi error.

$ docker run b5d021ea170c nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.

I think docker is doing some funny voodoo with nvidia-smi, and it's best not to use it to test the presence of GPUs on a Beaker batch job. I would use torch-specific commands instead, such as torch.cuda.is_available().
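
For example, something like the following torch-based check (a sketch; the script name and exact prints are illustrative, not from the Beaker docs):

# gpu_check.py: ask PyTorch what it can see, rather than relying on nvidia-smi
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"Device {i}:", torch.cuda.get_device_name(i))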

mbforbes commented 3 years ago

Thanks Michael. What's the output of docker --version?

schmmd commented 3 years ago
michaels@MacBook-Pro-2 ~> docker --version
Docker version 20.10.2, build 2291f61
mbforbes commented 3 years ago

I added more clarification and output to the README.md in this repo: https://github.com/mbforbes/beaker-docker

Feel free to close this issue unless you'd like to solve the underlying nvidia-smi mystery.

To recap why I think this was so confusing:

  1. Locally, if I don't pass --gpus all, --gpus 0, or --runtime=nvidia, then:
     a. nvidia-smi fails with a specific error message
     b. torch will not see CUDA as available, and will see no devices

  2. On Beaker, if I run the same container:
     a. nvidia-smi fails with the exact same error message as if I hadn't passed the above flags locally
     b. however, torch will see CUDA as available, and will see devices

The issue I originally encountered (a RuntimeError: NCCL Error 2: unhandled system error) remains unsolved due to other issues :-)

I'm also not sure how well the GPUs actually function with the tutorial Docker image. When I run the example container locally, it does pass the "CUDA available" and "CUDA device count" checks. But when I try to actually load up a CUDA tensor, e.g., with

torch.cuda.FloatTensor(1000, 1000).fill_(42)

I get an error that the installed CUDA capabilities (sm_37 sm_50 sm_60 sm_70) don't match the capability of my GPU (sm_80). It might be helpful to add some more installation instructions or testing steps depending on the device used? E.g., going from an nvidia/cuda base image instead of a python one? Just throwing some ideas out there 😅
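
For what it's worth, one way to surface that mismatch without allocating a tensor is to compare the GPU's compute capability against the architectures the installed torch build ships kernels for (a sketch; torch.cuda.get_arch_list() is available in recent PyTorch releases):

# compare the GPU's architecture with what this torch build was compiled for
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    gpu_arch = f"sm_{major}{minor}"         # e.g. sm_80 on an A100
    built_for = torch.cuda.get_arch_list()  # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
    print("GPU arch:", gpu_arch, "| torch built for:", built_for)
    if gpu_arch not in built_for:
        print("This torch build has no kernels for this GPU")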

schmmd commented 3 years ago

@epwalsh what base image do you use when you run Beaker batch jobs? I updated our documentation example recently based on some user feedback and I'm using the setup that AllenNLP uses, but it's causing @mbforbes some trouble.

schmmd commented 3 years ago

I worry the error you're getting is a driver mismatch and that I need to update the documentation to use a particular CUDA version of PyTorch. @aaasen what CUDA version are we on presently with Beaker?

epwalsh commented 3 years ago

A driver mismatch is likely the issue. @aaasen can correct me if I'm wrong, but I believe all of the servers are now on CUDA 11.x, so you should be using PyTorch built for CUDA 11.1.
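
For reference, a CUDA 11.1 build of PyTorch 1.9 could be installed with something like the line below (a sketch based on the PyTorch install instructions current at the time; pin whatever versions the docs example actually uses):

pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html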

Recently I've been using this Dockerfile: https://github.com/allenai/gpt-2-repro/blob/main/Dockerfile. This was mostly copied from the AllenNLP Dockerfile.

schmmd commented 3 years ago

I also copied from the AllenNLP Dockerfile, but I forgot to specify the CUDA version of torch. I'll add the FloatTensor check from @mbforbes to the example code, set CUDA to 11.1, and try it out later tonight.

schmmd commented 3 years ago

As promised, here's an update to the docs: https://github.com/beaker/pytorch-example/pull/1

schmmd commented 3 years ago

That said, when I run torch.cuda.FloatTensor(1000, 1000).fill_(42) with CUDA 10.2 on an on-premise machine, I don't get a failure.

mbforbes commented 3 years ago

I appreciate this! The pytorch base image from that PR is a good recommendation; I was going from one of nvidia's, but my Python installation has ended up a bit weird.

epwalsh commented 3 years ago

Oh, I should mention we have pre-built images for PyTorch that you can use as a base image: https://github.com/allenai/docker-images/pkgs/container/pytorch. Just do something like this:

FROM ghcr.io/allenai/pytorch:1.9.0-cuda11.1-python3.8

RUN pip install --no-cache-dir -r my-extra-requirements.txt
schmmd commented 3 years ago

Awesome! I was just thinking we needed this...

mbforbes commented 3 years ago

Apologies for the random question, Pete, but I guess now is as good a time as any: I haven't understood why --no-cache-dir is needed?

epwalsh commented 3 years ago

@mbforbes it's not necessary, but it will make your final image size a little bit smaller. If you omit --no-cache-dir, pip will save the downloaded wheels for the packages you install into a directory on the image (usually /root/.cache/pip, I think), which just wastes space.
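
Concretely, in a Dockerfile the two patterns below end up roughly equivalent in final image size (a sketch; requirements.txt is a placeholder filename):

# Option 1: never write the wheel cache
RUN pip install --no-cache-dir -r requirements.txt

# Option 2: install normally, then drop the cache in the same layer
RUN pip install -r requirements.txt && rm -rf /root/.cache/pip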

mbforbes commented 3 years ago

Ahh, for size not correctness, makes sense. Thank you!!

schmmd commented 3 years ago

It made a 2 GB difference for me on pytorch. Worth it!

epwalsh commented 3 years ago

Glad to hear it!

schmmd commented 3 years ago

@mbforbes I think your issue with accessing GPUs is resolved (it was a CUDA mismatch issue), but please reopen this if I'm mistaken.