@mbforbes I'm not able to reproduce this. When I build the Dockerfile you linked to (note I had to `mkdir scripts`), `nvidia-smi` worked without `--gpus all` on `allennlp-server4`.
```
michaels@allennlp-server4.corp ~/sandbox/beaker-docker $ docker run eed781ed3e9e nvidia-smi
Mon Aug 23 19:20:29 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:1A:00.0 Off |                  Off |
| 33%   58C    P2   197W / 260W |  17630MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:1B:00.0 Off |                  Off |
| 38%   63C    P2   201W / 260W |  18256MiB / 48601MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     Off  | 00000000:60:00.0 Off |                  Off |
| 39%   64C    P2   204W / 260W |  17634MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     Off  | 00000000:61:00.0 Off |                  Off |
| 40%   65C    P2   214W / 260W |  19008MiB / 48601MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 8000     Off  | 00000000:B1:00.0 Off |                  Off |
| 33%   34C    P8   121W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 8000     Off  | 00000000:B2:00.0 Off |                  Off |
| 33%   34C    P8   122W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 8000     Off  | 00000000:DA:00.0 Off |                  Off |
| 33%   29C    P8   120W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 8000     Off  | 00000000:DB:00.0 Off |                  Off |
| 33%   34C    P8   120W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
If I run on my laptop (where there is no GPU) I do get an `nvidia-smi` error:

```
$ docker run b5d021ea170c nvidia-smi                                     (base)
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "nvidia-smi": executable file not found in $PATH: unknown.
```
I think `docker` is doing some funny voodoo with `nvidia-smi` and it's best not to use this to test the presence of GPUs on a Beaker batch job. I would use `torch`-specific commands instead, such as `torch.cuda.is_available()`.
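A minimal sketch of that kind of check (the script name and output formatting here are my own, not from the tutorial):

```python
# check_gpu.py -- hypothetical name; prints what torch can see inside the container.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:  ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```

Something like this could be run inside the container (e.g. `docker run <image> python check_gpu.py`) in place of the `nvidia-smi` check.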
Thanks Michael. What's the output of `docker --version`?
```
michaels@MacBook-Pro-2 ~> docker --version
Docker version 20.10.2, build 2291f61
```
I added more clarification and output to the README.md in this repo: https://github.com/mbforbes/beaker-docker
Feel free to close this issue unless you'd like to solve the underlying `nvidia-smi` mystery.
To recap why I think this was so confusing:

1. Locally, if I don't pass `--gpus all` or `--gpus 0` or `--runtime=nvidia`, then
   a. `nvidia-smi` fails with a specific error message
   b. torch will not see CUDA as available, and will see no devices
2. On Beaker, if I run the same container,
   a. `nvidia-smi` fails with the exact same error message as if I hadn't passed the above flags locally
   b. however, torch will see CUDA as available, and will see devices

The issue I originally encountered (a `RuntimeError: NCCL Error 2: unhandled system error`) remains unsolved due to other issues :-)
I'm also not sure how well the devices actually function with the tutorial Docker image. When I run the example container locally, it does pass the "CUDA available" and "CUDA device count" checks. But when I try to actually load up a CUDA tensor, e.g., with `torch.cuda.FloatTensor(1000, 1000).fill_(42)`, I get an error that the installed CUDA capabilities (`sm_37 sm_50 sm_60 sm_70`) don't match the capability of my GPU (`sm_80`). It might be helpful to add some more installation instructions / testing depending on the device used? E.g., going from an `nvidia/cuda` base image instead of a `python` one? Just throwing some ideas out there 😅
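A slightly stronger smoke test along these lines (again just a sketch, not something from the tutorial) goes beyond `is_available()` and tries a real allocation on each visible device, which is where the `sm_80` mismatch shows up:

```python
# Hypothetical smoke test: the device-count check can pass even when the
# installed torch wheel was not built for this GPU's compute capability,
# so try an actual allocation and a tiny op on every visible device.
import torch

for i in range(torch.cuda.device_count()):
    device = f"cuda:{i}"
    try:
        x = torch.full((1000, 1000), 42.0, device=device)
        print(device, torch.cuda.get_device_name(i), "ok, sum =", x.sum().item())
    except RuntimeError as err:
        # e.g. "no kernel image is available for execution on the device"
        # when the wheel's sm_XX list doesn't cover this GPU.
        print(device, "failed:", err)
```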
@epwalsh what base image do you use when you run Beaker batch jobs? I updated our documentation example recently based on some user feedback and I'm using the setup that AllenNLP uses, but it's causing @mbforbes some trouble.
I worry the error you're getting is a driver mismatch and that I need to update the documentation to use a particular CUDA version of PyTorch. @aaasen what CUDA version are we on presently with Beaker?
Driver mismatch is likely an issue. @aaasen can correct me if I'm wrong, but I believe all of our servers are now on CUDA 11.x, so you should be using PyTorch for CUDA 11.1.
Recently I've been using this Dockerfile: https://github.com/allenai/gpt-2-repro/blob/main/Dockerfile. This was mostly copied from the AllenNLP Dockerfile.
I also copied from the AllenNLP Dockerfile, but I forgot to specify the CUDA version of torch. I'll add the `FloatTensor` check that @mbforbes has to the example code, set CUDA to 11.1, and try it out later tonight.
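For anyone hitting the same mismatch, a quick way to see which CUDA build of torch actually ended up in an image (a sketch; run it inside the container) is:

```python
# Print the torch build details relevant to a CUDA/driver mismatch.
import torch

print("torch version:      ", torch.__version__)   # e.g. 1.9.0+cu111 vs. a cu102 build
print("built against CUDA: ", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU compute capability:", torch.cuda.get_device_capability(0))
```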
As promised, here's an update to the docs: https://github.com/beaker/pytorch-example/pull/1
That said, on an on-premise machine, when I run `torch.cuda.FloatTensor(1000, 1000).fill_(42)` with CUDA 10.2, I don't get a failure.
I appreciate this! The `pytorch` base image from that PR is a good recommendation. I was going from one of nvidia's, but my Python installation has ended up a bit weird.
Oh, I should mention we have pre-built images for PyTorch that you can use for a base image: https://github.com/allenai/docker-images/pkgs/container/pytorch. Just do something like this:
```
FROM ghcr.io/allenai/pytorch:1.9.0-cuda11.1-python3.8
RUN pip install --no-cache-dir -r my-extra-requirements.txt
```
Awesome! I was just thinking we needed this...
Apologies for the random question, Pete, but I guess now is as good a time as any: I haven't understood why `--no-cache-dir` is needed?
@mbforbes it's not necessary, but it will make your final image size a little bit smaller. If you omit `--no-cache-dir`, `pip` will save the downloaded wheels for the packages you install into a directory on the image (usually `/root/.cache/pip`, I think), which just wastes space.
Ahh, for size not correctness, makes sense. Thank you!!
It made a 2 GB difference for me on PyTorch. Worth it!
Glad to hear it!
@mbforbes I think your issue with accessing GPUs is resolved (it was a CUDA mismatch issue), but please reopen this if I'm mistaken.
I reproduced this issue in https://github.com/mbforbes/beaker-docker, where I use the `Dockerfile` from https://github.com/beaker/docs/blob/main/docs/start/run.md. The issue also happens to me when I start from an `nvidia/cuda` image (`nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu20.04`). Copying the output from there:
When trying to run on Beaker, I see the same error:
This StackOverflow answer says
What is confusing is that @ckarenz was able to run `nvidia-smi` successfully in another experiment on the same cluster, running directly from the base image. So, does this mean there is a difference in how the images are built that is changing the requirement for `--gpus all`?

+cc @schmmd