Open TamarLevy opened 2 years ago
I installed the latest NVIDIA Container Toolkit. My NVIDIA driver version is 515.76, my CUDA version on the host is 11.7, and my Docker version is 20.10.18.
I created a Dockerfile with this content:

```dockerfile
FROM nvidia/cuda:11.4-runtime-ubuntu20.04
CMD nvcc --version
```
I built the image using:

```shell
docker build . -t nvidia-test
```
and the output is:

```
Sending build context to Docker daemon  798.3MB
Step 1/2 : FROM nvidia/cuda:11.4-runtime-ubuntu20.04
Get "https://registry-1.docker.io/v2/": x509: certificate signed by unknown authority
```
and ran it using:

```shell
docker run --gpus all nvidia-test
```
and the output is:

```
NVIDIA Release 21.10 (build 28019337)
PyTorch Version 1.10.0a0+0aef44c

Container image Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2021 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
      insufficient for PyTorch. NVIDIA recommends the use of the following flags:
      docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Sat Sep 24 13:13:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76       Driver Version: 515.76       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   33C    P0    54W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
I don't see that `nvcc --version` was executed; this looks more like the output of `nvidia-smi`. I also don't see CUDA 11.4 mentioned anywhere. What am I doing wrong?
I think you need to change

```dockerfile
FROM nvidia/cuda:11.4-runtime-ubuntu20.04
```

to

```dockerfile
FROM nvidia/cuda:11.4-devel-ubuntu20.04
```

in your Dockerfile. That is the developer image for NVIDIA CUDA, which gives you access to `nvcc` and other developer tools.
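For reference, a minimal sketch of the corrected Dockerfile (the exact tag available on Docker Hub may differ, so check the `nvidia/cuda` tag list for your CUDA version):

```dockerfile
# devel images ship the full CUDA toolkit, including nvcc;
# runtime images only ship the shared libraries needed to *run* CUDA apps.
FROM nvidia/cuda:11.4-devel-ubuntu20.04

# Print the CUDA compiler version when the container starts.
CMD ["nvcc", "--version"]
```

Rebuilding with `docker build . -t nvidia-test` and rerunning `docker run --gpus all nvidia-test` should then print the `nvcc` release string for CUDA 11.4 instead of the `nvidia-smi` table.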