Closed: JiaweiZhuang closed this issue 4 years ago
Here are the complete steps to install NVIDIA-Docker on an Ubuntu 18.04 AWS p2.xlarge
instance. The commands are a bit dense, but they can be wrapped into a single shell script.
1. Install CUDA driver
Get the relatively new nvidia-430 driver from the graphics-drivers PPA:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
Test installation:
$ nvidia-smi
Mon Oct  7 18:43:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    59W / 149W |      0MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2. Install the standard Docker
Follow https://docs.docker.com/install/linux/docker-ce/ubuntu/
sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo groupadd docker
sudo usermod -aG docker $USER # allow running docker without sudo, need to re-login
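After re-logging in, it can be handy to confirm the group change took effect before trying to run docker without sudo. A minimal sketch, using a hypothetical `in_group` helper (not part of the official docs):

```shell
# in_group USER GROUP: succeeds if USER is a member of GROUP
in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# after re-login, this should succeed:
#   in_group "$USER" docker && echo "docker group active"
```

If the check still fails after re-login, `newgrp docker` starts a shell with the group applied immediately.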
Test installation:
$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
3. Install NVIDIA-Docker
Follow https://github.com/NVIDIA/nvidia-docker#quickstart
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
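The `distribution` variable above is derived from `/etc/os-release`, which any systemd-based distro provides; sourcing it in a subshell keeps its variables out of the current environment. A quick way to inspect what the repository URL will contain (on Ubuntu 18.04 this prints `ubuntu18.04`):

```shell
# ID (e.g. "ubuntu") and VERSION_ID (e.g. "18.04") come from /etc/os-release;
# the subshell keeps the sourced variables from leaking into this shell
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
echo "$distribution"
```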
Test installation (https://github.com/NVIDIA/nvidia-docker#usage):
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
Mon Oct  7 18:46:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   52C    P0    69W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
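As mentioned at the top, the three steps above can be wrapped into a single shell script. A minimal sketch, packaged as a function so it can be sourced and run explicitly (untested end-to-end; assumes Ubuntu 18.04 on an AWS GPU instance and a sudo-capable user):

```shell
# install_nvidia_docker: driver + Docker CE + NVIDIA container toolkit in one shot
install_nvidia_docker() {
  set -e
  # 1. NVIDIA driver
  sudo add-apt-repository ppa:graphics-drivers/ppa -y
  sudo apt-get update
  sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
  # 2. Docker CE
  sudo apt-get install -y apt-transport-https ca-certificates curl \
    gnupg-agent software-properties-common
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  sudo apt-get update
  sudo apt-get install -y docker-ce docker-ce-cli containerd.io
  sudo groupadd docker || true            # group may already exist
  sudo usermod -aG docker "$USER"         # re-login required to take effect
  # 3. NVIDIA container toolkit
  distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" \
    | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
  sudo systemctl restart docker
}
```

Sourcing the file defines the function without running anything; invoking `install_nvidia_docker` then executes all three steps in order.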
The easiest way to get a PyTorch GPU image is from NVIDIA NGC. The image bundles a lot of extras, including JupyterLab and TensorBoard (see the release notes):
docker pull nvcr.io/nvidia/pytorch:19.09-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.09-py3
Inside the container, try:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
1.2.0a0+afb7a16 True
That is the latest 1.2.0 version; we need to roll back to 0.3.1.
Done via https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/ce4ed92054270418c03794c4194694970a7f478c and https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/f3b393846a4df37a7c1b071ee742cfdf909b898b
Follow the README in docker/install-nvidia-docker and docker/darts-pytorch-image. Everyone should be able to run the default script and get the expected result:
10/07 08:02:25 PM test 000 1.233736e-01 96.875000 100.000000
10/07 08:02:48 PM test 050 1.105459e-01 97.120095 99.959150
10/07 08:03:11 PM test 100 1.074739e-01 97.359733 99.948432
10/07 08:03:12 PM test_acc 97.369997
@dylanrandle Here's how to run DARTS on graphene data within the container:
# get data and source code
mkdir data
wget https://capstone2019-google.s3.amazonaws.com/graphene_processed.nc -P ./data/
git clone https://github.com/capstone2019-neuralsearch/darts.git
# run training
docker run --rm -it --gpus all -v $(pwd):/workdir/host_files darts-pytorch
# inside the container:
cd host_files
python3 darts/cnn/train_search.py --data ./data/ --dataset graphene
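For repeated runs it can help to collapse the interactive session into a single non-interactive command. A sketch with a hypothetical `run_darts` wrapper (the `darts-pytorch` image name and mount path are taken from the command above; the function itself is not part of the repo):

```shell
# run_darts DATASET: mount the current directory and launch training in one shot
run_darts() {
  local dataset="$1"   # e.g. "graphene"
  docker run --rm --gpus all -v "$(pwd)":/workdir/host_files darts-pytorch \
    bash -c "cd host_files && python3 darts/cnn/train_search.py --data ./data/ --dataset $dataset"
}
```

Usage: `run_darts graphene` from the directory that holds `data/` and the cloned `darts` repo.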
This is absolutely awesome. Brilliant!
I changed pytorch==0.3.1 (built with CUDA 8.0) to http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
(built with CUDA 9.0); otherwise the DARTS script crashes on newer GPU types such as p3.2xlarge
(V100):
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:95: UserWarning:
Found GPU0 Tesla V100-SXM2-16GB which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
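The fix generalizes: the wheel to pin depends on the CUDA level the GPU needs. A hedged sketch of that choice as a helper function (the cu90 URL is the one used above; the cu80 URL follows the same download.pytorch.org naming pattern but is an assumption, as is the function itself):

```shell
# select_torch_wheel GPU_NAME: pick a torch 0.3.1 wheel for the GPU's CUDA needs
# (V100-class GPUs require CUDA >= 9.0; older GPUs such as the K80 work with cu80)
select_torch_wheel() {
  case "$1" in
    *V100*) echo "http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl" ;;
    *)      echo "http://download.pytorch.org/whl/cu80/torch-0.3.1-cp36-cp36m-linux_x86_64.whl" ;;
  esac
}
```

For example, `select_torch_wheel "Tesla V100-SXM2-16GB"` (the GPU name as printed by nvidia-smi) yields the cu90 wheel.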
The best way to run the DARTS scripts in production is probably via nvidia-docker. It is important to freeze the environment, as the DARTS code requires
PyTorch == 0.3.1, torchvision == 0.2.0;
newer PyTorch versions crash for various reasons. The same container image can run on