Closed: JiaweiZhuang closed this issue 4 years ago
Here are the complete steps to install NVIDIA-Docker on an Ubuntu 18.04 AWS p2.xlarge
instance. The commands are a bit dense, but they can be wrapped into a single shell script.
1. Install CUDA driver
Get the relatively new nvidia-430 driver from the graphics-drivers PPA:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
Test installation:
$ nvidia-smi
Mon Oct  7 18:43:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    59W / 149W |      0MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2. Install the standard Docker
Follow https://docs.docker.com/install/linux/docker-ce/ubuntu/
sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo groupadd docker
sudo usermod -aG docker $USER # allow running docker without sudo, need to re-login
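After re-logging in, it can be handy to confirm the group change took effect before trying to run docker without sudo. A minimal sketch, using a hypothetical `in_group` helper (not part of the official docs):

```shell
# in_group USER GROUP: succeeds if USER is a member of GROUP
in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# after re-login, this should succeed:
#   in_group "$USER" docker && echo "docker group active"
```

If the check still fails after re-login, `newgrp docker` starts a shell with the group applied immediately.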
Test installation:
$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
3. Install NVIDIA-Docker
Follow https://github.com/NVIDIA/nvidia-docker#quickstart
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
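The `distribution` variable above is derived from `/etc/os-release`, which any systemd-based distro provides; sourcing it in a subshell keeps its variables out of the current environment. A quick way to inspect what the repository URL will contain (on Ubuntu 18.04 this prints `ubuntu18.04`):

```shell
# ID (e.g. "ubuntu") and VERSION_ID (e.g. "18.04") come from /etc/os-release;
# the subshell keeps the sourced variables from leaking into this shell
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
echo "$distribution"
```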
Test installation (https://github.com/NVIDIA/nvidia-docker#usage):
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
Mon Oct  7 18:46:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   52C    P0    69W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
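As mentioned at the top, the three steps above can be wrapped into a single shell script. A minimal sketch, packaged as a function so it can be sourced and run explicitly (untested end-to-end; assumes Ubuntu 18.04 on an AWS GPU instance and a sudo-capable user):

```shell
# install_nvidia_docker: driver + Docker CE + NVIDIA container toolkit in one shot
install_nvidia_docker() {
  set -e
  # 1. NVIDIA driver
  sudo add-apt-repository ppa:graphics-drivers/ppa -y
  sudo apt-get update
  sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
  # 2. Docker CE
  sudo apt-get install -y apt-transport-https ca-certificates curl \
    gnupg-agent software-properties-common
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  sudo apt-get update
  sudo apt-get install -y docker-ce docker-ce-cli containerd.io
  sudo groupadd docker || true            # group may already exist
  sudo usermod -aG docker "$USER"         # re-login required to take effect
  # 3. NVIDIA container toolkit
  distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" \
    | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
  sudo systemctl restart docker
}
```

Sourcing the file defines the function without running anything; invoking `install_nvidia_docker` then executes all three steps in order.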
The easiest way to get a PyTorch GPU image is from NVIDIA NGC. The image bundles a lot of extras, including JupyterLab and TensorBoard (see the release notes):
docker pull nvcr.io/nvidia/pytorch:19.09-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.09-py3
Inside the container, try:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
1.2.0a0+afb7a16 True
That is the latest 1.2.0 version; we need to roll back to 0.3.1.
Done via https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/ce4ed92054270418c03794c4194694970a7f478c and https://github.com/capstone2019-neuralsearch/AC297r_2019_NAS/commit/f3b393846a4df37a7c1b071ee742cfdf909b898b
Follow the README in docker/install-nvidia-docker and docker/darts-pytorch-image. Everyone should be able to run the default script and get the expected result:
10/07 08:02:25 PM test 000 1.233736e-01 96.875000 100.000000
10/07 08:02:48 PM test 050 1.105459e-01 97.120095 99.959150
10/07 08:03:11 PM test 100 1.074739e-01 97.359733 99.948432
10/07 08:03:12 PM test_acc 97.369997
@dylanrandle Here's how to run DARTS on graphene data within the container:
# get data and source code
mkdir data
wget https://capstone2019-google.s3.amazonaws.com/graphene_processed.nc -P ./data/
git clone https://github.com/capstone2019-neuralsearch/darts.git
# run training
docker run --rm -it --gpus all -v $(pwd):/workdir/host_files darts-pytorch
# inside the container:
cd host_files
python3 darts/cnn/train_search.py --data ./data/ --dataset graphene
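For repeated runs it can help to collapse the interactive session into a single non-interactive command. A sketch with a hypothetical `run_darts` wrapper (the `darts-pytorch` image name and mount path are taken from the command above; the function itself is not part of the repo):

```shell
# run_darts DATASET: mount the current directory and launch training in one shot
run_darts() {
  local dataset="$1"   # e.g. "graphene"
  docker run --rm --gpus all -v "$(pwd)":/workdir/host_files darts-pytorch \
    bash -c "cd host_files && python3 darts/cnn/train_search.py --data ./data/ --dataset $dataset"
}
```

Usage: `run_darts graphene` from the directory that holds `data/` and the cloned `darts` repo.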
This is absolutely awesome. Brilliant!
I changed pytorch==0.3.1 (built with CUDA 8.0) to http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
(built with CUDA 9.0); otherwise the DARTS script crashes on newer GPU types such as p3.2xlarge
(V100):
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:95: UserWarning:
Found GPU0 Tesla V100-SXM2-16GB which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
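The fix generalizes: the wheel to pin depends on the CUDA level the GPU needs. A hedged sketch of that choice as a helper function (the cu90 URL is the one used above; the cu80 URL follows the same download.pytorch.org naming pattern but is an assumption, as is the function itself):

```shell
# select_torch_wheel GPU_NAME: pick a torch 0.3.1 wheel for the GPU's CUDA needs
# (V100-class GPUs require CUDA >= 9.0; older GPUs such as the K80 work with cu80)
select_torch_wheel() {
  case "$1" in
    *V100*) echo "http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl" ;;
    *)      echo "http://download.pytorch.org/whl/cu80/torch-0.3.1-cp36-cp36m-linux_x86_64.whl" ;;
  esac
}
```

For example, `select_torch_wheel "Tesla V100-SXM2-16GB"` (the GPU name as printed by nvidia-smi) yields the cu90 wheel.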
The best way to run the DARTS scripts in production is probably via nvidia-docker. It is important to freeze the environment, as the DARTS code requires
PyTorch == 0.3.1, torchvision == 0.2.0;
newer PyTorch versions crash for various reasons. The same container image can run on