churchlab / ml-ami-builder

Packer scripts to build nvidia-enabled AMIs

GPU instances no longer seem to work #14

Open glebkuznetsov opened 6 years ago

glebkuznetsov commented 6 years ago

Provisioned a new instance of the latest AMI ami-91146aeb on a GPU instance type.

Then I ssh'ed in and tried running:

ubuntu@ip-172-30-2-213:~$ nvidia-smi

I get the error:

Failed to initialize NVML: Driver/library version mismatch

I tried this with clean instances of each of the p2.xlarge, g3.4xlarge, and p3.2xlarge instance types.
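
A quick way to confirm this kind of mismatch is to compare the kernel module that is currently loaded against what is installed on disk; a minimal diagnostic sketch (assuming standard Ubuntu tooling, not commands from the original report):

# Version of the NVIDIA kernel module currently loaded into the running kernel
cat /proc/driver/nvidia/version

# Version of the nvidia kernel module installed on disk (what a reboot would load)
modinfo nvidia | grep ^version

# Versions of the installed userspace driver packages
dpkg -l | grep -i nvidia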

glebkuznetsov commented 6 years ago

SSH'ed into one of the machines to get the NVIDIA driver and CUDA versions.

NVIDIA driver version:

ubuntu@ip-172-30-2-246:~$ lsmod | grep nvidia
nvidia_uvm            671744  2
nvidia_drm             49152  0
nvidia_modeset        843776  1 nvidia_drm
drm_kms_helper        147456  1 nvidia_drm
drm                   364544  3 drm_kms_helper,nvidia_drm
nvidia              13008896  32 nvidia_modeset,nvidia_uvm

CUDA version:

ubuntu@ip-172-30-2-246:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
glebkuznetsov commented 6 years ago

Got this to work via a manual install and am now proceeding to update the Packer config.

Here are the steps for installing just the NVIDIA drivers (no CUDA toolkit) and then nvidia-docker2; a consolidated sketch follows the list.

  1. Install nvidia drivers

Followed these instructions: http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux

Installed nvidia-384

Running nvidia-smi immediately afterwards gives an error.

But after rebooting and running it again, I get the expected output:

ubuntu@ip-172-30-2-48:~$ nvidia-smi
Fri Feb  2 20:38:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    33W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

  2. Install docker

https://docs.docker.com/install/linux/docker-ce/ubuntu/#install-docker-ce-1

  3. Install nvidia-docker2

https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

Need Ubuntu repos from here: https://nvidia.github.io/nvidia-docker/
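
Pulling these three steps together, the manual install was roughly the following. This is a sketch, assuming the driver comes from the graphics-drivers PPA described in the linked guide; the repository setup lines follow the Docker and nvidia-docker docs rather than anything committed to this repo.

# 1. NVIDIA driver only (no CUDA toolkit); PPA assumed from the linked guide
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install -y nvidia-384
sudo reboot    # nvidia-smi only reports correctly after a reboot

# 2. Docker CE, per the official Ubuntu instructions
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update && sudo apt-get install -y docker-ce

# 3. nvidia-docker2, using the Ubuntu repos from nvidia.github.io/nvidia-docker
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -fsSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd    # reload the Docker daemon config so the nvidia runtime registers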

glebkuznetsov commented 6 years ago

This commit reproduces everything I did manually (I think), but it won't work with packer. It fails at the step apt-get install nvidia-384 in install_nvidia_drivers.sh.

https://github.com/churchlab/ml-ami-builder/commit/35dc2fcc73f3f6c2cea9c9304c0e8462e86d146a

glebkuznetsov commented 6 years ago

I fixed the bug where the nvidia-384 install was hanging (apt-get needed the -y flag!).

However, now when trying to run nvidia-docker, I get the following error about failing to load libcuda.so. I'm confused because I thought nvidia-docker doesn't require the CUDA toolkit, only the NVIDIA drivers.

@grinner Work so far on this branch https://github.com/churchlab/ml-ami-builder/tree/fix-gpu-by-simplifying-nvidia-driver-install

[ec2-34-229-135-123.compute-1.amazonaws.com] run: nvidia-docker run -it -v ~/notebooks:/notebooks --workdir /notebooks/mlpe-gfp-pilot/src/python/scripts --entrypoint python mlpe-gfp-pilot-test-docker-image test_gpu.py
[ec2-34-229-135-123.compute-1.amazonaws.com] out: docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=8.0 --pid=4759 /var/lib/docker/overlay2/c7662ee96e47b8495267e5d615ca6187a2721774ce2720e86e736aa5e9f6d6a6/merged]\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\n\\"\"": unknown.
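
The library it cannot find, libcuda.so.1, is shipped by the driver-side packages rather than the CUDA toolkit, so a quick check on the host (a diagnostic sketch, not from the original thread; the libcuda1-384 package name is an assumption for driver 384 on Ubuntu) looks like:

# libcuda.so.1 comes from the driver packages (e.g. libcuda1-384), not the CUDA toolkit;
# if neither command shows it, nvidia-container-cli has nothing to mount into the container
ldconfig -p | grep libcuda
dpkg -l | grep -i libcuda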

glebkuznetsov commented 6 years ago

Nailed it. Had to remove --no-install-recommends.
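
Presumably --no-install-recommends was skipping the recommended driver packages (such as libcuda1-384) that provide libcuda.so.1. The fixed install line in install_nvidia_drivers.sh then looks roughly like this (a sketch, not the exact committed script):

# -y keeps apt-get from hanging on a confirmation prompt during the Packer build,
# and dropping --no-install-recommends pulls in the libcuda1-* library the container runtime needs
sudo apt-get install -y nvidia-384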

Tests pass!

Results for testing ami-ba9b98c0:
c5.xlarge PASS
m4.2xlarge PASS
g3.4xlarge PASS
p3.2xlarge PASS
p3.8xlarge PASS
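
As a general smoke test on a fresh GPU instance (hedged; the nvidia/cuda image here is the stock Docker Hub image, not this project's own test image), something like the following should print the same nvidia-smi table from inside a container:

# Runs nvidia-smi inside a CUDA base container via the nvidia runtime
nvidia-docker run --rm nvidia/cuda nvidia-smi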