glebkuznetsov opened this issue 6 years ago
SSH'ed into one of the machines to check the NVIDIA driver and CUDA versions.
NVIDIA driver version:
ubuntu@ip-172-30-2-246:~$ lsmod | grep nvidia
nvidia_uvm 671744 2
nvidia_drm 49152 0
nvidia_modeset 843776 1 nvidia_drm
drm_kms_helper 147456 1 nvidia_drm
drm 364544 3 drm_kms_helper,nvidia_drm
nvidia 13008896 32 nvidia_modeset,nvidia_uvm
CUDA version:
ubuntu@ip-172-30-2-246:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
Got this to work via a manual install and am now proceeding to update the Packer build.
Here are the steps for installing just the NVIDIA driver (no CUDA) and then nvidia-docker2.
Followed these instructions: http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux
Installed nvidia-384
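For reference, the manual install boils down to roughly the following (a sketch from memory of the linked article; the graphics-drivers PPA step is my recollection and not re-verified):

# Add the graphics-drivers PPA and install just the 384 driver (no CUDA toolkit)
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-384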
Running nvidia-smi immediately after the install gives an error, but after a reboot it produces the expected output:
ubuntu@ip-172-30-2-48:~$ nvidia-smi
Fri Feb 2 20:38:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| N/A 32C P0 33W / 300W | 0MiB / 16152MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Then installed docker-ce and nvidia-docker2 following:
https://docs.docker.com/install/linux/docker-ce/ubuntu/#install-docker-ce-1
https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
The Ubuntu repo configuration comes from here: https://nvidia.github.io/nvidia-docker/
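For reference, the repo setup from those pages is roughly the following (paraphrased from the wiki as it stood for nvidia-docker 2.0; double-check against the links above):

# Add the nvidia-docker apt repository
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install nvidia-docker2 and reload the Docker daemon config
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd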
This commit reproduces everything I did manually (I think), but it doesn't work with Packer: it fails at the apt-get install nvidia-384 step in install_nvidia_drivers.sh.
https://github.com/churchlab/ml-ami-builder/commit/35dc2fcc73f3f6c2cea9c9304c0e8462e86d146a
I fixed the bug where the apt-get install nvidia-384 step was hanging (it needed the -y flag!).
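Sketch of the fixed line in install_nvidia_drivers.sh (the surrounding script is not shown here):

# Without -y, apt-get waits on an interactive confirmation prompt,
# which is what made the Packer provisioner hang.
sudo apt-get install -y nvidia-384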
However, when I now try to run nvidia-docker, I get the following error about libcuda.so failing to load. I'm confused because I thought nvidia-docker only requires the NVIDIA drivers, not CUDA.
@grinner Work so far on this branch https://github.com/churchlab/ml-ami-builder/tree/fix-gpu-by-simplifying-nvidia-driver-install
[ec2-34-229-135-123.compute-1.amazonaws.com] run: nvidia-docker run -it -v ~/notebooks:/notebooks --workdir /notebooks/mlpe-gfp-pilot/src/python/scripts --entrypoint python mlpe-gfp-pilot-test-docker-image test_gpu.py
[ec2-34-229-135-123.compute-1.amazonaws.com] out: docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=8.0 --pid=4759 /var/lib/docker/overlay2/c7662ee96e47b8495267e5d615ca6187a2721774ce2720e86e736aa5e9f6d6a6/merged]\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\n\\"\"": unknown.
Nailed it. Had to remove --no-install-recommends.
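Roughly, the change was the following (my understanding is that the package providing libcuda.so.1, e.g. libcuda1-384, is only a recommended dependency of nvidia-384, so --no-install-recommends was skipping it; the exact package split is an assumption):

# Before: sudo apt-get install -y --no-install-recommends nvidia-384
# After: let apt pull in recommended packages, including the libcuda bits
sudo apt-get install -y nvidia-384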
Tests pass!
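(A quick independent check that also works is the smoke test from the nvidia-docker wiki:)

# Should print the same nvidia-smi table from inside a container
nvidia-docker run --rm nvidia/cuda nvidia-smi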
Results for testing ami-ba9b98c0:
c5.xlarge PASS
m4.2xlarge PASS
g3.4xlarge PASS
p3.2xlarge PASS
p3.8xlarge PASS
Provisioned a new instance from the latest AMI, ami-91146aeb, on a GPU instance type.
Then I SSH'ed in and tried running:
I got the error:
I tried this on clean instances of each of p2.xlarge, g3.4xlarge, and p3.2xlarge.