churchlab / ml-ami-builder

Packer scripts to build nvidia-enabled AMIs

nvidia-docker CUDA driver issue #6

Closed glebkuznetsov closed 6 years ago

glebkuznetsov commented 6 years ago

TL;DR:

We get this error after using a GPU machine built from the wyss-mlpe AMI for a while:

nvidia-docker | 2017/10/06 22:04:12 Error: unsupported CUDA version: driver 0.0 < image 8.0.61

Once you get this error, it's no longer possible to run nvidia-docker, so you have to go make a new machine. (Fortunately, with our workflow being in Docker now, this is not that painful.)
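Before rebuilding, a quick sanity check is to confirm whether the host driver is still loaded and whether nvidia-docker can still see it. This is a rough sketch, assuming the nvidia-docker 1.x CLI and the public nvidia/cuda:8.0-runtime image (the tag is just an example):

# Check that the kernel module and driver are still visible on the host
nvidia-smi
cat /proc/driver/nvidia/version

# Check whether nvidia-docker can still resolve the driver
nvidia-docker run --rm nvidia/cuda:8.0-runtime nvidia-smi

If nvidia-smi works on the host but the container run fails with the "unsupported CUDA version: driver 0.0" error, the problem is on the nvidia-docker/driver-lookup side rather than the GPU itself.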

More details:

We can't tell exactly what triggers this issue, but it seems to have something to do with interrupting a Docker container in any way other than Ctrl+C. After that, the machine seems to get confused about where the NVIDIA driver is.

I suspect we’re doing something funky in the nvidia install but can’t really tell what.

@cancan101 pointed out that they encountered a possibly related issue at Butterfly, where Ubuntu was auto-updating the NVIDIA drivers and breaking things. They are now using the 384.x driver with the TensorFlow 1.3 Docker image from Google, which works.
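If the auto-update is indeed the trigger, one way to stop apt from swapping the driver out from under the loaded kernel module is to hold the driver package. A minimal sketch, assuming the driver came from the nvidia-384 apt package:

# Prevent apt/unattended-upgrades from upgrading the installed driver
sudo apt-mark hold nvidia-384

# Verify the hold is in place
apt-mark showhold

The trade-off is that driver security/bugfix updates then have to be applied manually (ideally by baking a new AMI rather than upgrading in place).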

Possibly helpful:

https://medium.com/@flavienguillocheau/documenting-docker-with-gpu-deep-learning-for-noobs-2edd350ab2f7

cancan101 commented 6 years ago

We saw the following in /var/log/apt/history.log:

Commandline: /usr/bin/unattended-upgrade
Install:
...
nvidia-384:amd64 (384.66-0ubuntu1, 384.90-0ubuntu0.16.04.1),
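This shows unattended-upgrade replacing the running 384.66 driver with 384.90. Besides holding the package, unattended-upgrades can be told to skip NVIDIA packages entirely via its package blacklist. A sketch of the relevant stanza in /etc/apt/apt.conf.d/50unattended-upgrades (untested here; entries are matched as package-name prefixes):

// Skip all nvidia-* packages during unattended upgrades
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
};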
grinner commented 6 years ago

This is fixed with recent commits. We can run on p3 instances now.