NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-caffe and nvidia-digits docker support for cuda8.0? #209

Closed kertansul closed 7 years ago

kertansul commented 7 years ago

Hi, I'm using a GTX 1080 with nvidia-docker/digits and getting an error message while running AlexNet:

relu2 needs backward computation.
conv2 needs backward computation.
pool1 needs backward computation.
norm1 needs backward computation.
relu1 needs backward computation.
conv1 needs backward computation.
label_val-data_1_split does not need backward computation.
val-data does not need backward computation.
This network produces output accuracy
This network produces output loss
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Ignoring source layer train-data
Test net output #0: accuracy = 0.0999041
Test net output #1: loss = 2.30515 (* 1 = 2.30515 loss)
Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE

I checked the NVIDIA/DIGITS GitHub and it seems to be related to CUDA 7.5: https://github.com/NVIDIA/DIGITS/issues/925. However, I still want to use containers for deep learning frameworks.

Will nvidia update the docker images for cuda8.0? Or how could I build nvidia-caffe and nvidia-digits dockerfiles for cuda8.0?


3XX0 commented 7 years ago

We will provide new CUDA 8.0 images eventually. In the meantime, see this comment

kertansul commented 7 years ago

@3XX0 Thanks! I missed that thread when searching.

So once I've built nvidia/caffe with CUDA 8.0, how should I tweak nvidia/digits?

3XX0 commented 7 years ago

Once you have the caffe image the only thing you need to do is rebuild the digits one. You can change the FROM directive to point to your local caffe image.

If you already tagged it with the same name (i.e. caffe:0.15) then you can directly rebuild digits with make -C ubuntu-14.04/digits 4.0
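A minimal sketch of the two options above, not taken from the repository itself. It assumes a locally built image called my-caffe:cuda8 (a hypothetical name) and that the digits Dockerfile lives at ubuntu-14.04/digits/4.0/Dockerfile. The helper rewrite_from is also hypothetical: it only prints a copy of a Dockerfile with its first FROM line replaced, so nothing is built until you run the commented commands yourself.

```shell
# Print a Dockerfile with its first FROM line pointed at a different base image.
# $1: path to the Dockerfile
# $2: replacement image reference (e.g. the hypothetical my-caffe:cuda8)
rewrite_from() {
    awk -v img="$2" '
        !seen && /^FROM[ \t]+/ { print "FROM " img; seen = 1; next }
        { print }
    ' "$1"
}

# Option A (not run here): build digits from a rewritten copy of its Dockerfile.
#   d=ubuntu-14.04/digits/4.0
#   rewrite_from "$d/Dockerfile" my-caffe:cuda8 > "$d/Dockerfile.cuda8"
#   docker build -t digits:4.0 -f "$d/Dockerfile.cuda8" "$d"   # tag name is an assumption
#
# Option B (not run here): give the local image the name the Dockerfile already expects.
#   docker tag my-caffe:cuda8 caffe:0.15
#   make -C ubuntu-14.04/digits 4.0
```

Option A leaves the original Dockerfile untouched, which makes it easier to return to the stock build later.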

kertansul commented 7 years ago

@3XX0 I'm stuck at an error while building nvidia-docker/ubuntu-14.04/digits/4.0/Dockerfile:

Step 6 : RUN apt-get update && apt-get install -y --no-install-recommends --force-yes torch7-nv=0.9.99-1+cuda8.0 graphviz gcc libhdf5-dev digits=$DIGITS_PKG_VERSION && rm -rf /var/lib/apt/lists/*

after a couple of lines ......

Fetched 22.2 MB in 29s (756 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package torch7-nv
E: Unable to locate package digits

I've tried:

1) Rebuilt caffe based on issue #208 and tagged it caffe:0.15, then rebuilt digits with make -C ubuntu-14.04/digits 4.0 => Success, but the rebuild seems to replace my self-built caffe:0.15 (CUDA 8.0) with the original caffe:0.15 (CUDA 7.5). Tested AlexNet on CIFAR-10 and still hit the same error.

2) To prevent the image replacement, I modified nvidia-docker/mk/caffe.mk:
   - under 0.15: changed 7.5-cudnn5-runtime to 8.0-cudnn5-runtime
   - under 0.15: commented out $(NV_DOCKER) build -t caffe:$@ $(CURDIR)/$@
   and then ran make -C ubuntu-14.04/digits 4.0. I was able to generate CUDA images tagged 8.0-runtime and 8.0-cudnn5-runtime, but got stuck at the "Unable to locate package" error.

I also tried the original package version, torch7-nv=0.9.99-1+cuda7.5, but nothing changed.
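One way to check whether a rebuild really swapped caffe:0.15 back to CUDA 7.5 is to look at the environment baked into the image. This is a sketch under the assumption that the base CUDA image sets a CUDA_VERSION environment variable; if it does not, the helper prints nothing. The function name cuda_version_from_env is hypothetical.

```shell
# Read KEY=VALUE lines on stdin and print the value of CUDA_VERSION, if present.
# Assumes the image was built on a CUDA base image that sets this variable.
cuda_version_from_env() {
    sed -n 's/^CUDA_VERSION=//p' | head -n 1
}

# Usage (not run here): dump the image's environment, one variable per line,
# and compare the result before and after running make.
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' caffe:0.15 \
#       | cuda_version_from_env
```

Running this before and after make -C ubuntu-14.04/digits 4.0 shows whether the tag still points at the self-built image.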

flx42 commented 7 years ago

@kertansul you need to add this line: https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/cuda/7.5/runtime/cudnn5/Dockerfile#L4 This is the repository containing the torch7-nv and digits packages.

But be careful: if you install torch7-nv through this repo, you will get the CUDA 7.5 version. For DIGITS it doesn't matter.

kertansul commented 7 years ago

@flx42 Hi, I added the line before ENV DIGITS_PKG_VERSION 4.0.0-1 and ran into two errors:

Error 1: NO_PUBKEY F60F4B3D7FA2AF80
Solved by adding RUN wget -qO - http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/7fa2af80.pub | sudo apt-key add - (based on this reference).
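In this case the key file name (7fa2af80.pub) matches the last eight hex digits of the ID in the NO_PUBKEY message, lowercased. The sketch below derives that short ID; the helper name short_key_id is hypothetical, and whether other repositories name their key files the same way is an assumption. Note that RUN steps in a Dockerfile execute as root unless a USER directive says otherwise, so sudo should not be needed there.

```shell
# Print the lowercased last 8 hex digits of a GPG key ID.
# $1: key ID as reported by apt, e.g. F60F4B3D7FA2AF80
short_key_id() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed 's/^.*\(.\{8\}\)$/\1/'
}

# Usage (not run here; needs network access and root):
#   key=$(short_key_id F60F4B3D7FA2AF80)
#   wget -qO - "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$key.pub" \
#       | apt-key add -
```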

Error 2:
The following packages have unmet dependencies:
 digits : Depends: python-caffe-nv (>= 0.13) but it is not going to be installed
          Depends: caffe-nv (>= 0.13) but it is not going to be installed
 torch7-nv : Depends: cuda-cudart-7-5 but it is not installable
             Depends: cuda-curand-7-5 but it is not installable
             Depends: cuda-cublas-7-5 but it is not installable
             Depends: cuda-ld-conf-7-5 but it is not going to be installed
             Depends: cuda-license-7-5 but it is not installable
             Depends: libnccl1 (>= 1.1.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I'm guessing this happens because I'm mixing CUDA 8.0 and CUDA 7.5. I tried adding apt-get install cuda, but that results in "Unable to locate package". What am I missing?

flx42 commented 7 years ago

Try the new images; they support CUDA 8.0 now. However, we don't have DIGITS 5.0 yet.