Closed gideonite closed 8 years ago
I don't believe libcudnn is installed.
I'll double-check that and see what's involved in adding it. But this is no longer where such requests should be made. Please email:
hpc-support@cbio.mskcc.org
Whoops, didn't mean to close that one. Please do open a ticket via the support email. Second question: I'm confused by your statement that you are "using a docker image".
Wouldn't that imply the docker image would need to contain the needed library, as docker is a chroot environment? What's the actual error you are getting? (Please note I do still believe CuDNN is NOT installed in shared areas... others may have it out there.)
Oh, weird. I had already installed this but forgot.
$ module add cudnn
$ echo $LD_LIBRARY_PATH
/cbio/shared/software/cudnn/7.0/lib64
$ ls /cbio/shared/software/cudnn/7.0/lib64
libcudnn.so libcudnn.so.7.0 libcudnn.so.7.0.64 libcudnn_static.a
Can you see if that works or if it needs a refresh? I show 7.0 as the current version still.
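The check above can also be scripted. This is a minimal sketch, assuming a POSIX shell; the function name `find_cudnn` is hypothetical and not part of the module system. It walks a colon-separated search path and prints the first directory that contains a `libcudnn.so*` file.

```shell
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains a libcudnn.so* file.
# Returns non-zero if no directory contains one.
find_cudnn() (
    # The function body runs in a subshell, so changing IFS here
    # does not leak into the caller.
    IFS=:
    for dir in $1; do
        [ -n "$dir" ] || continue
        for f in "$dir"/libcudnn.so*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$dir"
                exit 0
            fi
        done
    done
    exit 1
)
```

After `module add cudnn`, running `find_cudnn "$LD_LIBRARY_PATH"` should print the module's lib64 directory if the library is visible to the dynamic loader's search path.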
Interesting, the cudnn module is now appearing in the module avail list. I don't think it was there before. This is evidenced by a longer-standing active job that does not have the cudnn module.
So when I request a fresh job, the cudnn module is now available and, once loaded, the system state is as you say: the .so files are in /cbio/shared/software/cudnn/7.0/lib64.
I wound up fixing the problem by upgrading the nvidia base image that I was using for the tensorflow docker image. The default tensorflow image (the one listed on their tutorial) builds on cudnn4-devel. Currently I am using nvidia/cuda:7.5-cudnn5-devel. This fixed the problem.
Might be worth double-checking. I noticed the module file may not have been up to date on the nodes and made sure it was synced.
To be clear, the double check is only needed if you care about the non-docker module... I understood your last sentence... thanks for reporting the matter.
We are updating and likely renaming the CUDNN module, as the "7.0" is really the older "3.0" based on the header file:
#define CUDNN_MAJOR 3
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 07
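To read the version from a header like the one quoted above without opening it by hand, something along these lines should work. It is a sketch only: the function name `cudnn_version_from_header` is hypothetical, and it assumes the three `#define` lines appear in the file in the form shown.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from a cuDNN header
# that defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL.
cudnn_version_from_header() {
    awk '
        # "+ 0" forces a numeric value, so a patch level of "07" prints as 7.
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 + 0 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 + 0 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 + 0 }
        END { printf "%d.%d.%d\n", major, minor, patch }
    ' "$1"
}
```

Pass it the path to `cudnn.h` for the module in question; where that header lives under the module's install prefix is site-specific and not stated in this thread.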
We have, as non-default versions of CUDNN, the following two modules, representing the CUDA 7.5 compatible 5.0 and 5.1rc1 versions respectively:
module load cudnn/5.0
--or--
module load cudnn/5.1
Please test them at your convenience.
We will likely delete or rename the "7.0" to "3.0" but we'd also probably be wise to change the default to the more current version.
There is no urgency to this but the topic came up in the bug tracking system.
These new modules appear to work. We will rename "7.0" to "3.0" shortly.
After that I would strongly recommend people test via the newer modules (5.0 and 5.1) and we can change the default. I don't know who all uses cudnn except @gideonite and @lzamparo.
We will rename the "7.0" module to "3.0" in the middle of the week. It will remain default for perhaps another week.
I can vouch for 5.0, works great on all my tests.
Cool. I'm mostly just being paranoid as I don't know who else used this module.
Probably makes more sense to just make 5.0 the default and listen for problems. Then delete this old one.
I'll make one more Git announce of this and then do the actions. I'll make it via a top level one though as I think this one isn't being seen by anyone. Closing.
Tensorflow v0.8 wants to use libcudnn.so but we don't seem to have it installed on our system. I have searched for it manually in /usr/lib64/ and using the locate command. I am using Tensorflow's docker image, have made sure to get access to a Titan GPU and have run module load cuda Anaconda. Thanks.