cBio / cbio-cluster

MSKCC cBio cluster documentation

can't find libcudnn.so for tensorflow #421

Closed gideonite closed 8 years ago

gideonite commented 8 years ago

TensorFlow v0.8 wants to use libcudnn.so, but we don't seem to have it installed on our system. I have searched for it manually in /usr/lib64/ and with the locate command.

I am using TensorFlow's Docker image, have made sure to get access to a Titan GPU, and have run module load cuda Anaconda.

Thanks.

tatarsky commented 8 years ago

I don't believe libcudnn is installed.

I'll double check that and see what's involved in adding it. But this is no longer where such requests should be made. Please email:

hpc-support@cbio.mskcc.org

tatarsky commented 8 years ago

Whoops, didn't mean to close that one. Please do open a ticket via the support email. Second question: I'm confused by your statement that you are "using a docker image".

Wouldn't that imply the docker image would need to contain the needed library, as docker is a chroot environment? What's the actual error you are getting? (Please note I do still believe CuDNN is NOT installed in shared areas...others may have it out there)

tatarsky commented 8 years ago

Oh, weird. I had already installed this but forgot.

$ module add cudnn
$ echo $LD_LIBRARY_PATH
/cbio/shared/software/cudnn/7.0/lib64
$ ls /cbio/shared/software/cudnn/7.0/lib64
libcudnn.so  libcudnn.so.7.0  libcudnn.so.7.0.64  libcudnn_static.a

Can you see if that works or if it needs a refresh? I show 7.0 as the current version still.
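One way to confirm that a loaded module really exposes the library is to walk the directories on LD_LIBRARY_PATH and look for the file. The following is a minimal sketch, not part of the cluster tooling; the function name find_lib_on_path is hypothetical, and it only checks that the file exists, not that it loads correctly.

```shell
#!/bin/sh
# Hypothetical helper: print every match for a library file name found in a
# colon-separated search path. Returns 0 if at least one match was found.
find_lib_on_path() {
    lib_name=$1
    search_path=$2
    status=1
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        # Skip empty entries produced by leading, trailing, or doubled colons.
        [ -n "$dir" ] || continue
        if [ -e "$dir/$lib_name" ]; then
            printf '%s\n' "$dir/$lib_name"
            status=0
        fi
    done
    IFS=$old_ifs
    return $status
}

# Example use after `module add cudnn`:
# find_lib_on_path libcudnn.so "$LD_LIBRARY_PATH" || echo "libcudnn.so not found"
```

Running the check in both an old and a fresh job would show whether the module environment differs between them.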

gideonite commented 8 years ago

Interesting, the cudnn module is now appearing in the module avail list. I don't think it was there before. This is evidenced by a long-running active job that does not have the cudnn module.

So when I request a fresh job, the cudnn module is now available and once loaded, the system state is as you say: the .so files are in /cbio/shared/software/cudnn/7.0/lib64.

I wound up fixing the problem by upgrading the nvidia base image that I was using for the tensorflow docker image. The default tensorflow image (the one listed in their tutorial) builds on cudnn4-devel. Currently I am using nvidia/cuda:7.5-cudnn5-devel. This fixed the problem.

tatarsky commented 8 years ago

You might want to double check. I noticed the module file may not have been up to date on the nodes and made sure it was synced.

tatarsky commented 8 years ago

To be clear, the double check is only if you care about the non-docker module need...I understood your last sentence...thanks for reporting the matter.

tatarsky commented 8 years ago

We are updating and likely renaming the CUDNN module, as the "7.0" is really the older "3.0" based on the header file.

#define CUDNN_MAJOR      3
#define CUDNN_MINOR      0
#define CUDNN_PATCHLEVEL 07
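The same check can be scripted against any cuDNN header that carries these three defines. This is a rough sketch under that assumption; the function name cudnn_header_version is hypothetical, and the path to cudnn.h depends on the installation, so it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: read the CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL
# defines from a header file and print them as "major.minor.patch".
# Returns non-zero if any of the three defines is missing.
cudnn_header_version() {
    header=$1
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END {
            if (major == "" || minor == "" || patch == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}

# Example use (the header path is an assumption and will vary by site):
# cudnn_header_version /path/to/cudnn/include/cudnn.h
```

Comparing this output with the module's version label is a quick way to spot a mismatch like the "7.0" versus "3.0" one described here.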

We have added two non-default CUDNN modules, both compatible with CUDA 7.5, for versions 5.0 and 5.1rc1 respectively:

module load cudnn/5.0
--or--
module load cudnn/5.1

Please test them at your convenience.

We will likely delete or rename the "7.0" to "3.0" but we'd also probably be wise to change the default to the more current version.

There is no urgency to this but the topic came up in the bug tracking system.

tatarsky commented 8 years ago

These new modules appear to work. We will rename "7.0" to "3.0" shortly.

After that I would strongly recommend people test via the newer modules (5.0 and 5.1) and we can change the default. I don't know who all uses cudnn except @gideonite and @lzamparo.

tatarsky commented 8 years ago

We will rename the "7.0" module to "3.0" in the middle of the week. It will remain default for perhaps another week.

lzamparo commented 8 years ago

I can vouch for 5.0, works great on all my tests.

tatarsky commented 8 years ago

Cool. I'm mostly just being paranoid as I don't know who else used this module.

tatarsky commented 8 years ago

Probably makes more sense to just make 5.0 the default and listen for problems. Then delete this old one.

tatarsky commented 8 years ago

I'll make one more Git announcement of this and then make the changes. I'll make it via a top-level one, though, as I think this one isn't being seen by anyone. Closing.