arthurxlw / cytonRL

reinforcement learning, deep Q-network, double DQN, dueling DQN, prioritized experience replay
Apache License 2.0

cuDNN Error: CUDNN_STATUS_NOT_INITIALIZED src/cytonLib/Global.cu 67 #1

Open ZhuXingJune opened 5 years ago

ZhuXingJune commented 5 years ago

After compiling, training a model fails with the following output:

A.L.E: Arcade Learning Environment (version 0.6.0)
[Powered by Stella]
Use -help for help screen.
Warning: couldn't load settings file: ./ale.cfg
version 1.0
mode: train
batchSize: 32
dueling: 1
env: roms/breakout.bin
gamma: 0.99
inputFrames: 4
learningRate: 0.0000625
learnStart: 50000
loadModel:
maxEpisodeSteps: 18000
maxSteps: 100000000
optimizer: RMSprop
priorityAlpha: 0.6
priorityBeta: 0.4
progPeriod: 10000
replayMemory: 1000000
saveModel: model/model
savePeriod: 1000000
showScreen: 0
targetQ: 30000
testEGreedy: 0.001
testEpisodes: 100
testMaxEpisodeSteps: 18000
testPeriod: 5000000
updatePeriod: 4
networkSize: 32:64:64:512
eGreedy: 1.0:0.01:5000000
cuDNN Error: CUDNN_STATUS_NOT_INITIALIZED src/cytonLib/Global.cu 67
cytonRl: src/cytonLib/basicHeads.cu:52: cudnnStatus_t cytonLib::checkError(cudnnStatus_t, const char*, int): Assertion `false' failed.
Aborted (core dumped)

My driver version (as reported by NVIDIA-SMI) is 396.26, with cuda=9.2.88 and cudnn=7.1.4. What's wrong with it?

I would really appreciate your answer. @arthurxlw

arthurxlw commented 5 years ago

Hi June, it looks like an environment configuration problem. First check whether the CUDA and cuDNN versions match, then check whether the GPU has enough memory.
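One way to approach the version check is to print the integers returned by `cudaRuntimeGetVersion`, `cudaDriverGetVersion` and `cudnnGetVersion` and decode them. The sketch below is plain C++ and only covers the decoding step, so it compiles without the CUDA toolkit. The names `Version`, `decodeCudaVersion`, `decodeCudnn7Version` and `driverSupportsRuntime` are hypothetical helpers, not part of cytonRL, and the cuDNN encoding is assumed to be the 7.x one.

```cpp
#include <cassert>

// Decoded version numbers (hypothetical helper type, not cytonRL code).
struct Version {
    int major;
    int minor;
    int patch;
};

// CUDA encodes versions as 1000*major + 10*minor, e.g. 9020 for CUDA 9.2.
// This is the integer written by cudaRuntimeGetVersion / cudaDriverGetVersion.
Version decodeCudaVersion(int v) {
    return Version{v / 1000, (v % 1000) / 10, 0};
}

// ASSUMPTION: cuDNN 7.x encoding, 1000*major + 100*minor + patch,
// e.g. 7104 for cuDNN 7.1.4 (the value returned by cudnnGetVersion()).
Version decodeCudnn7Version(int v) {
    return Version{v / 1000, (v % 1000) / 100, v % 100};
}

// The driver must support a CUDA version at least as new as the runtime
// the program was built against; otherwise initialization can fail.
bool driverSupportsRuntime(int driverVersion, int runtimeVersion) {
    return driverVersion >= runtimeVersion;
}

// Usage example with the versions mentioned in this thread.
void versionExampleFromThread() {
    Version cuda = decodeCudaVersion(9020);     // cuda=9.2.88 reports 9020
    Version cudnn = decodeCudnn7Version(7104);  // cudnn=7.1.4 reports 7104
    assert(cuda.major == 9 && cuda.minor == 2);
    assert(cudnn.major == 7 && cudnn.minor == 1 && cudnn.patch == 4);
    assert(driverSupportsRuntime(9020, 9020));
}
```

If the driver value turns out to be lower than the runtime value, the installed driver is too old for the toolkit. For the memory check, `cudaMemGetInfo` reports free and total device memory.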

ZhuXingJune commented 5 years ago

Thanks for your answer! But I have tried both "cuda=8.0, cudnn=6.0" and "cuda=9.2.88, cudnn=7.1.4", and the same error occurs with each. In addition, I found that at line 66 of Global.cu, "cudnnCreate(&cudnnHandle)" returns 1, which is "CUDNN_STATUS_NOT_INITIALIZED". Is this also an environment configuration problem? @arthurxlw
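For reference, the raw return value can be turned into a readable name. In real cuDNN code the library function `cudnnGetErrorString` does this. The standalone sketch below only mirrors the numbering so that it compiles without cuDNN installed. The helper `cudnn7StatusName` is hypothetical, and the table assumes the cuDNN 7.x values of `cudnnStatus_t`; other major versions may number them differently.

```cpp
#include <cassert>
#include <cstring>

// Name lookup for numeric cudnnStatus_t values.
// ASSUMPTION: cuDNN 7.x numbering. Hypothetical helper, not cytonRL code.
const char* cudnn7StatusName(int code) {
    switch (code) {
        case 0:  return "CUDNN_STATUS_SUCCESS";
        case 1:  return "CUDNN_STATUS_NOT_INITIALIZED";
        case 2:  return "CUDNN_STATUS_ALLOC_FAILED";
        case 3:  return "CUDNN_STATUS_BAD_PARAM";
        case 4:  return "CUDNN_STATUS_INTERNAL_ERROR";
        case 5:  return "CUDNN_STATUS_INVALID_VALUE";
        case 6:  return "CUDNN_STATUS_ARCH_MISMATCH";
        case 7:  return "CUDNN_STATUS_MAPPING_ERROR";
        case 8:  return "CUDNN_STATUS_EXECUTION_FAILED";
        case 9:  return "CUDNN_STATUS_NOT_SUPPORTED";
        case 10: return "CUDNN_STATUS_LICENSE_ERROR";
        case 11: return "CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING";
        default: return "unknown status code";
    }
}

// Usage example with the value reported in this thread:
// cudnnCreate(&cudnnHandle) returned 1.
void statusExampleFromThread() {
    assert(std::strcmp(cudnn7StatusName(1),
                       "CUDNN_STATUS_NOT_INITIALIZED") == 0);
}
```

A failure at `cudnnCreate` itself, before any network is built, points at the machine setup rather than at the training code.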

ZhuXingJune commented 5 years ago

Oh, I think I made a stupid mistake: the problem was caused by my own incorrect operation on my server.