Kaixhin / Atari

Persistent advantage learning dueling double DQN for the Arcade Learning Environment

Async A3C Network Outputs NaN #50

Closed: lordzapharos closed this issue 8 years ago

lordzapharos commented 8 years ago

Fresh Torch7 install here on Linux Mint 17 (not using CUDA). I can run all of the demo examples (`demo`, `demo-grid`, `demo-async`, and `demo-async-a3c`) without issue. Regular DQN and `async-nstep` also run without issue on Montezuma's Revenge. However, when running `async-a3c`, I get an error `bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at <torchPath>/lib/TH/generic/THTensorRandom.c:120)` shortly after training begins.

The problem occurs at `A3CAgent.lua`, line 54 -- my own print statements have confirmed that the outputs of the network (`probability`, obtained on the previous line) are all NaN. Adding NaN checks in `Model.lua` showed that NaNs are being found in the `nn.SpatialConvolution` 64x64 layer after only a few iterations of training. The problem occurs intermittently (you may need to run it several times before getting the error).
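For reference, here is a minimal sketch of the kind of per-layer NaN check I mean, assuming the model is an `nn.Sequential` called `net` that has just run a forward pass (the function name and variable are illustrative, not the repo's actual code):

```lua
-- Illustrative NaN scan over an nn.Sequential, run after net:forward(state).
-- NaN ~= NaN, so output:ne(output) marks exactly the NaN entries.
local function findNaNs(net)
  for i, m in ipairs(net.modules) do
    if torch.isTensor(m.output) and m.output:ne(m.output):sum() > 0 then
      print('NaN in output of layer ' .. i .. ': ' .. tostring(m))
    end
  end
end
```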

Neither an update nor a complete reinstall of Torch solved the issue. I have verified that the inputs to the network (passed into `A3CAgent.lua`, line 54 as `state`) are between 0 and 1, and it does not appear as if any of the training gradients in `A3CAgent:accumulateGradients()` are producing Inf or NaN.

The issue also occurs when running on a GPU on a Red Hat cluster. Any thoughts?

Kaixhin commented 8 years ago

Thanks for all the detail. I'm pretty sure this is a numerical instability issue that can occur with a softmax output.

@lake4790k I'm guessing the fix on line 88 can also be added after line 53?
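Something along these lines, assuming `probability` is the softmax output that line 54 samples from (an illustrative sketch, not the exact code on line 88):

```lua
-- Sketch only: nudge every entry of the softmax output above zero so that
-- torch.multinomial never sees a distribution whose sum is <= 0.
-- Note this does not help if the entries are already NaN.
local TINY_EPSILON = 1e-20  -- placeholder value
probability:add(TINY_EPSILON)
local action = torch.multinomial(probability, 1):squeeze()
```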

lordzapharos commented 8 years ago

Adding the small epsilon after line 53 does not solve the issue. This looks like it might be a race condition -- setting the number of threads to 1 eliminates the problem entirely (though of course that defeats the whole purpose of A3C). The problem only occurs when two or more threads are being used.

lake4790k commented 8 years ago

This sounds like the network params are getting infected with NaNs, as happened earlier. There were two causes for that, but both have been fixed (proper sharedRmsProp logic and the proper tiny epsilon). It is strange that I had A3C running for a long time with the current code and didn't get this, while you hit it quickly; it could be something different about the environment. Maybe check that OpenBLAS threading is not interfering: if you run only one A3C thread, only one CPU core should be working. By the way, do you have OpenBLAS properly installed and working with Torch?
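A quick sketch of checking/limiting threading from the Lua side, assuming a standard Torch7 install; whether `torch.setnumthreads` actually reaches OpenBLAS depends on how OpenBLAS was compiled, so the environment variables are the more reliable knob:

```lua
require 'torch'
-- Number of OpenMP threads Torch uses for its own tensor ops.
print('torch threads: ' .. torch.getnumthreads())
-- Pin Torch's tensor ops to a single thread; for OpenBLAS itself, setting
-- OPENBLAS_NUM_THREADS=1 (or OMP_NUM_THREADS=1) in the environment before
-- launching the process is the usual way to keep it on one core.
torch.setnumthreads(1)
```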

I still don't have my proper machines set up to run experiments and don't want to torture my laptop with this for too long (after a short run I don't get it), but I will finally put together my 4790k and try to reproduce this!

lordzapharos commented 8 years ago

That did the trick -- either my version of OpenBLAS was out of date or Torch wasn't recognizing it fully. A fresh install of the current version of OpenBLAS solved the issue.

Thanks for the insights!