devsisters / DQN-tensorflow

Tensorflow implementation of Human-Level Control through Deep Reinforcement Learning
MIT License

GPU Utilization #21

Open ch3njust1n opened 7 years ago

ch3njust1n commented 7 years ago

I have a Titan X and have been running the Breakout simulation for over two days now; it's only 7% through training, and nvidia-smi shows the GPU at only 4-5% utilization. The README.md says training only took 30 hours on a 980, so that doesn't seem right. According to main.py, it should be using 100% by default if I don't pass the flag. Is anyone else having this issue, or is it just me?

Edit: `nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE` shows FB Memory Usage at 11423 MiB / 12185 MiB. Does that look correct when using the default GPU setting for Breakout?
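For context, the near-full memory figure mostly reflects how TensorFlow 1.x reserves GPU memory up front, not how busy the device is. A gpu_fraction-style flag is usually wired into the session roughly like the sketch below (a sketch under assumptions; the parsing helper is hypothetical and not necessarily this repo's exact code):

```python
import tensorflow as tf  # TensorFlow 1.x API

def calc_gpu_fraction(fraction_string):
    # Hypothetical helper: parse a flag such as "1/2" into the fraction
    # of GPU memory to reserve for this process.
    idx, num = fraction_string.split('/')
    return float(idx) / float(num)

# Reserving memory is separate from compute utilization: even with
# per_process_gpu_memory_fraction close to 1.0, nvidia-smi can show
# near-full FB memory while GPU-Util stays in the single digits.
gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=calc_gpu_fraction('1/1'))
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```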

infin8Recursion commented 7 years ago

Any luck so far solving the issue?

I am having the same problem with my GTX 1080. Its performance degrades after an hour or so: it starts at 250 it/s with an estimated time to finish of around 45 hours, then drops to 75 it/s with an estimate of around 170 hours.

serialx commented 7 years ago

The it/s dropping is normal: as the agent learns to survive, each game tends to take longer. I don't know whether it's normal for a Titan X to sit at only 7% load, though.

infin8Recursion commented 7 years ago

Isn't it supposed to finish training in 24~30 hours? It did on a 980 Ti. However, that does not seem to be the case with the Titan X and the 1080, even though they outperform it.

Any suggestion about what could be causing such behavior?

@serialx Could you please share with us your setup and the time it took to finish training?

ch3njust1n commented 7 years ago

@infin8Recursion No luck so far.

carpedm20 commented 7 years ago

I've looked into this issue now, and there may be a bug among the recent commits. I'll dig into it and post an update here.

slowbull commented 7 years ago

Is this issue solved yet?

Lan1991Xu commented 7 years ago

In my case it also takes very long. GPU utilization is about 50%, but training would need around 500 hours to complete 50,000,000 steps, which is almost a month.

zcyang commented 7 years ago

any update on this?

carpedm20 commented 7 years ago

We don't have an explicit schedule for fixing this bug, but I recommend trying other great DQN implementations in TensorFlow, such as https://github.com/dennybritz/reinforcement-learning or https://github.com/carpedm20/deep-rl-tensorflow

shengwa commented 7 years ago

Same problem here. I'm trying to use the repository https://github.com/carpedm20/deep-rl-tensorflow instead.

ionelhosu commented 6 years ago

Same problem for me. On my GTX 1070, this repo runs at ~90 iter/sec. https://github.com/carpedm20/deep-rl-tensorflow is faster, at ~120 iter/sec, but by far the fastest implementation (at least on my hardware) is https://github.com/matthiasplappert/keras-rl , running at ~190 iter/sec. If anyone knows of faster implementations, feel free to link them here. I'm looking for the fastest possible implementation since I'm running a load of experiments for 200 million steps each, and even 10 iter/sec can mean finishing an experiment half a day sooner.
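As a quick back-of-the-envelope check on that claim (assuming a constant step rate, which in practice it isn't):

```python
# Rough wall-clock estimate for a 200M-step run at a fixed step rate.
STEPS = 200 * 10**6

for rate in (190, 200):  # iter/sec, with and without a ~10 iter/sec gain
    days = STEPS / float(rate) / 86400
    print("%d iter/sec -> %.1f days" % (rate, days))

# 190 iter/sec -> ~12.2 days; 200 iter/sec -> ~11.6 days,
# i.e. roughly half a day saved per experiment.
```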

ppwwyyxx commented 6 years ago

@ionelhosu Just wanted to point out that it's very hard to compare speed of DQN implementations apple-to-apple. Apart from network and the algorithm (dqn /double dqn, etc), other things can also be different. The most subtle one is "what does each iteration mean". Usually each iteration may include : going forward certain steps in the environment, by either random exploration or using a network, and maybe sample a batch and train on it. The bold parts are all controlled by hyper parameters and is hard to make consistent. Also, due to epsilon-annealing in DQN, the speed is not a constant across training, but gradually going slower as controlled by hyper parameters.
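To make that concrete, here is a minimal sketch of a generic DQN-style training loop (not this repo's actual code; the `env`/`agent`/`replay_memory` interfaces and the hyperparameter names are illustrative assumptions), showing which knobs change what "one iteration" costs:

```python
import random

# Illustrative hyperparameters -- each of these changes the per-iteration
# cost, which is why iter/sec is hard to compare across implementations.
TRAIN_FREQUENCY = 4         # train once every N environment steps
BATCH_SIZE = 32             # minibatch sampled from the replay memory
EPS_START, EPS_END = 1.0, 0.1
EPS_DECAY_STEPS = 1000000   # linear epsilon-annealing horizon

def epsilon_at(step):
    # Linearly anneal epsilon; as it shrinks, more actions come from a
    # network forward pass instead of random sampling, so steps get slower.
    frac = min(step / float(EPS_DECAY_STEPS), 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def run_iteration(step, env, agent, replay_memory):
    # 1) One environment step, by random exploration or a network forward pass.
    if random.random() < epsilon_at(step):
        action = env.sample_random_action()
    else:
        action = agent.predict(env.observation())
    replay_memory.add(env.step(action))

    # 2) Optionally sample a batch and train on it (forward + backward pass).
    if step % TRAIN_FREQUENCY == 0:
        batch = replay_memory.sample(BATCH_SIZE)
        agent.train_on_batch(batch)
```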