Drastic drop in GPU speed after approximately 10 games

I'm getting an unexpected, but reliable, drop in GPU speed after running 11-ish games of Chapter07/01_dqn_basic.py using the --cuda option. For the first 10 games, I get speeds comparable with the textbook, but speed halves in game 11, and then down to a third in games 12+. I get similar behavior when running Chapter06/02_dqn.pong.py on the GPU too. This happens every time I run the code.

I'm running Python 3.6.10 on Windows 10. I'm using all the textbook's required packages, with the exception of PyTorch. The textbook recommends pytorch==0.4.0, but I couldn't get it to run, so I installed pytorch==1.1.0 with cudatoolkit==10.0. I'm using an NVIDIA TITAN RTX, and my PC has 128GB of RAM.

I also get similar speeds (~400 f/s) when I run the code without the --cuda option for the first 10 or so games, but then it drops to around 10 f/s.

I have no idea what could be the cause of this. I'm relatively new to RL, CUDA and Python, so I'm not sure if it's a problem with the example code or something on my end.

Any ideas? Has anyone else reported this, or is it just me?

[image: 01_dqn_basic py cuda performance]

Hi! A speed drop after the first several games is normal: at the beginning no training is done, because we're only populating the replay buffer. Training is the heaviest operation, so a 5-10x slowdown is OK.

But of course, that doesn't explain why you're getting lower speeds than in the book. There could be a wide variety of reasons for that: a slower card (I used a GTX 1080 Ti), a wrong driver setup, or just overheating.
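
To make the warm-up explanation concrete, here is a minimal sketch of the gating logic, assuming the usual DQN structure in which no optimisation step runs until the replay buffer holds a minimum number of transitions. `REPLAY_INITIAL = 10_000` is an assumption, picked because the slowdown in the log further down appears between frames 9495 and 10461; `ReplayBuffer` and `run` are hypothetical names, not the book's classes, and the training step is only counted here, not performed.

```python
import collections
import random

# Assumed warm-up size: training starts once the buffer holds this many transitions.
REPLAY_INITIAL = 10_000
BATCH_SIZE = 32


class ReplayBuffer:
    """Minimal FIFO experience buffer (a hypothetical stand-in, not the book's class)."""

    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # random.sample accepts a deque because deque is registered as a Sequence
        return random.sample(self.buffer, batch_size)


def run(total_frames, replay_initial=REPLAY_INITIAL):
    """Return how many frames would have triggered a training step."""
    buffer = ReplayBuffer(capacity=100_000)
    train_steps = 0
    for frame_idx in range(total_frames):
        # Dummy (state, action, reward, done, next_state) tuple in place of a real env step
        buffer.append((frame_idx, 0, 0.0, False, frame_idx + 1))
        if len(buffer) < replay_initial:
            continue  # warm-up: only cheap environment steps, no network update
        _batch = buffer.sample(BATCH_SIZE)
        train_steps += 1  # a real agent would compute the loss and call optimizer.step() here
    return train_steps


print(run(9_495), run(10_461))  # 0 462
```

Because every frame after the warm-up also pays for sampling a batch and a forward/backward pass, a multi-fold drop in f/s at that boundary is expected behaviour rather than a bug.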

On Fri, 3 Jul 2020, 8:37, alanballard notifications@github.com wrote:

[image: 01_dqn_basic py cuda performance] https://user-images.githubusercontent.com/39736471/86434283-f5f61580-bcb1-11ea-85ec-723734860c57.PNG

View it on GitHub: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/issues/80

Thank you for your reply. It's been a struggle since I'm learning Python and RL at the same time (not to mention debugging GPU issues), but I've really enjoyed your book so far.

I'm using an NVIDIA TITAN RTX, so I would expect the performance to be at least as good as a GTX 1080 Ti, if not better. I've updated the drivers, but there is no change in performance.

If I use the --cuda option, my GPU usage never exceeds 3% and my CPU usage is approximately 60% for all games. However, the speed drops from ~400 f/s to ~80 f/s after (approximately) game 10 or 11.

If I do not use the --cuda option, then my GPU usage is about 1% (same as internet browsing) and my CPU goes to 100% after game 10. As before, the speed for games #1-#10 is ~400 f/s, but without the cuda option, the speed drops to ~10 f/s after game 10 or 11.

So, for the first 10 games, I can achieve performance close to the textbook's whether I use the --cuda option or not. In either case, the speed dramatically drops after game 10, and my GPU usage is never greater than 3% regardless of which option I choose.

This behavior is almost identical to that reported here: Issue #32
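
Before comparing speeds, it may help to confirm what the installed PyTorch build actually sees. The sketch below is a hypothetical helper (`cuda_report` is not part of the book's code); it only uses standard PyTorch calls such as `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it returns a minimal report if PyTorch is not installed. If the utilisation percentages above come from Windows Task Manager, note that its default GPU graphs may not reflect CUDA compute load, so a low number there is not conclusive on its own.

```python
import importlib.util
import platform


def cuda_report():
    """Collect basic facts about the Python/PyTorch/CUDA setup (diagnostic sketch)."""
    report = {"python": platform.python_version(), "torch_installed": False}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch not importable in this environment
    import torch

    report.update({
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_enabled": torch.backends.cudnn.enabled,
    })
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A CPU-only build (`built_with_cuda` of `None`) or `cuda_available` of `False` would mean `--cuda` cannot actually be using the GPU.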

Maxim (or anyone else who might be reading this), when you have time, would you mind running Chapter07/01_dqn_basic.py --cuda from the 1st edition of the book and answering these questions:

1) In Chapter 7, p. 167, you only report up to the 9th game. Can you let it run until game 15 or so and see whether you also experience a significant drop in speed in the game 10-15 range?
2) What percentage of your GPU and CPU are you using when you execute Chapter07/01_dqn_basic.py --cuda?

I'm not convinced that it's a bad thing that my GPU usage is so low. It's possible that even with maximum parallelization using the current code, I simply can't use more than 3% of my available GPU resources playing pong. If that's the case, then there may be another reason that the speed is so slow during training. I don't think there's any problem with your code, but it may be an issue of old packages vs. new Windows/GPU specs, or maybe old packages doing something unwelcome in my environment (like unnecessarily copying tensors).
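
One way to test the hypothesis that the GPU simply has little to do is to time the environment and training phases separately inside the training loop. Below is a hypothetical `PhaseTimer` helper using only the standard library; the phase names, and where you would wrap the real calls, are assumptions. With CUDA, kernel launches are asynchronous, so in a real loop the training phase should end with `torch.cuda.synchronize()` before the timer stops; otherwise GPU time gets attributed to whatever synchronises next.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a loop."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of measured time spent in each phase."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("env"):
        time.sleep(0.002)  # stand-in for one environment step
    with timer.phase("train"):
        time.sleep(0.05)   # stand-in for loss, backward, optimizer step and synchronize
print({name: round(share, 2) for name, share in timer.shares().items()})
```

If most of the time lands in the environment phase, the loop is CPU-bound and low GPU utilisation would be expected; if the training phase dominates, comparing it across machines becomes the more informative check.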

I'm going to clone the Git repository for your 2nd edition book, create a Python environment with the new package requirements, and re-test the 2nd-edition version of the code there. That should at least tell me whether it's a package version issue.

Thank you for your help.

Below are benchmarks of the first edition code on my hardware (1080 Ti, NVIDIA drivers 440.100, CUDA 10.2, Ubuntu):

804: done 1 games, mean reward -21.000, speed 585.91 f/s, eps 0.99
1732: done 2 games, mean reward -20.500, speed 722.61 f/s, eps 0.98
2797: done 3 games, mean reward -20.000, speed 721.65 f/s, eps 0.97
3665: done 4 games, mean reward -20.250, speed 721.01 f/s, eps 0.96
4453: done 5 games, mean reward -20.400, speed 719.14 f/s, eps 0.96
5328: done 6 games, mean reward -20.333, speed 720.16 f/s, eps 0.95
6146: done 7 games, mean reward -20.429, speed 721.55 f/s, eps 0.94
7269: done 8 games, mean reward -20.375, speed 718.43 f/s, eps 0.93
8534: done 9 games, mean reward -20.111, speed 712.01 f/s, eps 0.91
9495: done 10 games, mean reward -20.200, speed 722.72 f/s, eps 0.91
10461: done 11 games, mean reward -20.273, speed 244.70 f/s, eps 0.90
11408: done 12 games, mean reward -20.250, speed 147.07 f/s, eps 0.89
12264: done 13 games, mean reward -20.308, speed 146.79 f/s, eps 0.88
13239: done 14 games, mean reward -20.286, speed 146.91 f/s, eps 0.87
14048: done 15 games, mean reward -20.333, speed 146.55 f/s, eps 0.86
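
The intermediate value for game 11 can be roughly reproduced with back-of-the-envelope arithmetic, assuming training starts at frame 10,000 (consistent with the drop between frames 9495 and 10461) and that frames run at about 720 f/s before that point and about 147 f/s after it. The reported speed is frames divided by elapsed time, so a game that straddles the warm-up boundary shows a blended value. `blended_fps` is a hypothetical helper written only for this illustration.

```python
def blended_fps(start_frame, end_frame, replay_initial, warmup_fps, train_fps):
    """Frames-per-second reported for one game that may straddle the warm-up boundary.

    Frames before replay_initial are assumed to run at warmup_fps (no training),
    frames from replay_initial onward at train_fps (training every frame).
    """
    total_frames = end_frame - start_frame
    warmup_frames = max(0, min(end_frame, replay_initial) - start_frame)
    train_frames = total_frames - warmup_frames
    elapsed = warmup_frames / warmup_fps + train_frames / train_fps
    return total_frames / elapsed


# Game 11 in the log: frames 9495 -> 10461, i.e. 505 warm-up frames + 461 training frames.
estimate = blended_fps(9495, 10461, replay_initial=10_000, warmup_fps=720.0, train_fps=147.0)
print(f"{estimate:.0f} f/s")  # 252 f/s
```

The estimate of about 252 f/s is close to the logged 244.70 f/s, which suggests the apparent "halving" in game 11 is the warm-up boundary falling mid-game, with the steady training speed only visible from game 12 onward.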

During training, GPU utilisation is about 35%:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 34%   63C    P2    76W / 250W |    583MiB / 11178MiB |     34%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   37C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4404      C   python3                                      573MiB |
+-----------------------------------------------------------------------------+
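
To compare utilisation numbers across machines (including the second question above), the same figures can be sampled with nvidia-smi's query mode instead of being read off a system monitor. The sketch assumes nvidia-smi is on PATH; `--query-gpu` and `--format=csv,noheader,nounits` are real nvidia-smi options, while `parse_gpu_stats` and `sample_gpu_stats` are hypothetical helper names. It uses `subprocess.run` with `stdout=PIPE` rather than `capture_output` so it also runs on Python 3.6.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV rows ("index, util, mem, temp") into a dict per GPU."""
    stats = {}
    for line in csv_text.strip().splitlines():
        index, util, mem, temp = (field.strip() for field in line.split(","))
        stats[int(index)] = {"util_pct": int(util), "mem_mib": int(mem), "temp_c": int(temp)}
    return stats


def sample_gpu_stats():
    """Run nvidia-smi once; return {} if it is missing, fails, or prints non-numeric fields."""
    try:
        result = subprocess.run(QUERY, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                universal_newlines=True, check=True)
        return parse_gpu_stats(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return {}


if __name__ == "__main__":
    print(sample_gpu_stats())
```

Adding `-l 1` to the same command line makes nvidia-smi repeat the query every second, which is convenient for watching utilisation while training runs.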

Without the --cuda option, I get 510 f/s during replay buffer population (the first 10 games), and then the speed decreases to 15 f/s. So the speed-up is 80-100 times, as it should be.

I see your hardware is much better than mine (my CPU is an i5-6600K, and the system has 32GB of RAM), so your numbers should be better.

I'd start with general system/GPU troubleshooting by running standard deep learning benchmarks (like this one: https://github.com/ryujaehun/pytorch-gpu-benchmark) and comparing the numbers.

You might also try the Chapter09 examples from the second edition. That chapter is devoted to applying GPU/PyTorch tricks to speed up the Pong example, so it has plenty of numbers to compare with. Here is a summary of the Chapter 9 samples' performance on my hardware: https://www.dropbox.com/s/qz1tghrv1029efv/Chapter09-benchmarks.png?dl=0
