NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License

Suggested Config.py settings for a DGX-1 #5

Open ProgramItUp opened 7 years ago

ProgramItUp commented 7 years ago

After running _train.sh with the default Config.py on a DGX-1 for about an hour, I see that CPU usage stays fairly constant at about 15% and one GPU is used at about 40%.

The settings in Config.py are unchanged: DYNAMIC_SETTINGS = True. The number of trainers varies between 2 and 6, the number of predictors varies between 1 and 2, and the number of agents varies from 34 to 39. I would have expected them to grow to use the available CPU resources.
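
For reference, the knobs I'm referring to look roughly like this; the constant names below are assumed from the default Config.py, and the values are only an illustration of what a manual override (instead of dynamic scheduling) might look like:

# Config.py -- illustrative manual override; constant names assumed from the defaults
DYNAMIC_SETTINGS = False   # stop the auto-tuner from adjusting the counts below
AGENTS = 64                # agent processes generating experience
PREDICTORS = 4             # predictors batching states for GPU inference
TRAINERS = 4               # trainers feeding training batches to the GPU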

1. Are there settings that will better leverage the cores on a DGX-1?
2. It looks like the code in NetworkVP.py is written for a single GPU. With TensorFlow's support for multiple GPUs, do you have plans to add it? On the surface it seems fairly easy to add:

import tensorflow as tf

for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
    with tf.device(d):
        pass  # ... per-GPU calcs here ...

ifrosio commented 7 years ago
  1. We cannot answer this without experimenting. The best approach may be to do a grid search and check that dynamic scheduling is close to optimal, as we expect.
  2. We are working on a multi-GPU implementation. When using a small DNN (e.g. the default A3C network), the bottleneck is the GPU-CPU communication, so adding more GPUs naively does not help in this case. A more sophisticated method is required to leverage the computational power of multiple GPUs.
developeralgo8888 commented 7 years ago

When do you expect the multi-GPU implementation to be ready? 99% of researchers or AI users run multiple NVIDIA GPUs on a single system for research, tests, and quick training before they pull in the big guns -- grid supercomputers. I am not sure why your team did not consider a multi-GPU implementation first; it would have made your code very efficient at using multiple GPUs by simply selecting the number of GPUs to use: 1, 2, 3, 4, or 8.

mbz commented 7 years ago

@developeralgo8888 we don't have an ETA yet, but we are working on it. As @ifrosio mentioned, a naive multi-GPU implementation does not improve the convergence rate and may cause instabilities. A naive data-parallel implementation (which I believe is what you are suggesting 99% of researchers are using) puts more pressure on GA3C's bottleneck (i.e., CPU-GPU communication), so the return is nothing. Feel free to implement the code you are suggesting (it shouldn't be more than two lines of code, as you said), but it's very unlikely to improve performance.
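
For concreteness, here is a rough sketch of the naive data-parallel ("tower") pattern being discussed, written against TensorFlow 1.x as used by GA3C. build_loss is a hypothetical stand-in for the per-tower A3C loss, not a function from this repository:

import tensorflow as tf

NUM_GPUS = 4
BATCH = 128  # must be divisible by NUM_GPUS
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0003)

def build_loss(x):
    # Hypothetical stand-in: a real tower would build the policy/value
    # network and the A3C loss here.
    logits = tf.layers.dense(tf.layers.flatten(x), 4)
    return tf.reduce_mean(tf.square(logits))

# The full batch still arrives from the CPU-side agents/predictors as usual.
states = tf.placeholder(tf.float32, [BATCH, 84, 84, 4])

tower_grads = []
for i, shard in enumerate(tf.split(states, NUM_GPUS, axis=0)):
    with tf.device('/gpu:%d' % i), tf.variable_scope('net', reuse=(i > 0)):
        loss = build_loss(shard)
        tower_grads.append(optimizer.compute_gradients(loss))

# Average each variable's gradient across towers and apply a single update.
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)

Even with the batch split across towers this way, every state still has to cross from the CPU-side agents to the GPUs, so the CPU-GPU transfer that already limits GA3C is not reduced; that is why this kind of data parallelism is not expected to help here.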