NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License
652 stars, 195 forks

GA3C source code has High CPU usage causing System freeze or crash #20

Open developeralgo8888 opened 7 years ago

developeralgo8888 commented 7 years ago

The code runs fine, but it leaks CPU and memory and will eventually crash your system. I am using the Glances monitoring tool (pip install glances). If you leave the code running for a long time, the number of CPU context switches increases substantially, and CPU and memory usage keep climbing until the code hangs or crashes. CPU usage increased from 6.7% to 64% and memory from 10% to 79%, at which point the system froze. When I look at the NVIDIA TITAN X (Maxwell, 12 GB memory), it is only using about 300 MB out of 12 GB. So although most of the heavy lifting should be offloaded to the GPU, that does not seem to be happening. I have 8 x TITAN X (Maxwell) GPUs and 2 x Intel Xeon 2660 v3 (2 CPUs with a total of 40 CPU cores) with 128 GB of DDR4 memory, and I can use any of them. I still get the same result: CPU usage keeps increasing.

Any insights?

Other implementations, both the original A3C and various hybrid (CPU and GPU) versions, seem to offload most of the heavy lifting to the GPU and cause no system freezes, but GA3C does not.

I am testing it on various amounts of data and various games.
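For readers who want numbers that can be pasted into the thread rather than read off a dashboard, the growth described above can be logged from inside a Python process. This is a minimal sketch using only the standard library `resource` module (Unix only); the function names `sample_usage` and `log_usage` are hypothetical, and it only measures the process that calls it, so separate worker processes would each need their own sampling.

```python
import resource
import time


def sample_usage():
    """Return peak memory and context-switch counters for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "time": time.time(),
        # Peak resident set size: kilobytes on Linux, bytes on macOS.
        "max_rss": usage.ru_maxrss,
        "voluntary_ctx_switches": usage.ru_nvcsw,
        "involuntary_ctx_switches": usage.ru_nivcsw,
    }


def log_usage(samples=3, interval_seconds=1.0):
    """Collect a few samples so growth between them can be compared."""
    history = []
    for _ in range(samples):
        history.append(sample_usage())
        time.sleep(interval_seconds)
    return history


if __name__ == "__main__":
    for row in log_usage():
        print(row)
```

Because `ru_maxrss` is a peak value, it never decreases; a steadily rising peak over hours is consistent with a leak but does not by itself prove one.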

mbz commented 7 years ago

That's an interesting observation. I've tested the code on a Maxwell TITAN X myself and didn't observe such behavior. Can you please share the versions of your libraries (Python, TensorFlow, CUDA, ...)? My (blind) guess is that this is a problem with TensorFlow. It would also be great if you could share your motherboard spec, since PCI-E is the bottleneck here.

Two side notes:

  1. The low memory usage is due to the small model size. Please note that neither A3C nor GA3C has any "experience memory", so they do not use GPU memory as storage, and the only stored object is the model itself. I would still be interested in your GPU utilization (check with the nvidia-smi command).
  2. The current version of the code is single-GPU, so you currently cannot use more than one GPU.

ifrosio commented 7 years ago

It would also be interesting to know whether the number of agents is increasing during training. That may explain the increase in CPU usage.
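One way to check this is to log how many worker processes are alive at regular intervals. The sketch below assumes the agents run as `multiprocessing` children of the process that calls it, which this thread does not confirm; `count_live_workers` is a hypothetical helper, not part of GA3C.

```python
import multiprocessing
from collections import Counter


def count_live_workers(processes=None):
    """Count live worker processes, grouped by class name.

    `processes` defaults to the children of the calling process.
    """
    if processes is None:
        processes = multiprocessing.active_children()
    return Counter(type(p).__name__ for p in processes if p.is_alive())


if __name__ == "__main__":
    # Called from the parent process, e.g. once per logging interval.
    print(dict(count_live_workers()))
```

If the counts keep rising over the run, that would point to agents being added over time rather than to a leak inside a fixed set of processes.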

mbz commented 7 years ago

@ifrosio that's a very good point. @developeralgo8888 please try with DYNAMIC_SETTINGS=False
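How GA3C reads this flag is not shown in this thread. If the value arrives as a string (for example as a `NAME=VALUE` command-line argument, which is an assumption here), note that in Python `bool("False")` is `True`, so a naive conversion would leave the setting enabled. A minimal sketch of a safer conversion, with hypothetical helper names `parse_bool` and `parse_override`:

```python
def parse_bool(text):
    """Convert a flag string to a bool; bool("False") would be True."""
    value = text.strip().lower()
    if value in ("1", "true", "yes", "on"):
        return True
    if value in ("0", "false", "no", "off", ""):
        return False
    raise ValueError("not a boolean: %r" % text)


def parse_override(argument):
    """Split a hypothetical NAME=VALUE override into (name, bool)."""
    name, _, raw = argument.partition("=")
    return name, parse_bool(raw)


if __name__ == "__main__":
    print(parse_override("DYNAMIC_SETTINGS=False"))
```

Whichever way the flag is set, it is worth printing the value the program actually ends up with before starting a long run.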

developeralgo8888 commented 7 years ago

High_CPU_and_memory.txt

developeralgo8888 commented 7 years ago

Please find the log attached. I restarted the run, and usage has started increasing again as the run progresses.

developeralgo8888 commented 7 years ago

With DYNAMIC_SETTINGS=False, CPU usage remains stable, but there is still a memory leak. Memory usage keeps increasing until the system freezes.

I have attached snapshots taken roughly 12 hours apart: High_CPU_and_memory.txt
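To narrow down where the memory goes, one option is to compare two `tracemalloc` snapshots taken some time apart. This is a sketch, not a diagnosis of GA3C: `top_growth` is a hypothetical helper, the growing list stands in for a real leak, and `tracemalloc` only sees memory allocated through Python's allocator, so growth inside native libraries such as TensorFlow would not show up here.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated size grew the most."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    # Stand-in for a leak: a list that only ever grows.
    leaked = [bytearray(1024) for _ in range(1000)]

    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
    tracemalloc.stop()
```

If the snapshots show no significant growth while the process's resident memory still climbs, the leak is more likely in native code than in Python objects.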