facebookresearch / OccupancyAnticipation

This repository contains code for our publication "Occupancy Anticipation for Efficient Exploration and Navigation" in ECCV 2020.

CUDA out of memory #17

Closed · vincent341 closed this issue 3 years ago

vincent341 commented 3 years ago

"CUDA out of memory" error occurs during training. Would you mind letting me know the minimum GPU capacity requirement for running the code? Or what capacity of GPU do you suggest to run this code?

srama2512 commented 3 years ago

Hi @vincent341,

You might want to change the num_processes and map_batch_size to fit your requirements. The models were trained on 8 GPUs with 16GB memory each. This allowed training on 36 parallel environments spread throughout the GPUs with DataParallel training for the Mapper.
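For example, a hypothetical excerpt of the relevant fields in one of the exploration *.yaml configs might look like the following (the key names follow the fields mentioned in this thread; the exact nesting in the repo's configs may differ):

```yaml
# Hypothetical config excerpt; verify the exact key paths in your *.yaml.
NUM_PROCESSES: 4        # number of parallel habitat environments
MAPPER:
  map_batch_size: 16    # reduce from the default to fit smaller GPUs
```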

vincent341 commented 3 years ago

> Hi @vincent341,
>
> You might want to change the num_processes and map_batch_size to fit your requirements. The models were trained on 8 GPUs with 16GB memory each. This allowed training on 36 parallel environments spread throughout the GPUs with DataParallel training for the Mapper.

Hi @srama2512, thanks for your reply. The program now runs after setting NUM_PROCESSES to 1 in the *.yaml config. I have some questions.

  1. Once NUM_PROCESSES is set to a number larger than 1, I get an EOFError (raise EOFError). I'm not sure whether it is caused by a shortage of CUDA memory. My current PC has a single GeForce 1080, which is an old 8GB GPU. Is it possible to run your program on a PC with a single GPU? How long does the training take on your 8-GPU machine? It would be great if you could give me some hints.

  2. In addition, would you mind letting me know what the NUM_PROCESSES variable actually controls? Is there any relationship between NUM_PROCESSES and the number of GPUs?

  3. Regarding NUM_GLOBAL_UPDATES in this line: I suppose this line is the outer loop for training the algorithm. As far as I understand, the iteration unit of the outer training loop in RL is the episode. I found that NUM_GLOBAL_UPDATES is computed as NUM_GLOBAL_UPDATES = self.config.NUM_EPISODES * NUM_GLOBAL_UPDATES_PER_EPISODE // self.config.NUM_PROCESSES. Could you please explain more about the meaning of this computation?

  4. Training on 36 parallel environments can accelerate training. Does it also benefit the learning performance in any way?

  5. If I understand correctly, the mapper module is trained by a separately spawned process, map_update_func. I tried to run it in debug mode (PyCharm IDE) to see what is going on, but it seems hard to debug a multiprocessing Python program. Would you mind letting me know how you debug it and which tools/IDE you use?

I would truly appreciate any help you could offer on the above questions.

srama2512 commented 3 years ago

Hi @vincent341,

  1. You could try NUM_PROCESSES=4 and reduce MAPPER.map_batch_size significantly (from 420 to, say, 16). We have not tested on an 8GB GPU, unfortunately. All our experiments were run on 8 GPUs, each with 16/32 GB of memory, for around 2 days.
  2. NUM_PROCESSES controls the number of parallel habitat environment instances. Ideally, you should try to have about 6 environments per 16 GB GPU. In the sample config, the 36 environments are spread over GPUs [2, 3, 4, 5, 6, 7], and the mapper training is spread over GPUs [1, 2, 3, 4, 5, 6, 7].
  3. Following the convention from habitat-baselines, we define the training length in terms of the number of policy updates. A global action is sampled every ans_config.goal_interval steps of environment interaction. NUM_GLOBAL_STEPS is the number of such global actions to take before updating the global policy. NUM_GLOBAL_UPDATES_PER_EPISODE measures how many such updates happen in an episode. NUM_GLOBAL_UPDATES therefore measures the number of global updates that correspond to the given number of total episodes (NUM_EPISODES) spread over all the parallel environments (NUM_PROCESSES). I hope this clarifies that computation (see the sketch after this list).
  4. The reason for training on 36 parallel environments is that it provides diverse training data for the policy and the mapper. Each environment typically spawns the agent in a different 3D scene from Gibson (72 scenes in total). This was adopted from the ActiveNeuralSLAM project (they use 72 environments instead of our 36).
  5. This was an optimization to train the mapper in parallel while collecting data. You could modify the code to update sequentially instead of in parallel by making appropriate changes to the mapper update. That is what I did in early versions of the code.
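A minimal sketch of the bookkeeping in (3), with assumed values taken from the numbers quoted in this thread (the identifiers mirror the config fields discussed here; the actual attribute paths in the repository may differ):

```python
# Hypothetical reconstruction of the update-count bookkeeping from (3).
T_EXP = 1000           # environment steps per episode (assumed)
goal_interval = 25     # env steps between global (long-term goal) actions
NUM_GLOBAL_STEPS = 20  # global actions collected per policy update
NUM_EPISODES = 10000   # total episodes, counted over all parallel envs
NUM_PROCESSES = 36     # parallel environments

# One global action is sampled every goal_interval environment steps:
global_actions_per_episode = T_EXP // goal_interval  # 40

# One policy update happens every NUM_GLOBAL_STEPS global actions:
NUM_GLOBAL_UPDATES_PER_EPISODE = global_actions_per_episode // NUM_GLOBAL_STEPS  # 2

# Every update consumes a step from each parallel env simultaneously,
# so the episode budget is shared across the processes:
NUM_GLOBAL_UPDATES = NUM_EPISODES * NUM_GLOBAL_UPDATES_PER_EPISODE // NUM_PROCESSES
print(NUM_GLOBAL_UPDATES)  # 555
```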
vincent341 commented 3 years ago

Hi @srama2512 ,

Thanks so much for your detailed explanation. It truly helps a lot.

AgentEXPL commented 3 years ago

> Hi @vincent341,
>
> 1. You could try NUM_PROCESSES=4 and reduce MAPPER.map_batch_size significantly (from 420 to, say, 16). We have not tested on an 8GB GPU, unfortunately. All our experiments were run on 8 GPUs, each with 16/32 GB of memory, for around 2 days.
> 2. NUM_PROCESSES controls the number of parallel habitat environment instances. Ideally, you should try to have about 6 environments per 16 GB GPU. In the sample config, the 36 environments are spread over GPUs [2, 3, 4, 5, 6, 7], and the mapper training is spread over GPUs [1, 2, 3, 4, 5, 6, 7].
> 3. Following the convention from habitat-baselines, we define the training length in terms of the number of policy updates. A global action is sampled every ans_config.goal_interval steps of environment interaction. NUM_GLOBAL_STEPS is the number of such global actions to take before updating the global policy. NUM_GLOBAL_UPDATES_PER_EPISODE measures how many such updates happen in an episode. NUM_GLOBAL_UPDATES therefore measures the number of global updates that correspond to the given number of total episodes (NUM_EPISODES) spread over all the parallel environments (NUM_PROCESSES). I hope this clarifies that computation.
> 4. The reason for training on 36 parallel environments is that it provides diverse training data for the policy and the mapper. Each environment typically spawns the agent in a different 3D scene from Gibson (72 scenes in total). This was adopted from the ActiveNeuralSLAM project (they use 72 environments instead of our 36).
> 5. This was an optimization to train the mapper in parallel while collecting data. You could modify the code to update sequentially instead of in parallel by making appropriate changes to the mapper update. That is what I did in early versions of the code.

Hi, @srama2512. Based on the config parameters, the number of global updates NUM_GLOBAL_UPDATES can be computed as NUM_EPISODES * (T_EXP // (NUM_GLOBAL_STEPS * goal_interval)) // NUM_PROCESSES = 10000 * (1000 // (20 * 25)) // 36 = 555. In a normal case, how many updates are needed for a DRL network model to achieve good performance? Previously I thought that at least tens of thousands of updates are needed (maybe this is a wrong assumption).

In the equation, a greater NUM_PROCESSES means a smaller NUM_GLOBAL_UPDATES. This also confuses me: why is the number of global updates affected by the number of parallel processes that generate data for the batches? I hope some explanation can be provided. Thanks!
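For reference, a small sketch of this arithmetic (assuming the formula quoted above, with the values from this thread). It also illustrates that the total number of global transitions the policy sees stays roughly constant as NUM_PROCESSES grows, since each update consumes a batch of NUM_PROCESSES * NUM_GLOBAL_STEPS transitions:

```python
# Hypothetical check of the update-count formula quoted above.
NUM_EPISODES, T_EXP, NUM_GLOBAL_STEPS, goal_interval = 10000, 1000, 20, 25

for num_processes in (4, 36, 72):
    updates = NUM_EPISODES * (T_EXP // (NUM_GLOBAL_STEPS * goal_interval)) // num_processes
    # Each update consumes one batch of num_processes * NUM_GLOBAL_STEPS
    # global transitions, so the total experience is (almost) independent
    # of the process count; only the per-update batch size changes.
    transitions = updates * num_processes * NUM_GLOBAL_STEPS
    print(num_processes, updates, transitions)
# 4   5000  400000
# 36   555  399600   (integer division drops the remainder)
# 72   277  398880
```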