Issue of multiprocessing when training

michaelliyunhao commented 3 years ago

Hi, thanks for sharing this excellent work. I ran into a multiprocessing problem when running the training code. The command "python motion_imitation/motion_im.py --cfg 0506 --num_threads " works well when I use only one CPU core, but it gets stuck when I set num_threads above 1 (such as 2, 4, or 12). The problem seems to occur in "agent.py" at "pod, worker_memory, worker_logger = queue.get()" when the PPO algorithm starts sampling. Multiprocessing seems to work fine in the first sampling iteration, but in the second iteration the queue stays empty and the whole training process hangs. I'm not familiar with gym or multiprocessing. Do you know how to fix it?

Best Regards.

Khrylx commented 3 years ago

Hi,

You can try setting export OMP_NUM_THREADS=1 in your .bashrc. Also, this kind of multiprocessing problem can happen with some PyTorch versions, such as 1.12, so you can try updating to a newer PyTorch version.
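
If editing .bashrc is inconvenient, the same setting can probably be applied from Python. The sketch below is not the RFC code: `sample_worker` and `collect` are hypothetical stand-ins for the sampling workers and the `queue.get()` loop in agent.py. It assumes the variable is set before torch or numpy is imported, since OpenMP usually reads it when the runtime initializes. It also puts a timeout on `queue.get()`, so a stalled worker raises an error instead of hanging the whole run.

```python
import os

# Assumption: this runs before torch/numpy are imported, because OpenMP
# typically reads OMP_NUM_THREADS once, when its runtime initializes.
os.environ["OMP_NUM_THREADS"] = "1"

import multiprocessing as mp
import queue


def sample_worker(pid, out_queue):
    # Hypothetical stand-in for a sampling worker: it only reports the
    # thread setting it sees, where real code would collect rollouts.
    out_queue.put((pid, os.environ.get("OMP_NUM_THREADS")))


def collect(num_workers, timeout=30.0):
    # Start the workers; they inherit the parent's environment variables.
    out_queue = mp.Queue()
    workers = [
        mp.Process(target=sample_worker, args=(pid, out_queue))
        for pid in range(num_workers)
    ]
    for worker in workers:
        worker.start()

    results = {}
    try:
        for _ in range(num_workers):
            # A timeout turns a silent hang into a visible error.
            pid, omp_threads = out_queue.get(timeout=timeout)
            results[pid] = omp_threads
    except queue.Empty:
        raise RuntimeError("a worker did not report within the timeout")
    finally:
        for worker in workers:
            worker.join(timeout=timeout)
    return results


if __name__ == "__main__":
    print(collect(num_workers=2))
```

Each worker should report the value "1". If the timeout fires instead, that at least shows the workers are the part that stalls, rather than the main process.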

michaelliyunhao commented 3 years ago

Thanks! Setting export OMP_NUM_THREADS=1 really works.

Khrylx / RFC

Issue of multiprocessing when training #3