Closed: lmlaaron closed this issue 2 years ago
Hi, thanks for this feedback. To be honest, our previous work was mostly done on our internal clusters, which have about 80 CPU cores per node, so we did not see such issues there. The problem seems to come mainly from the moolib and tensorpipe backend. We are working with the moolib team to resolve it. Let's keep this issue open, and we will post our progress here.
I also tried it on an AWS g3.8xlarge instance (32 CPUs, 2 GPUs, 240 GB RAM, 8 GB RAM per GPU) and observed the same error.
This issue appears to originate in moolib. We have created https://github.com/facebookresearch/moolib/issues/6 to track it, and we will fix it as soon as possible.
https://github.com/facebookresearch/moolib/pull/7 should fix the issue. Could you pull the latest main, rebuild everything, and then try to run the example on your local machine? I tried that on my personal desktop and it works there.
It appears to work on my desktop with the following configuration, after pulling the latest main branch:
CPU: Intel(R) Core(TM) i7-6700 @ 3.40GHz, GPU: GTX 1080 with 8 GB VRAM, Ubuntu 16.04, CUDA 10.2
I did have to change the infer device setting in examples/atari/ppo/conf/conf_ppo.yaml to cuda:0, since I only have one CUDA device on my desktop.
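For reference, the edit was roughly the following; the key names here are written from memory, so treat this as a sketch rather than the exact file contents:

```yaml
# examples/atari/ppo/conf/conf_ppo.yaml (fragment, key names approximate)
# Point both training and inference at the single available GPU.
train_device: "cuda:0"
infer_device: "cuda:0"   # previously referred to a second GPU, e.g. "cuda:1"
```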
I was trying to execute the example program atari_ppo.py on the following machine: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, 32 GB RAM, GTX 1080 with 8 GB VRAM, Ubuntu 16.04, CUDA 10.2.
I have edited my configuration file conf_ppo.yaml to reduce the resource usage.
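The changes were along these lines; the parameter names below are only indicative and may not match the actual keys in conf_ppo.yaml:

```yaml
# conf_ppo.yaml (fragment, indicative keys only) - scaled down for 32 GB RAM and one GPU
num_rollouts: 16          # fewer parallel rollout workers
batch_size: 128           # smaller training batch
replay_buffer_size: 1024  # smaller replay buffer to cap memory usage
```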
Here is what I got:
I tried to modify the timeout, but it seems to fail with the same error. Any hints on how to resolve this?