facebookresearch / habitat-sim

A flexible, high-performance 3D simulator for Embodied AI research.
https://aihabitat.org/
MIT License

Using habitat in Pytorch multiprocessing #664

Closed: pushkalkatara closed this issue 4 years ago

pushkalkatara commented 4 years ago

I am trying to run multiple Habitat environments in parallel. The workers are created using torch.multiprocessing, but when a worker starts and creates its Habitat environment, I get the following error:

(habitat) Singularity> python train.py 
Shared Network Created
Training started
Shared Network Created
Shared Network Created
Thread  0  ready
sim_cfg.physics_config_file = ./data/default.phys_scene_config.json
Thread  1  ready
sim_cfg.physics_config_file = ./data/default.phys_scene_config.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0627 16:50:59.881245  5525 WindowlessContext.cpp:114] Check failed: eglDevId < numDevices [EGL] Could not find an EGL device for CUDA device 1
*** Check failure stack trace: ***
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0627 16:50:59.904803  5526 WindowlessContext.cpp:114] Check failed: eglDevId < numDevices [EGL] Could not find an EGL device for CUDA device 1
*** Check failure stack trace: ***

I also read this solution to the issue and set torch.multiprocessing.set_start_method('forkserver'), but then I get the error "Context has been already set".

This error only occurs when I load the RL policy onto the GPU. If I load the model on the CPU, the Habitat environment loads fine.

How should I solve this and run multiple Habitat environments in parallel? The environment configs are here. I'm using a PyTorch implementation of A3C.

erikwijmans commented 4 years ago

As said in the issue you linked, you need to use forkserver or spawn. There isn't a way around this, as it's a core limitation of EGL. PyTorch CUDA memory sharing also requires it (https://pytorch.org/docs/stable/multiprocessing.html#sharing-cuda-tensors). Where is the "Context has been already set" error coming from? I don't recall where that is in our code.
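A minimal sketch of launching workers with an explicit spawn context rather than the default fork. The `worker` and `launch` names are hypothetical and not part of habitat-sim or the poster's script. The sketch assumes each worker would create its own Habitat environment inside the child process, and it falls back to the standard library `multiprocessing` module (which exposes the same context API) when torch is not installed.

```python
try:
    import torch.multiprocessing as mp  # re-exports the stdlib multiprocessing API
except ImportError:
    import multiprocessing as mp


def worker(rank, results):
    # Hypothetical worker body. A real worker would presumably build its
    # Habitat environment here, inside the child, and then run training.
    results.put(rank)


def launch(num_workers=2, method="spawn"):
    # A context object pins the start method for these processes only,
    # without touching the global start method.
    ctx = mp.get_context(method)
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, results))
        for rank in range(num_workers)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    ranks = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    return ranks
```

With spawn or forkserver, `launch()` should be called from a script under an `if __name__ == "__main__":` guard, because the children re-import the main module.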

pushkalkatara commented 4 years ago

I traced the error to the start of the script, where torch is imported and the start method is set. Adding force=True to the torch.multiprocessing.set_start_method('forkserver') call solved the "Context has already been set" error. It was more of a PyTorch issue than a Habitat one. Thanks for the help.
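A minimal sketch of that fix, assuming a Unix-like system where the forkserver method is available. The helper name `configure_start_method` is hypothetical. It falls back to the standard library `multiprocessing` module, which provides the same `set_start_method` function, when torch is not installed.

```python
try:
    import torch.multiprocessing as mp  # re-exports the stdlib multiprocessing API
except ImportError:
    import multiprocessing as mp


def configure_start_method(method="forkserver"):
    """Set the global start method, overriding one fixed earlier."""
    try:
        mp.set_start_method(method)
    except RuntimeError:
        # Raised when a start method has already been fixed, for example
        # by an earlier call or by an imported module. force=True replaces it.
        mp.set_start_method(method, force=True)
    return mp.get_start_method()
```

Calling this once near the top of the training script, before any worker is created, keeps the override in one place.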