parulaggarwal opened this issue 3 years ago
I found that it happens when I start another X server (even if it is on another GPU and with another display number). As soon as I start the second X server, the first running process shows the above-mentioned behavior. It looks like AI2-THOR cannot be run on more than one GPU if we want to run it in parallel. Can someone please guide me on how I can achieve this?
We run processes on multiple GPUs for various projects. Let us reproduce the issue and we will get back to you.
Just to clarify, to reproduce do you run your Docker container twice? Or does simply launching a second X server without running any ai2thor code while the Docker container is running reproduce your error?
Simply launching a second X server creates the problem. I launch the X servers inside the Docker container.
Suppose there are 2 GPUs; these are the steps:
Okay, I was able to reproduce your error in a few different ways. I believe we are running into an NVIDIA driver / OpenGL and Xorg issue.

To address the first issue of running AI2-THOR in parallel: this works fine as long as you start an Xorg server and then launch one or more processes against that DISPLAY (the colorBounds count remains stable across all processes). If I started another Xorg server while a process was running, my Unity window would disappear and an error would appear in the original Xorg.log.

What this means is that if you want to use Docker and start Xorg within the container, you need to launch all of your ai2thor processes within the same container and avoid starting additional Xorg processes. Alternatively, if you are using ai2thor-docker and have a DISPLAY environment variable set, the container will receive the parameters to use that display instead of starting a new one.
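A minimal sketch of the single-Xorg pattern described above, assuming the Xorg server is already running on display `:0`. The `worker_envs` helper is hypothetical (not part of ai2thor); it only shows the idea of pointing every worker process at the one shared DISPLAY instead of starting a new X server per process:

```python
import os


def worker_envs(n_procs, display=":0"):
    """Build per-process environment dicts for n ai2thor workers that all
    attach to ONE already-running Xorg server -- the pattern that kept the
    colorBounds count stable in the reproduction above.

    Each dict can be passed as the ``env=`` argument of
    ``subprocess.Popen`` when spawning a worker.
    """
    base = dict(os.environ)
    base["DISPLAY"] = display  # every worker shares the same X display
    return [dict(base) for _ in range(n_procs)]
```

For example, `worker_envs(4)` yields four environments all carrying `DISPLAY=:0`; the crucial point is that no worker launches an additional Xorg process.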
I have some fixed trajectories. When I run ai2thor (version 2.1.0) inside a Docker container and try to replicate some trajectories, it sometimes happens that the length of metadata['colorBounds'] grows as large as 20,000, the container occupies a huge amount of memory, and the processing becomes extremely slow. The same image at other times runs fine for some trajectories and then gets stuck; at yet other times the same image runs fine for all trajectories.
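Since the blow-up shows up first as a huge metadata['colorBounds'] list, one hedged workaround is to watch its length after each step and bail out (or restart the controller) before memory balloons. The helper name and the threshold below are illustrative, not part of ai2thor:

```python
def color_bounds_ok(metadata, max_len=5000):
    """Sanity-check an ai2thor event's metadata dict.

    Returns False when the colorBounds list has blown up (the report
    above saw lengths near 20,000), so the caller can restart the
    controller instead of letting memory usage grow unbounded.
    The max_len threshold of 5000 is an assumed, tunable value.
    """
    bounds = metadata.get("colorBounds") or []
    return len(bounds) <= max_len
```

A caller would check `color_bounds_ok(event.metadata)` after each `controller.step(...)` and tear down / relaunch the controller when it returns False.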
I have attached the logs (though I could not spot anything unusual in them): AI2-Thor.zip
The steps I follow are:
(CUDA_VERSION=10.0). Here is the Dockerfile: