Unable to run AI2Thor in parallel on multiple GPUs on a node

allenai / ai2thor

An open-source platform for Visual AI.

http://ai2thor.allenai.org

Apache License 2.0


Unable to run AI2Thor in parallel on multiple GPUs on a node #470

Open parulaggarwal opened 3 years ago

parulaggarwal commented 3 years ago

I have a set of fixed trajectories. When I run ai2thor (version 2.1.0) inside a Docker container and replay them, the length of metadata['colorBounds'] sometimes grows to as much as 20,000, the container consumes a huge amount of memory, and processing becomes extremely slow. The behavior is inconsistent: sometimes the same image runs fine for a few trajectories and then gets stuck, and at other times it runs fine for all trajectories.
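One way to catch the blow-up early is to track the colorBounds count after every step while replaying a trajectory. The following is a minimal sketch, assuming `event.metadata` behaves like a dict with a `colorBounds` list as described above; the helper name `check_color_bounds` and the growth threshold are hypothetical and not part of ai2thor.

```python
def check_color_bounds(metadata, baseline=None, max_growth=10.0):
    """Return (count, suspicious) for one step's metadata dict.

    metadata   -- dict-like step metadata (assumed to hold a 'colorBounds' list)
    baseline   -- count seen in a healthy run, or None if not yet known
    max_growth -- hypothetical factor above baseline that counts as a blow-up
    """
    count = len(metadata.get("colorBounds") or [])
    suspicious = (
        baseline is not None
        and baseline > 0
        and count > baseline * max_growth
    )
    return count, suspicious


if __name__ == "__main__":
    # Stand-in metadata; in a real replay this would be event.metadata.
    healthy = {"colorBounds": [{}] * 40}
    broken = {"colorBounds": [{}] * 20000}
    baseline, _ = check_color_bounds(healthy)
    print(check_color_bounds(broken, baseline=baseline))
```

Calling this after each step and stopping the replay when `suspicious` is true would at least keep the container from exhausting memory while the root cause is investigated.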

I have attached the logs (AI2-Thor.zip), though I could not spot anything unusual in them.

The steps I follow are:

1. Create a Docker image and run the container.
2. Inside the container, start an X server on a particular GPU.
3. Run the code on the same GPU.
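The thread does not show how the X server is pinned to a GPU. A common approach is an xorg.conf `Device` section with a `BusID` entry, and one detail that often trips people up is that `nvidia-smi` and `lspci` print the PCI address in hexadecimal while Xorg expects decimal. The sketch below only does that conversion; `xorg_bus_id` is a hypothetical helper name, the example address is made up, and the PCI domain is assumed to be zero and is dropped.

```python
def xorg_bus_id(pci_address):
    """Convert a hex PCI address to the decimal form used by Xorg's BusID.

    Accepts '00000000:3B:00.0' (nvidia-smi style) or '3B:00.0' (lspci style).
    Assumes PCI domain 0, so the domain part is ignored.
    """
    domain_bus, device_func = pci_address.rsplit(":", 1)
    bus = domain_bus.split(":")[-1]
    device, function = device_func.split(".")
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


if __name__ == "__main__":
    # Made-up address for illustration only.
    print(xorg_bus_id("00000000:3B:00.0"))
```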

(CUDA_VERSION=10.0). Here is the Dockerfile

parulaggarwal commented 3 years ago

I found that this happens when I start another X server, even on a different GPU and with a different display number. As soon as I start the second X server, the first running process shows the behavior described above. It looks like AI2Thor cannot be run in parallel on more than one GPU. Can someone please guide me on how to achieve this?

roozbehm commented 3 years ago

We run processes on multiple GPUs for various projects. Let us reproduce the issue and we will get back to you.

ekolve commented 3 years ago

Just to clarify, to reproduce do you run your Docker container twice? Or does simply launching a second X server without running any ai2thor code while the Docker container is running reproduce your error?

parulaggarwal commented 3 years ago

Simply launching a second X server causes the problem. I launch the X servers inside the Docker containers.

Suppose there are 2 GPUs. These are the steps:

1. Start Docker container 1, say DC1.
2. In DC1, start an X server only on GPU 0.
3. In DC1, run the AI2Thor code. It runs on GPU 0.
4. Start Docker container 2, say DC2.
5. In DC2, start an X server only on GPU 1.
6. Note the effect on the AI2Thor code that was started in step 3.

ekolve commented 3 years ago

Okay, I was able to reproduce your error in a few different ways. I believe we are running into an NVIDIA driver / OpenGL and Xorg issue.

To address the first issue of running AI2Thor in parallel: this works fine as long as you start one Xorg server and then launch one or more processes against that DISPLAY (the colorBounds count remains stable across all processes). When I started another Xorg server while a process was running, my Unity window disappeared and an error appeared in the original Xorg.log.

What this means is that if you want to use Docker and start Xorg within the container, you need to launch all of your ai2thor processes within the same container and avoid starting additional Xorg processes. Alternatively, if you are using ai2thor-docker and have a DISPLAY environment variable set, the container will receive the parameters to use that display instead of starting a new one.
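The pattern described here (one Xorg server, several processes attached to the same DISPLAY) could be sketched roughly as follows. This is a minimal sketch and not ai2thor's own launcher: `shared_display_env` and `launch_workers` are hypothetical helper names, the display value `:0` is an assumption, and the placeholder command only prints its DISPLAY. In practice the command would be your own ai2thor script.

```python
import os
import subprocess
import sys


def shared_display_env(display, base_env=None):
    """Return a copy of the environment with DISPLAY set to one existing X server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def launch_workers(command, num_workers, display):
    """Start num_workers copies of command, all attached to the same DISPLAY.

    No additional X server is started here; the one behind `display`
    is assumed to be running already.
    """
    env = shared_display_env(display)
    return [
        subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)
        for _ in range(num_workers)
    ]


if __name__ == "__main__":
    # Placeholder command; swap in the script that runs your trajectories.
    command = [sys.executable, "-c", "import os; print(os.environ['DISPLAY'])"]
    for proc in launch_workers(command, num_workers=2, display=":0"):
        out, _ = proc.communicate()
        print(out.strip())
```

Passing the display through the environment keeps the workers themselves unchanged: each one simply inherits DISPLAY, which matches the suggestion of launching every process against a single, already running Xorg server.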