allenai / ai2thor-rearrangement

🔀 Visual Room Rearrangement
https://ai2thor.allenai.org/rearrangement
Apache License 2.0

xdpyinfo: unable to open display ":0.1". #7

Closed nnsriram97 closed 3 years ago

nnsriram97 commented 3 years ago

Hi,

I am facing an issue while trying to run the baseline models with allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py.

xdpyinfo:  unable to open display ":0.1".
Process ForkServerProcess-2:1:
Traceback (most recent call last):
.
.
.
AssertionError: Invalid DISPLAY :0.1 - cannot find X server with xdpyinfo
04/01 17:30:22 ERROR: Encountered Exception. Terminating train worker 1 [engine.py: 1319]

Any suggestions to solve this? I can run python example.py successfully though.

Lucaweihs commented 3 years ago

Hi @nnsriram97,

Can you give some more information about your machine specs (e.g. # GPUs, operating system)?

The way the current code is set up to run an experiment is as follows:

  1. Count the number of GPUs on your machine by running torch.cuda.device_count().
  2. For each of the above GPUs, assume there is an x-display running on :0.0, :0.1, ..., :0.NUM_GPUS_ON_YOUR_MACHINE_MINUS_ONE; these x-displays are required for running AI2-THOR on Linux machines (a sketch of this naming follows the list).
  3. Set up THOR processes on each of the above x-displays.
  4. Train the agent using the above GPUs for model inference/backprop and THOR simulation.
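
To make step 2 concrete, here is a minimal illustrative sketch of that display naming (this is not the repository's code, just the convention described above):

import torch

# One x-display per GPU, named ":0.<gpu index>"
num_gpus = torch.cuda.device_count()
x_displays = [":0.{}".format(i) for i in range(num_gpus)]
# e.g. on a 2-GPU machine: [":0.0", ":0.1"]; each THOR process is then
# assigned one of these displays (round-robin over training processes).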

From your error message it looks like it can't find the x-display corresponding to GPU 1 (i.e. :0.1). We have a script in AllenAct that will automatically start x-displays on each of your GPUs; see our installation instructions and the script itself (you might have to close any display that's already open on :0.0, or edit the script to not start a new display there). Alternatively, if you already have a display running on :0.0 and don't want to start new ones, you could simply have all of the THOR processes run on a single GPU (you might run out of GPU memory in this case). To do this, modify the lines here to be x_display = ":0.0".

If you want to temporarily use a smaller number of training processes (e.g. 1, for debugging and checking that things work), you can simply change the line here to nprocesses = 1.
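
Put together, the two changes above would look roughly like the following inside the experiment configuration (these are the variable names quoted elsewhere in this thread; the exact surrounding code in your checkout may differ):

# Run every THOR process on the x-display of GPU 0 instead of the default
# per-GPU assignment (the thread writes this as both "0.0" and ":0.0";
# match the format already used in the file):
x_display = "0.0"
# default: x_display = "0.{}".format(devices[process_ind % len(devices)])

# Use a single training process while debugging:
nprocesses = 1
# default: nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1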

nnsriram97 commented 3 years ago

Hi @Lucaweihs,

I have two GTX 1080 Ti's (CUDA 8.0, NVIDIA driver 460) on Ubuntu 16.04 with a display attached. I can successfully run the training code with x_display = "0.0" and nprocesses = 1, but I cannot run it with the default settings.

With x_display = "0.0" and the default nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1, training starts but the program stops suddenly (as you mentioned, it is likely a memory issue). With nprocesses = 15, or any number below 20, I can see the simulator output without it hanging. Output of nvidia-smi while it's running:

|    0   N/A  N/A      5120      C   Train-0                           889MiB |
|    0   N/A  N/A      5532      G   ...3c3596803c491c3da8d43eb2c       70MiB |
|    0   N/A  N/A      5533      G   ...3c3596803c491c3da8d43eb2c       72MiB |
.
.
.

|    0   N/A  N/A      6000      G   ...3c3596803c491c3da8d43eb2c       32MiB |
|    1   N/A  N/A      5121      C   Train-1                           889MiB |
|    1   N/A  N/A      5122      C   Valid-0                           889MiB |
+-----------------------------------------------------------------------------+

Using x_display = "0.{}".format(devices[process_ind % len(devices)]) does not work and rasises the issue as mentioned before. I also tried launching it on a headless server through slurm but got a similar issue xdpyinfo: unable to open display ":0.0".

Does not being able to launch x_display on 0.1 mean only one GPU is being used? Because I see Train-1 running on GPU 1. If that's the case can you suggest ways to run the code utilizing both GPUs with maximum compute? Also is it possible to run jobs on a headless server without sudo access?

Lucaweihs commented 3 years ago

It's strange that the xdpyinfo problem persists if an X-display is set up on :0.1 on GPU 1. Just to double check, can you run DISPLAY=:0.1 glxgears? If everything is running appropriately, you should (after ~5 seconds) see something like this:

$ DISPLAY=:0.1 glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
111939 frames in 5.0 seconds = 22387.727 FPS
114639 frames in 5.0 seconds = 22927.768 FPS

and nvidia-smi should show some small memory usage on the GPU (~4 MB).

Does not being able to launch x_display on 0.1 mean only one GPU is being used?

Thankfully no: you're still using both GPUs for inference/backprop, but all of the THOR instances will be using a single GPU, which can be a bit slower and uses up valuable GPU memory.

Also is it possible to run jobs on a headless server without sudo access?

This is a problem that the AI2-THOR team has been trying to resolve for a while. The short answer: not yet; you'll need your system administrator to set up the x-displays if you don't have sudo access yourself.

nnsriram97 commented 3 years ago

Running DISPLAY=:0.1 glxgears throws Error: couldn't open display :0.1, while DISPLAY=:0.0 glxgears runs without any issue.

Lucaweihs commented 3 years ago

Given that glxgears doesn't run, this suggests to me that you likely don't have an x-server running on :0.1. Do you have sudo access to start such a server? Recall that AllenAct has instructions and a script for starting x-displays on all GPUs:

We have a script in AllenAct that will automatically start x-displays on each of your GPUs, see our installation instructions and the script itself (you might have to close any display that's already open on :0.0 or edit the script to not start a new display there).
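
On a machine with sudo access, the sequence is roughly the following (a sketch only; the startx.py helper and its path come from the AllenAct repository and may differ in your install, and stopping the display manager will close any desktop session on an attached monitor):

# If a desktop X session already owns the display (e.g. via lightdm), stop it first:
sudo service lightdm stop
# Start an x-display on each GPU using AllenAct's helper script:
sudo python scripts/startx.py &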

nnsriram97 commented 3 years ago

Thanks for pointing me to the script. I have sudo access, but I had a monitor attached to my PC with Xorg running for the display. Stopping the display manager via sudo service lightdm stop and then running startx.py worked. Here's a log of nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9917      G   /usr/lib/xorg/Xorg                 41MiB |
|    0   N/A  N/A     10005      C   Train-0                          1299MiB |
|    0   N/A  N/A     11210      G   ...3c3596803c491c3da8d43eb2c       67MiB |
.
.
.
|    0   N/A  N/A     13459      G   ...3c3596803c491c3da8d43eb2c       67MiB |
|    1   N/A  N/A      9917      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A     10006      C   Train-1                           889MiB |
|    1   N/A  N/A     10007      C   Valid-0                           889MiB |
+-----------------------------------------------------------------------------+

But I see all of the THOR instances running on GPU 0 while Train-1 and Valid-0 run on GPU 1. Is that normal?

Also, can you suggest how to debug/view the simulator output for some particular instance of training?

Lucaweihs commented 3 years ago

Interesting, after doing the above you should see ai2thor processes on both GPUs. Can you confirm that:

  • DISPLAY=:0.0 glxgears and DISPLAY=:0.1 glxgears both work for you (you should also see the glxgears process using GPU memory when running nvidia-smi on the appropriate GPU)?
  • you've changed x_display = "0.0" back to x_display = "0.{}".format(devices[process_ind % len(devices)]) within the configuration file?

Also:

  • Can you tell me which version of ai2thor you have installed? I.e. the output of pip list | grep ai2thor. To be safe it might be good to update to the latest version: pip install --upgrade ai2thor.

When running things on headless servers I often like to double check that the ai2thor processes are simulating by using a VNC. To do this yourself you can do the following (you'll need to sudo apt install x11vnc wm2):

export DISPLAY=:0.0
nohup x11vnc -noxdamage -display :0.0 -nopw -once -xrandr -noxrecord -forever -grabalways --httpport 5900&
wm2&

which will start the VNC server on the remote machine. You can then connect to it locally by installing a VNC viewer (e.g. https://www.realvnc.com/en/connect/download/viewer/) and setting up a new connection (cmd+N on Mac) with properties that look something like this (note that 5900 is the http port specified in the above block):

[Screenshot: VNC Viewer connection settings]

nnsriram97 commented 3 years ago
  • DISPLAY=:0.0 glxgears and DISPLAY=:0.1 glxgears both work for you (you should also see the glxgears process using gpu memory when using nvidia-smi on the appropriate gpu)

I can successfully run the above commands and I see some GPU memory being used by glxgears.

  • you've changed x_display = "0.0" back to x_display = "0.{}".format(devices[process_ind % len(devices)]) within the configuration file?

Yes, I had changed it back to the default in rearrange_base.py.

  • Can you tell me which version of ai2thor you have installed? I.e. the output of pip list | grep ai2thor. To be safe it might be good to update to the latest version: pip install --upgrade ai2thor.

I had ai2thor 2.7.2 installed but have since upgraded to 2.7.4. Thanks!

Please refer to issue #3 for status related to running the code.

Lucaweihs commented 3 years ago

Closing this as I believe things are training at a reasonable FPS for you now; let me know if not!

nnsriram97 commented 3 years ago
[06/01 17:46:37 INFO:] Running with args Namespace(approx_ckpt_step_interval=None, approx_ckpt_steps_count=None, checkpoint=None, config_kwargs=None, deterministic_agents=False, deterministic_cudnn=False, disable_config_saving=False, disable_tensorboard=False, eval=False, experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', experiment_base='.', extra_tag='', log_level='info', max_sampler_processes_per_worker=None, output_dir='rearrange_out', restart_pipeline=False, seed=None, skip_checkpoints=0, test_date=None) [main.py: 352]
[06/01 17:46:38 INFO:] Git diff saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37[runner.py: 544]
[06/01 17:46:38 INFO:] Config files saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37  [runner.py: 592]
[06/01 17:46:38 INFO:] Using 2 train workers on devices (device(type='cuda', index=0), device(type='cuda', index=1))    [runner.py: 205]
[06/01 17:46:38 INFO:] Started 2 train processes    [runner.py: 364]
[06/01 17:46:38 INFO:] Using 1 valid workers on devices (device(type='cuda', index=1),) [runner.py: 205]
[06/01 17:46:38 INFO:] Started 1 valid processes    [runner.py: 390]
[06/01 17:46:39 INFO:] train 1 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88750>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88d90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c7d0>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cc50>, 'num_workers': 2, 'device': device(type='cuda', index=1), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 1}  [runner.py: 258]
[06/01 17:46:39 INFO:] valid 0 args {'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88690>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88cd0>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c750>, 'seed': 12345, 'deterministic_cudnn': False, 'deterministic_agents': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cb90>, 'device': device(type='cuda', index=1), 'max_sampler_processes_per_worker': None, 'mode': 'valid', 'worker_id': 0}    [runner.py: 273]
[06/01 17:46:39 INFO:] train 0 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7ba7650>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7ba7c90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee11a5d50>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee11abad0>, 'num_workers': 2, 'device': device(type='cuda', index=0), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 0}  [runner.py: 258]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 0   [engine.py: 1326]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 1   [engine.py: 1326]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
  File "/allenact/algorithms/onpolicy_sync/engine.py", line 1312, in train
    else cast(ActorCriticModel, self.actor_critic.module),
.
.
.
    raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
    [engine.py: 1329]
.
.
.
    raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
.
.
.
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating runner.  [runner.py: 936]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
  File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
    package[1] - 1
Exception: Train worker 1 abnormally terminated
    [runner.py: 937]
Traceback (most recent call last):
  File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
    package[1] - 1
Exception: Train worker 1 abnormally terminated
[06/01 17:46:41 INFO:] Closing train 0  [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 0  [runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 0   [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 1  [runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 1   [runner.py: 1012]
[06/01 17:46:41 INFO:] Closing valid 0  [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining valid 0  [runner.py: 1012]
[06/01 17:46:41 INFO:] KeyboardInterrupt. Terminating valid worker 0    [engine.py: 1596]
[06/01 17:46:41 INFO:] Closed valid 0   [runner.py: 1012]

After updating to the latest versions of the rearrangement repo and allenact, running the baseline models throws the error above. My system details: Ubuntu 16.04, display attached, 2 GPUs, and glxgears runs successfully on DISPLAY=:0.0.

Update

Issue solved. Running ls /tmp/.X11-unix/ listed X0 and X1002. Assigning open_display_strs = ['0'] in ithor_util.py solved the issue by restricting THOR to the attached display only.
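
For context, a hypothetical sketch of the kind of display discovery being overridden here (not the actual ithor_util.py code): X display sockets live in /tmp/.X11-unix as files named X<display number>, and on this machine that directory contained both X0 and the stray X1002.

import os

def discover_open_displays(x11_dir="/tmp/.X11-unix"):
    # Map socket names like "X0" -> display string "0" (hypothetical helper)
    return sorted(name[1:] for name in os.listdir(x11_dir) if name.startswith("X"))

# Auto-discovery along these lines would return ['0', '1002'];
# hard-coding the list restricts THOR to the display attached to the monitor:
open_display_strs = ['0']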

Lucaweihs commented 3 years ago

@nnsriram97, glad to hear you found a solution! We tried to make this "easier" for people by automatically discovering the x-displays, but it looks like that auto-discovery didn't handle the X1002 display well. Any idea where X1002 might have come from? That would help us avoid this in the future.