Hi @nnsriram97,
Can you give some more information about your machine specs (e.g. # GPUs, operating system)?
The way the current code is set up to run an experiment is as follows: it expects one x-display per GPU on your machine (i.e. `torch.cuda.device_count()` of them), namely `:0.0`, `:0.1`, ..., `:0.NUM_GPUS_ON_YOUR_MACHINE_MINUS_ONE` (these x-displays are required for running AI2-THOR on Linux machines). From your error message it looks like it can't find the x-display for your 1st GPU.

We have a script in AllenAct that will automatically start x-displays on each of your GPUs, see our installation instructions and the script itself (you might have to close any display that's already open on `:0.0` or edit the script to not start a new display there). Alternatively, if you have a display already running on `:0.0` and don't want to start new ones, you could have all the THOR processes run on a single GPU (you might run out of GPU memory in this case). To do this, simply modify the lines here to be `x_display = "0.0"`.

If you want to temporarily use a smaller number of training processes (e.g. 1 for debugging and checking that things work), you can change the line here to be `nprocesses = 1`.
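To make those two edits concrete, here is a minimal, hedged sketch (not the repo's actual code; the default expression is the one quoted later in this thread, and `devices`/`process_ind` only have real meaning inside the experiment config):

```python
import torch

# Illustrative sketch only: how the default config chooses an x-display per
# THOR process, plus the two one-line overrides suggested above.
num_gpus = max(1, torch.cuda.device_count())
devices = list(range(num_gpus))

def default_x_display(process_ind: int) -> str:
    # Default behaviour: spread THOR instances over ":0.0", ":0.1", ...
    return "0.{}".format(devices[process_ind % len(devices)])

# Override 1: pin every THOR instance to the x-display on GPU 0.
x_display = "0.0"

# Override 2: use a single training process while debugging.
nprocesses = 1
```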
Hi @Lucaweihs,

I have 2 GTX 1080Ti's (CUDA 8.0, Nvidia driver 460) on Ubuntu 16.04 with a display attached. I can successfully run the training code with `x_display = "0.0"` and `nprocesses = 1`, but I cannot run it with the default settings.

Using `x_display = "0.0"` and keeping `nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1`, the program runs but then stops suddenly (as you mentioned, it must be a memory issue). With `nprocesses = 15`, or any number < 20, I can see the simulator output without hanging. Output of `nvidia-smi` while it's running:
```
| 0 N/A N/A 5120 C Train-0 889MiB |
| 0 N/A N/A 5532 G ...3c3596803c491c3da8d43eb2c 70MiB |
| 0 N/A N/A 5533 G ...3c3596803c491c3da8d43eb2c 72MiB |
.
.
.
| 0 N/A N/A 6000 G ...3c3596803c491c3da8d43eb2c 32MiB |
| 1 N/A N/A 5121 C Train-1 889MiB |
| 1 N/A N/A 5122 C Valid-0 889MiB |
+-----------------------------------------------------------------------------+
```
Using x_display = "0.{}".format(devices[process_ind % len(devices)])
does not work and rasises the issue as mentioned before. I also tried launching it on a headless server through slurm but got a similar issue xdpyinfo: unable to open display ":0.0".
Does not being able to launch x_display on 0.1 mean only one GPU is being used? Because I see Train-1 running on GPU 1. If that's the case can you suggest ways to run the code utilizing both GPUs with maximum compute? Also is it possible to run jobs on a headless server without sudo access?
It's strange that the xdpyinfo problem persists if an x-display is set up on `:0.1` on GPU 1. Just to double check, can you run `DISPLAY=:0.1 glxgears`? If everything is running appropriately, you should (after ~5 seconds) see something like this:

```
$ DISPLAY=:0.1 glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
111939 frames in 5.0 seconds = 22387.727 FPS
114639 frames in 5.0 seconds = 22927.768 FPS
```

and `nvidia-smi` should show some small memory usage on that GPU (~4 MB).
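A quick way to spot-check this from a shell (a rough sketch; it just assumes `glxgears` and `nvidia-smi` are on your PATH):

```bash
# Run glxgears on GPU 1's display for a few seconds, then look for it in
# nvidia-smi's process list; it should appear as a small "G" (graphics)
# process on GPU 1.
DISPLAY=:0.1 glxgears &
sleep 5
nvidia-smi | grep glxgears
kill $!
```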
> Does not being able to launch an x-display on 0.1 mean only one GPU is being used?

Thankfully no, you're still using both GPUs for inference/backprop, but all of the THOR instances will be using a single GPU, which can be a bit slower and use up valuable GPU memory.

> Also is it possible to run jobs on a headless server without sudo access?

This has been a problem that the AI2-THOR team has been trying to resolve for a while. The short answer: not yet, you'll need your system administrator to set up the x-displays if you don't have sudo access yourself.
Running `DISPLAY=:0.1 glxgears` throws `Error: couldn't open display :0.1`, while `DISPLAY=:0.0 glxgears` runs without any issue.
Given that glxgears doesn't run, this suggests to me that you likely don't have an x-server running on `:0.1`. Do you have sudo access to start such a server? Recall that AllenAct has instructions and a script for starting x-displays on all GPUs:

> We have a script in AllenAct that will automatically start x-displays on each of your GPUs, see our installation instructions and the script itself (you might have to close any display that's already open on `:0.0` or edit the script to not start a new display there).
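For reference, a rough sketch of that workflow (the script name comes from this thread and the install docs; the exact path and service name may differ on your machine):

```bash
# If a desktop session already owns display :0, stop it first.
sudo service lightdm stop

# Start an x-display on each GPU using AllenAct's helper script
# (script path may differ in your checkout; see the install docs).
sudo python scripts/startx.py &

# Give X a moment to come up, then check that a display exists per GPU.
sleep 5
DISPLAY=:0.0 xdpyinfo | head -n 1
DISPLAY=:0.1 xdpyinfo | head -n 1
```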
Thanks for pointing me to the script. I have sudo access, but had a monitor attached to my PC with Xorg running for the display. Stopping the display manager with `sudo service lightdm stop` and then running startx.py worked. Here's a log of nvidia-smi:
```
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
| 0 N/A N/A 9917 G /usr/lib/xorg/Xorg 41MiB |
| 0 N/A N/A 10005 C Train-0 1299MiB |
| 0 N/A N/A 11210 G ...3c3596803c491c3da8d43eb2c 67MiB |
.
.
.
| 0 N/A N/A 13459 G ...3c3596803c491c3da8d43eb2c 67MiB |
| 1 N/A N/A 9917 G /usr/lib/xorg/Xorg 8MiB |
| 1 N/A N/A 10006 C Train-1 889MiB |
| 1 N/A N/A 10007 C Valid-0 889MiB |
+-----------------------------------------------------------------------------+
```
But I see all the THOR instances running on GPU 0 while Train-1 and Valid-0 run on GPU 1. Is that normal?
Also, can you suggest how to debug/view the simulator output for some particular instance of training?
Interesting, after doing the above you should see ai2thor processes on both GPUs. Can you confirm that:

* `DISPLAY=:0.0 glxgears` and `DISPLAY=:0.1 glxgears` both work for you (you should also see the glxgears process using GPU memory when using nvidia-smi on the appropriate GPU)?
* You've changed `x_display = "0.0"` back to `x_display = "0.{}".format(devices[process_ind % len(devices)])` within the configuration file?

Also:

* Can you tell me which version of ai2thor you have installed? I.e. the output of `pip list | grep ai2thor`. To be safe it might be good to update to the latest version: `pip install --upgrade ai2thor`.

When running things on headless servers I often like to double check that the ai2thor processes are simulating by using a VNC. To do this yourself you can run the following (you'll need to `sudo apt install x11vnc wm2`):

```bash
export DISPLAY=:0.0
nohup x11vnc -noxdamage -display :0.0 -nopw -once -xrandr -noxrecord -forever -grabalways --httpport 5900 &
wm2 &
```

which will start the VNC server on the remote machine. You can then connect to it locally by installing a VNC viewer (e.g. https://www.realvnc.com/en/connect/download/viewer/) and setting up a new connection (cmd+N on Mac) whose address points at your server on the port specified above (5900).
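For example (hedged; `YOUR_SERVER_HOSTNAME` is a placeholder, and 5900 is just the port used above), the connection from your local machine would look something like:

```bash
# In RealVNC Viewer (cmd+N), enter an address like:
#   YOUR_SERVER_HOSTNAME:5900
# Or, on macOS, the built-in Screen Sharing client understands vnc:// URLs:
open "vnc://YOUR_SERVER_HOSTNAME:5900"
```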
> `DISPLAY=:0.0 glxgears` and `DISPLAY=:0.1 glxgears` both work for you (you should also see the glxgears process using GPU memory when using nvidia-smi on the appropriate GPU)?

I can successfully run the above commands and I see some GPU memory being used by glxgears.

> You've changed `x_display = "0.0"` back to `x_display = "0.{}".format(devices[process_ind % len(devices)])` within the configuration file?

Yes, I had changed it back to the default in rearrange_base.py.

> Can you tell me which version of ai2thor you have installed? I.e. the output of `pip list | grep ai2thor`. To be safe it might be good to update to the latest version: `pip install --upgrade ai2thor`.

I had ai2thor 2.7.2 installed but then upgraded to 2.7.4. Thanks!
Please refer to issue #3 for status related to running the code.
Closing this as I believe things are training at reasonable FPS for you now, let me know if not!
```
[06/01 17:46:37 INFO:] Running with args Namespace(approx_ckpt_step_interval=None, approx_ckpt_steps_count=None, checkpoint=None, config_kwargs=None, deterministic_agents=False, deterministic_cudnn=False, disable_config_saving=False, disable_tensorboard=False, eval=False, experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', experiment_base='.', extra_tag='', log_level='info', max_sampler_processes_per_worker=None, output_dir='rearrange_out', restart_pipeline=False, seed=None, skip_checkpoints=0, test_date=None) [main.py: 352]
[06/01 17:46:38 INFO:] Git diff saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37[runner.py: 544]
[06/01 17:46:38 INFO:] Config files saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37 [runner.py: 592]
[06/01 17:46:38 INFO:] Using 2 train workers on devices (device(type='cuda', index=0), device(type='cuda', index=1)) [runner.py: 205]
[06/01 17:46:38 INFO:] Started 2 train processes [runner.py: 364]
[06/01 17:46:38 INFO:] Using 1 valid workers on devices (device(type='cuda', index=1),) [runner.py: 205]
[06/01 17:46:38 INFO:] Started 1 valid processes [runner.py: 390]
[06/01 17:46:39 INFO:] train 1 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88750>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88d90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c7d0>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cc50>, 'num_workers': 2, 'device': device(type='cuda', index=1), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 1} [runner.py: 258]
[06/01 17:46:39 INFO:] valid 0 args {'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88690>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88cd0>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c750>, 'seed': 12345, 'deterministic_cudnn': False, 'deterministic_agents': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cb90>, 'device': device(type='cuda', index=1), 'max_sampler_processes_per_worker': None, 'mode': 'valid', 'worker_id': 0} [runner.py: 273]
[06/01 17:46:39 INFO:] train 0 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7ba7650>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7ba7c90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee11a5d50>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee11abad0>, 'num_workers': 2, 'device': device(type='cuda', index=0), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 0} [runner.py: 258]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 0 [engine.py: 1326]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 1 [engine.py: 1326]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
File "/allenact/algorithms/onpolicy_sync/engine.py", line 1312, in train
else cast(ActorCriticModel, self.actor_critic.module),
.
.
.
raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[engine.py: 1329]
.
.
.
raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
.
.
.
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating runner. [runner.py: 936]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
package[1] - 1
Exception: Train worker 1 abnormally terminated
[runner.py: 937]
Traceback (most recent call last):
File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
package[1] - 1
Exception: Train worker 1 abnormally terminated
[06/01 17:46:41 INFO:] Closing train 0 [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 0 [runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 0 [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 1 [runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 1 [runner.py: 1012]
[06/01 17:46:41 INFO:] Closing valid 0 [runner.py: 1012]
[06/01 17:46:41 INFO:] Joining valid 0 [runner.py: 1012]
[06/01 17:46:41 INFO:] KeyboardInterrupt. Terminating valid worker 0 [engine.py: 1596]
[06/01 17:46:41 INFO:] Closed valid 0 [runner.py: 1012]
```
After updating to the latest versions of the rearrangement repo and allenact, running the baseline models throws the error logged above. My system details: Ubuntu 16.04, display attached, 2 GPUs, and glxgears runs successfully with `DISPLAY=:0.0`.
Issue solved. Running `ls /tmp/.X11-unix/` gave `X0 X1002` as output. Assigning `open_display_strs = ['0']` in ithor_util.py solved the issue by using only the attached display.
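For anyone hitting the same thing, a hedged sketch of what that change amounts to (illustrative only; the actual discovery code in ithor_util.py differs, but the idea is that each socket `/tmp/.X11-unix/X<N>` corresponds to display `:<N>`):

```python
import glob
import os

def open_x_displays():
    # Discover display numbers from the X11 socket files.
    open_display_strs = [
        os.path.basename(p)[1:] for p in glob.glob("/tmp/.X11-unix/X*")
    ]  # on this machine: ["0", "1002"]

    # Workaround from this thread: ignore the stray X1002 socket and only
    # use the attached display ":0".
    open_display_strs = ["0"]
    return open_display_strs

print(open_x_displays())  # -> ['0']
```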
@nnsriram97 glad to hear you found a solution! We tried to make this "easier" for people by automatically discovering the x-displays, but it looks like this didn't like the `X1002` display. Any idea what `X1002` might be from? This would help us avoid this problem in the future.
Hi,

I am facing an issue while trying to run the baseline models with `allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py`. Any suggestions to solve this? I can run `python example.py` successfully though.