Open dtch1997 opened 1 year ago
Hi @dtch1997,
Thanks for the bug report! Can you show me the output from nvidia-smi? My guess is that this has something to do with ai2thor requiring that an X-display has been started on the GPUs. If you don't have such X-displays started, you can start them (assuming you have sudo on the machine) by running
sudo $(which ai2thor-xorg) start
There is a way to run ai2thor without starting these displays, but it would require changes to a few more files.
Hi @Lucaweihs,
Sorry to bother you again. I ran into the same problem when training models for ObjectNav. When I try to train the ResNet18-GRU model for ObjectNav in RoboTHOR with:
python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnet18gru_ddppo.py -o storage/objectnav-robothor-rgb-resnet18
The error message is:
[02/11 21:40:29 ERROR:] [train worker 0] Encountered TimeoutError, exiting. [engine.py: 1708]
[02/11 21:40:29 ERROR:] Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1700, in train
self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1495, in run_pipeline
num_paused = self.collect_step_across_all_task_samplers(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 585, in collect_step_across_all_task_samplers
outputs: List[RLStepResult] = self.vector_tasks.step(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 621, in step
return self.wait_step()
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 605, in wait_step
observations.extend(read_fn())
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 60 seconds.
[engine.py: 1711]
[02/11 21:40:29 INFO:] [train worker 0] Closing OnPolicyRLEngine.vector_tasks. [engine.py: 684]
[02/11 21:40:29 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1365]
[02/11 21:40:29 ERROR:] Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1366]
Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
Meanwhile, here is the nvidia-smi output:
Sat Feb 11 21:44:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 39C P8 35W / 350W | 10MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:41:00.0 Off | N/A |
| 30% 39C P8 28W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3147 G /usr/lib/Xorg 8MiB |
+-----------------------------------------------------------------------------+
When I tried to run sudo python scripts/startx.py again, I got this error:
(EE)
Fatal server error:
(EE) Server is already active for display 0
If this server is no longer running, remove /tmp/.X0-lock
and start again.
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE)
Furthermore, I found lots of leftover processes in top like:
/home/*/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329a
I have to use
killall thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
to kill those processes.
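Leftover simulator processes like these can also be found and stopped from Python. The following is only a minimal sketch, not part of AllenAct or AI2-THOR: the helper names (find_pids, terminate, list_stale_thor_pids) are hypothetical, and it assumes a Linux machine where ps -eo pid,args is available and where the stale processes belong to the current user.

```python
import os
import signal
import subprocess


def find_pids(ps_output: str, needle: str = "thor-Linux64") -> list:
    """Return PIDs from `ps -eo pid,args` output whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header row and any malformed lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        if needle in command:
            pids.append(pid)
    return pids


def terminate(pids, sig=signal.SIGTERM) -> None:
    """Send `sig` to every PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass


def list_stale_thor_pids() -> list:
    """Query the live process table (assumes a Linux-style `ps`)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    return find_pids(out)
```

Calling terminate(list_stale_thor_pids()) would then have roughly the same effect as the killall command above, without needing the full release hash.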
Additionally, I hit another error when I ran the command:
python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgbd_resnet18gru_ddppo.py -c pretrained_model_ckpts/robothor-objectnav-challenge-2021/Objectnav-RoboTHOR-RGBD-ResNetGRU-DDPPO/2021-02-09_22-35-15/exp_Objectnav-RoboTHOR-RGBD-ResNetGRU-DDPPO_0.2.0a_300M__stage_00__steps_000170207237.pt
Here is the error:
[02/11 21:35:24 ERROR:] Traceback (most recent call last):
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
self.last_event = self.server.receive()
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
metadata, files = self._recv_message(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
header = self._read_with_timeout(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
step_result: RLStepResult = current_task.step(data)
File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
sr = self._step(action=action)
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
self.env.step({"action": action_str})
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
return self.controller.step(**action_dict)
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
[vector_sampled_tasks.py: 1083]
[02/11 21:35:24 INFO:] SingleProcessVectorSampledTask 0 closing. [vector_sampled_tasks.py: 1087]
[02/11 21:35:24 ERROR:] Worker 6 encountered an exception:
Traceback (most recent call last):
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
self.last_event = self.server.receive()
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
metadata, files = self._recv_message(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
header = self._read_with_timeout(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 403, in _task_sampling_loop_worker
sp_vector_sampled_tasks.command(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1271, in command
return [
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1272, in <listcomp>
g.send((command, data))
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1084, in _task_sampling_loop_generator_fn
raise e
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
step_result: RLStepResult = current_task.step(data)
File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
sr = self._step(action=action)
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
self.env.step({"action": action_str})
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
return self.controller.step(**action_dict)
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
[vector_sampled_tasks.py: 412]
[02/11 21:35:24 INFO:] Worker 6 closing. [vector_sampled_tasks.py: 425]
Process ForkServerProcess-1:7:
[02/11 21:35:24 ERROR:] [train worker 0] Encountered EOFError, exiting. [engine.py: 1708]
Traceback (most recent call last):
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
self.last_event = self.server.receive()
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
metadata, files = self._recv_message(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
header = self._read_with_timeout(
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 415, in _task_sampling_loop_worker
raise e
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 403, in _task_sampling_loop_worker
sp_vector_sampled_tasks.command(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1271, in command
return [
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1272, in <listcomp>
g.send((command, data))
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1084, in _task_sampling_loop_generator_fn
raise e
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
step_result: RLStepResult = current_task.step(data)
File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
sr = self._step(action=action)
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
self.env.step({"action": action_str})
File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
return self.controller.step(**action_dict)
File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
[02/11 21:35:24 ERROR:] Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1700, in train
self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1495, in run_pipeline
num_paused = self.collect_step_across_all_task_samplers(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 585, in collect_step_across_all_task_samplers
outputs: List[RLStepResult] = self.vector_tasks.step(
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 621, in step
return self.wait_step()
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 605, in wait_step
observations.extend(read_fn())
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 276, in read_with_timeout
return read_fn()
File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
[engine.py: 1711]
[02/11 21:35:24 INFO:] [train worker 0] Closing OnPolicyRLEngine.vector_tasks. [engine.py: 684]
[02/11 21:35:24 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1365]
[02/11 21:35:24 ERROR:] Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1366]
Traceback (most recent call last):
File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
Could you help me on these matters?
@Lucaweihs thanks for the prompt response!
I tried running sudo $(which ai2thor-xorg) start as recommended. The effect was that my screen went blank and my computer seemed to go to sleep. After logging in again I couldn't detect any noticeable change. Attempting inference on the RoboTHOR checkpoint ran into the same error as before:
TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.
Is there a way to check whether the necessary displays are on?
Also, here's nvidia-smi:
$ nvidia-smi
Sat Feb 11 14:58:28 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 44C P8 16W / N/A | 605MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2433 G /usr/lib/xorg/Xorg 243MiB |
| 0 N/A N/A 2784 G /usr/bin/gnome-shell 37MiB |
| 0 N/A N/A 3234 G ...AAAAAAAAA= --shared-files 45MiB |
| 0 N/A N/A 4305 G ...1/usr/lib/firefox/firefox 184MiB |
| 0 N/A N/A 5929 G ...veSuggestionsOnlyOnDemand 75MiB |
+-----------------------------------------------------------------------------+
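On the question of how to check whether displays are on: one rough signal is the X lock file, the same /tmp/.X0-lock that the "Server is already active for display 0" error earlier in this thread refers to. The sketch below lists the display numbers that currently have a lock file. The function name active_x_displays is hypothetical, and a lock file can be stale, so treat the result as a hint rather than proof that a server is running.

```python
import re
from pathlib import Path


def active_x_displays(tmp_dir: str = "/tmp") -> list:
    """Return display numbers that have an X lock file (e.g. /tmp/.X0-lock)."""
    pattern = re.compile(r"^\.X(\d+)-lock$")
    displays = []
    for entry in Path(tmp_dir).iterdir():
        match = pattern.match(entry.name)
        if match:
            displays.append(int(match.group(1)))
    return sorted(displays)
```

An empty list on a headless machine would be consistent with no X server having been started; a result such as [0] on a desktop would match the Xorg process visible in the nvidia-smi table.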
Hi @dtch1997,
If I understood correctly, you might be running from a terminal emulator on a computer with a screen attached (so the X server is probably already up for you, and you shouldn't need to call ai2thor-xorg start).
With your virtual environment active, if you start a Python console and then run
from ai2thor.controller import Controller
c=Controller()
does a window pop up?
Hi @jordis-ai2 , to test the inference, I am running Ubuntu 22.04 on my laptop with a monitor attached. I am using the built-in Terminal application.
When I ran:
from ai2thor.controller import Controller
c=Controller()
No window popped up.
I might be missing some detail, but can you try again after doing something like the first answer in this thread? I'm guessing you're using wayland instead of Xorg xserver.
@jordis-ai2 I made the change in /etc/gdm3/custom.conf:
$ cat /etc/gdm3/custom.conf
...
[daemon]
# Uncomment the line below to force the login screen to use Xorg
WaylandEnable=false
...
And verified that I'm now using Xorg:
$ echo $XDG_SESSION_TYPE
x11
That didn't fix the previous issues.
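The same check can be made from inside the Python environment that launches the experiment, which also shows whether DISPLAY is visible to that process. This is a minimal sketch; display_report is a hypothetical helper, and the only assumption is that the session exports the standard XDG_SESSION_TYPE and DISPLAY variables.

```python
import os


def display_report(env=None) -> dict:
    """Summarise the environment variables relevant to the windowing session."""
    env = os.environ if env is None else env
    session = env.get("XDG_SESSION_TYPE", "")
    display = env.get("DISPLAY", "")
    return {
        "session_type": session or "unknown",
        "display": display or "unset",
        "is_wayland": session == "wayland",
        "has_display": bool(display),
    }
```

If has_display is False in the shell where main.py is started, the process has no X display to connect to even when the desktop session itself is x11.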
Hi @jordis-ai2 ,
I'm experiencing a TimeoutError with the 'VectorSampledTask' that is similar to the issue you previously helped with. However, I'm using a headless workstation with Xorg running. I have provided more details, including the error message and commands used, in my previous comments.
Would you be able to suggest a solution or provide guidance on how to address this issue?
Thank you.
Hi @dtch1997,
I've also encountered this issue while training with AllenAct, but I found a temporary solution that may help you as well. I was able to run the code using 'CloudRendering' by adding --config_kwargs "{'headless': True}" to the command-line arguments.
I hope this workaround works for you too.
thanks @xiaobaishu0097 , I tried that but it didn't work due to the following error:
[02/21 09:27:24 ERROR:] Uncaught exception: [system.py: 209]
Traceback (most recent call last):
File "allenact/main.py", line 508, in <module>
main()
File "allenact/main.py", line 456, in main
cfg, srcs = load_config(args)
File "allenact/main.py", line 441, in load_config
config = experiments[0](**config_kwargs)
TypeError: PointNavRoboThorRGBPPOExperimentConfig() takes no arguments
Actually, as far as I understand, I am running the checkpoint without visualization, so I do not understand why rendering is needed or would cause issues. @Lucaweihs would you be able to comment on that?
Also, @jordis-ai2 I tried your previous suggestion, which didn't work. Are there any simple diagnostic commands I can run that would help us pinpoint the problem?
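The TypeError in the traceback above is what Python raises when keyword arguments are passed to a class whose constructor accepts none. That suggests, as an assumption rather than something confirmed in this thread, that the PointNav experiment config used here does not take a headless argument, unlike the config for which the workaround succeeded. The sketch below reproduces the behaviour with two stand-in classes; none of these names exist in AllenAct.

```python
class ConfigWithoutInit:
    """Stand-in for an experiment config that defines no __init__ parameters."""


class ConfigWithKwargs:
    """Stand-in for an experiment config that accepts a `headless` flag."""

    def __init__(self, headless: bool = False):
        self.headless = headless


def build(config_cls, config_kwargs: dict):
    """Mirror the call shown in the traceback: config_cls(**config_kwargs)."""
    return config_cls(**config_kwargs)


def accepts_kwargs(config_cls, config_kwargs: dict) -> bool:
    """Return True if the class can be constructed with these kwargs."""
    try:
        build(config_cls, config_kwargs)
    except TypeError:
        return False
    return True
```

Here accepts_kwargs(ConfigWithoutInit, {"headless": True}) fails in the same way as the PointNav run, while the same kwargs work for ConfigWithKwargs. So --config_kwargs can only help for experiment configs whose constructors actually accept the keys being passed.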
Hi @dtch1997,
I assume there was no error message when you ran my last suggestion (despite no window showing up). I'd try looking at ~/.ai2thor/log/unity.log for some information, since it seems to me that the problem isn't directly related to AllenAct.
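A quick way to pull the most recent entries out of that log is sketched below. The path is the one mentioned above; the helper names tail and error_lines and the keyword list are hypothetical, so adjust them to whatever the log actually contains.

```python
from collections import deque
from pathlib import Path

# Path taken from the comment above; adjust if your install differs.
UNITY_LOG = Path.home() / ".ai2thor" / "log" / "unity.log"


def tail(path, n: int = 40) -> list:
    """Return the last `n` lines of a text file, or [] if it does not exist."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open("r", errors="replace") as handle:
        # deque with maxlen keeps only the final n lines while streaming.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def error_lines(lines, keywords=("error", "exception", "failed")) -> list:
    """Keep only lines mentioning one of the (assumed) keywords."""
    return [line for line in lines if any(k in line.lower() for k in keywords)]
```

For example, printing error_lines(tail(UNITY_LOG, 200)) right after a failed run narrows the log down to the lines most likely to explain why the Unity process never answered.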
In any case, my general advice is to start by running a standalone AI2-THOR Controller and attempt to step with some actions (like "RotateRight" or "MoveAhead") while observing the returned frame (controller.last_event.frame). The first steps guide is a good starting point to check that your setup is correct in this sense. I would perhaps also try the headless mode before moving to a full AllenAct experiment.
Let me know if that helps.
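The standalone check described above could look roughly like the following. It is a sketch under stated assumptions: smoke_test, make_controller and main are hypothetical helpers, the default scene is used, and the headless branch assumes an ai2thor version that ships ai2thor.platform.CloudRendering. The ai2thor imports are kept inside make_controller so that the stepping logic can be exercised without the simulator installed.

```python
def smoke_test(controller, actions=("RotateRight", "MoveAhead")) -> list:
    """Step through `actions` and record success and frame shape for each."""
    results = []
    for action in actions:
        event = controller.step(action=action)
        frame = event.frame
        results.append(
            {
                "action": action,
                "success": bool(event.metadata["lastActionSuccess"]),
                "frame_shape": None if frame is None else tuple(frame.shape),
            }
        )
    return results


def make_controller(headless: bool = False):
    """Start a standalone AI2-THOR controller (requires `pip install ai2thor`)."""
    from ai2thor.controller import Controller

    if headless:
        # CloudRendering renders off-screen, so no X display is needed.
        from ai2thor.platform import CloudRendering

        return Controller(platform=CloudRendering)
    return Controller()


def main(headless: bool = False) -> None:
    """Run the smoke test against a real controller and print each result."""
    controller = make_controller(headless=headless)
    try:
        for row in smoke_test(controller):
            print(row)
    finally:
        controller.stop()
```

Running main() first and main(headless=True) second separates display problems from everything else: if only the headless call returns frames, the X setup is the likely culprit; if both time out, the problem is probably elsewhere.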
Problem
Unable to run pre-trained RoboThor model checkpoint
Steps to reproduce
Followed all instructions at https://allenact.org/tutorials/running-inference-on-a-pretrained-model/. Then ran:
Got the error:
Expected behavior
Able to run inference and save metrics to tensorboard.
Additional context
Running on Python 3.8 in Anaconda