allenai / allenact

An open source framework for research in Embodied-AI from AI2.
https://www.allenact.org

TimeOut error when attempting to run pre-trained RoboThor model checkpoint #363

Open dtch1997 opened 1 year ago

dtch1997 commented 1 year ago

Problem

Unable to run pre-trained RoboThor model checkpoint

Steps to reproduce

Followed all instructions at https://allenact.org/tutorials/running-inference-on-a-pretrained-model/, then ran:

PYTHONPATH=. python allenact/main.py \
training_a_pointnav_model \
-o pretrained_model_ckpts/robothor-pointnav-rgb-resnet/ \
-b projects/tutorials \
-c pretrained_model_ckpts/robothor-pointnav-rgb-resnet/checkpoints/PointNavRobothorRGBPPO/2020-08-31_12-13-30/exp_PointNavRobothorRGBPPO__stage_00__steps_000039031200.pt \
--eval

Got the error:

[02/10 14:58:23 ERROR:] Traceback (most recent call last):
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1992, in process_checkpoints
    eval_package = self.run_eval(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1782, in run_eval
    num_paused = self.initialize_storage_and_viz(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 455, in initialize_storage_and_viz
    observations = self.vector_tasks.get_observations()
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 309, in vector_tasks
    self._vector_tasks = VectorSampledTasks(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 234, in __init__
    observation_spaces = [
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 237, in <listcomp>
    for space in read_fn(timeout_to_use=5 * self.read_timeout if self.read_timeout is not None else None)  # type: ignore
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.
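For context, this error is raised when the main process polls a worker's pipe and nothing arrives within the timeout. A minimal sketch of the pattern (a hypothetical simplification, not AllenAct's actual implementation):

```python
import multiprocessing as mp

def read_with_timeout(conn, timeout: float):
    # Wait up to `timeout` seconds for the worker to write to the pipe;
    # if nothing arrives, raise a TimeoutError like the one above.
    if not conn.poll(timeout):
        raise TimeoutError(
            f"Did not receive output from worker for {timeout} seconds."
        )
    return conn.recv()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    child_conn.send("observation")
    print(read_with_timeout(parent_conn, timeout=1.0))  # prints: observation
```

So a `TimeoutError` here usually means the worker (and ultimately the AI2-THOR process behind it) never produced an observation, not that AllenAct itself crashed.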

Expected behavior

Able to run inference and save metrics to tensorboard.


Additional context

Running on Python 3.8 in Anaconda

Lucaweihs commented 1 year ago

Hi @dtch1997,

Thanks for the bug report! Can you show me the output from nvidia-smi? My guess is that this has something to do with ai2thor requiring that an X-display has been started on the GPUs. If you don't have such X-displays started, you can start them (assuming you have sudo on the machine) by running

sudo $(which ai2thor-xorg) start

There is a way to run ai2thor without starting these displays but it would require a few more changes to a few files.
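One quick way to check whether X displays are already being served (a hypothetical check; `xdpyinfo` ships with x11-utils and is not part of ai2thor):

```shell
# Probe a few display numbers; "up" means an X server answered on that display.
for d in 0 1 2; do
  if DISPLAY=":$d" xdpyinfo >/dev/null 2>&1; then
    echo ":$d up"
  else
    echo ":$d down"
  fi
done
```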

xiaobaishu0097 commented 1 year ago

Hi @Lucaweihs ,

Sorry to bother you again. I ran into the same problem when training models for ObjectNav. When I try to train the ResNet18-GRU model for ObjectNav in RoboTHOR with:

python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnet18gru_ddppo.py -o storage/objectnav-robothor-rgb-resnet18

The error message is:

[02/11 21:40:29 ERROR:] [train worker 0] Encountered TimeoutError, exiting.     [engine.py: 1708]
[02/11 21:40:29 ERROR:] Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1700, in train
    self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1495, in run_pipeline
    num_paused = self.collect_step_across_all_task_samplers(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 585, in collect_step_across_all_task_samplers
    outputs: List[RLStepResult] = self.vector_tasks.step(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 621, in step
    return self.wait_step()
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 605, in wait_step
    observations.extend(read_fn())
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 60 seconds.
        [engine.py: 1711]
[02/11 21:40:29 INFO:] [train worker 0] Closing OnPolicyRLEngine.vector_tasks.  [engine.py: 684]
[02/11 21:40:29 ERROR:] Encountered Exception. Terminating runner.      [runner.py: 1365]
[02/11 21:40:29 ERROR:] Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated
        [runner.py: 1366]
Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated

Meanwhile, here is the nvidia-smi output:

Sat Feb 11 21:44:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   39C    P8    35W / 350W |     10MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
| 30%   39C    P8    28W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3147      G   /usr/lib/Xorg                       8MiB |
+-----------------------------------------------------------------------------+

When I tried to run sudo python scripts/startx.py again, I got this error:

(EE)
Fatal server error:
(EE) Server is already active for display 0
        If this server is no longer running, remove /tmp/.X0-lock
        and start again.
(EE)
(EE)
Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
(EE)

Furthermore, I found lots of processes in `top` like:

/home/*/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329a

I have to use

killall thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e

to kill those processes.


Additionally, I hit another error when I ran the command:

python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgbd_resnet18gru_ddppo.py -c pretrained_model_ckpts/robothor-objectnav-challenge-2021/Objectnav-RoboTHOR-RGBD-ResNetGRU-DDPPO/2021-02-09_22-35-15/exp_Objectnav-RoboTHOR-RGBD-ResNetGRU-DDPPO_0.2.0a_300M__stage_00__steps_000170207237.pt

Here is the error:


[02/11 21:35:24 ERROR:] Traceback (most recent call last):
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
    self.last_event = self.server.receive()
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
    metadata, files = self._recv_message(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
    header = self._read_with_timeout(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
    raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
    step_result: RLStepResult = current_task.step(data)
  File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
    sr = self._step(action=action)
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
    self.env.step({"action": action_str})
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
    return self.controller.step(**action_dict)
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
    raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
        [vector_sampled_tasks.py: 1083]
[02/11 21:35:24 INFO:] SingleProcessVectorSampledTask 0 closing.        [vector_sampled_tasks.py: 1087]
[02/11 21:35:24 ERROR:] Worker 6 encountered an exception:
Traceback (most recent call last):
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
    self.last_event = self.server.receive()
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
    metadata, files = self._recv_message(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
    header = self._read_with_timeout(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
    raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 403, in _task_sampling_loop_worker
    sp_vector_sampled_tasks.command(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1271, in command
    return [
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1272, in <listcomp>
    g.send((command, data))
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1084, in _task_sampling_loop_generator_fn
    raise e
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
    step_result: RLStepResult = current_task.step(data)
  File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
    sr = self._step(action=action)
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
    self.env.step({"action": action_str})
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
    return self.controller.step(**action_dict)
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
    raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
        [vector_sampled_tasks.py: 412]
[02/11 21:35:24 INFO:] Worker 6 closing.        [vector_sampled_tasks.py: 425]
Process ForkServerProcess-1:7:
[02/11 21:35:24 ERROR:] [train worker 0] Encountered EOFError, exiting. [engine.py: 1708]
Traceback (most recent call last):
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 979, in step
    self.last_event = self.server.receive()
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 229, in receive
    metadata, files = self._recv_message(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 145, in _recv_message
    header = self._read_with_timeout(
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/fifo_server.py", line 134, in _read_with_timeout
    raise TimeoutError(f"Reading from AI2-THOR backend timed out (using {timeout}s) timeout.")
TimeoutError: Reading from AI2-THOR backend timed out (using 100.0s) timeout.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 415, in _task_sampling_loop_worker
    raise e
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 403, in _task_sampling_loop_worker
    sp_vector_sampled_tasks.command(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1271, in command
    return [
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1272, in <listcomp>
    g.send((command, data))
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1084, in _task_sampling_loop_generator_fn
    raise e
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 977, in _task_sampling_loop_generator_fn
    step_result: RLStepResult = current_task.step(data)
  File "/home/*/Code/*/allenact/base_abstractions/task.py", line 124, in step
    sr = self._step(action=action)
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_tasks.py", line 284, in _step
    self.env.step({"action": action_str})
  File "/home/*/Code/*/allenact_plugins/robothor_plugin/robothor_environment.py", line 475, in step
    return self.controller.step(**action_dict)
  File "/home/*/anaconda3/envs/*/lib/python3.9/site-packages/ai2thor/controller.py", line 997, in step
    raise (TimeoutError if isinstance(e, TimeoutError) else RuntimeError)(
TimeoutError: Error encountered when running action {'action': 'RotateRight', 'sequenceId': 22} in scene FloorPlan_Train2_3.
[02/11 21:35:24 ERROR:] Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1700, in train
    self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 1495, in run_pipeline
    num_paused = self.collect_step_across_all_task_samplers(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/engine.py", line 585, in collect_step_across_all_task_samplers
    outputs: List[RLStepResult] = self.vector_tasks.step(
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 621, in step
    return self.wait_step()
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 605, in wait_step
    observations.extend(read_fn())
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 276, in read_with_timeout
    return read_fn()
  File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/*/anaconda3/envs/*/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
        [engine.py: 1711]
[02/11 21:35:24 INFO:] [train worker 0] Closing OnPolicyRLEngine.vector_tasks.  [engine.py: 684]
[02/11 21:35:24 ERROR:] Encountered Exception. Terminating runner.      [runner.py: 1365]
[02/11 21:35:24 ERROR:] Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated
        [runner.py: 1366]
Traceback (most recent call last):
  File "/home/*/Code/*/allenact/algorithms/onpolicy_sync/runner.py", line 1332, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated

Could you help me with these issues?

dtch1997 commented 1 year ago

@Lucaweihs thanks for the prompt response!

I tried running sudo $(which ai2thor-xorg) start as recommended. The effect was that my screen went blank and my computer seemed to go to sleep. After logging in again I couldn't detect any noticeable change. Attempting inference on the RoboThor checkpoint ran into the same error as before:

TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.

Is there a way to check whether the necessary displays are on?

Also, here's nvidia-smi:

$ nvidia-smi
Sat Feb 11 14:58:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   44C    P8    16W /  N/A |    605MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2433      G   /usr/lib/xorg/Xorg                243MiB |
|    0   N/A  N/A      2784      G   /usr/bin/gnome-shell               37MiB |
|    0   N/A  N/A      3234      G   ...AAAAAAAAA= --shared-files       45MiB |
|    0   N/A  N/A      4305      G   ...1/usr/lib/firefox/firefox      184MiB |
|    0   N/A  N/A      5929      G   ...veSuggestionsOnlyOnDemand       75MiB |
+-----------------------------------------------------------------------------+
jordis-ai2 commented 1 year ago

Hi @dtch1997,

If I understood correctly, you might be running from a terminal emulator on a computer with a screen attached (so an X server is probably already up for you, and therefore you shouldn't need to call ai2thor-xorg start).

With your virtual environment active, if you start a Python console and run

from ai2thor.controller import Controller
c = Controller()

does a window pop up?

dtch1997 commented 1 year ago

Hi @jordis-ai2 , to test the inference, I am running Ubuntu 22.04 on my laptop with a monitor attached. I am using the built-in Terminal application.

When I ran:

from ai2thor.controller import Controller
c = Controller()

no window popped up.

jordis-ai2 commented 1 year ago

I might be missing some detail, but can you try again after doing something like the first answer in this thread? I'm guessing you're using Wayland instead of the Xorg X server.

dtch1997 commented 1 year ago

@jordis-ai2 I made the change in /etc/gdm3/custom.conf:

$ cat /etc/gdm3/custom.conf
...
[daemon]
# Uncomment the line below to force the login screen to use Xorg
WaylandEnable=false
...

And verified that I'm now using xorg:

$ echo $XDG_SESSION_TYPE
x11

This didn't fix the previous issues.

xiaobaishu0097 commented 1 year ago

Hi @jordis-ai2 ,

I'm experiencing a TimeoutError from the `VectorSampledTask` worker similar to the issue you previously helped with. However, I'm using a headless workstation with Xorg running. I have provided more details, including the error messages and the commands used, in my previous comments.

Would you be able to suggest a solution or provide guidance on how to address this issue?

Thank you.

xiaobaishu0097 commented 1 year ago

Hi @dtch1997,

I've also encountered this issue while training with AllenAct, but I found a temporary workaround that may help you as well. I was able to run the code with `CloudRendering` by adding --config_kwargs "{'headless': True}" to the command line.

I hope this workaround works for you too.
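As I understand it, the headless flag ultimately corresponds to constructing the AI2-THOR controller with the Vulkan-based CloudRendering backend instead of the default Linux64 one, which removes the X-server requirement. A rough sketch of that mapping (a hypothetical helper, not AllenAct's actual code):

```python
from typing import Any, Dict

def controller_kwargs(headless: bool) -> Dict[str, Any]:
    # When headless, request the CloudRendering platform; in real code this
    # would be ai2thor.platform.CloudRendering rather than a string.
    kwargs: Dict[str, Any] = {}
    if headless:
        kwargs["platform"] = "CloudRendering"
    return kwargs

print(controller_kwargs(headless=True))   # {'platform': 'CloudRendering'}
print(controller_kwargs(headless=False))  # {}
```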

dtch1997 commented 1 year ago

thanks @xiaobaishu0097 , I tried that but it didn't work due to the following error:

[02/21 09:27:24 ERROR:] Uncaught exception: [system.py: 209]
Traceback (most recent call last):
  File "allenact/main.py", line 508, in <module>
    main()
  File "allenact/main.py", line 456, in main
    cfg, srcs = load_config(args)
  File "allenact/main.py", line 441, in load_config
    config = experiments[0](**config_kwargs)
TypeError: PointNavRoboThorRGBPPOExperimentConfig() takes no arguments
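The `TypeError` indicates that this particular experiment config's `__init__` takes no arguments, so `--config_kwargs` has nothing to bind to. A minimal sketch of a config class that would accept such kwargs (hypothetical class name, not the actual tutorial config):

```python
class ExperimentConfigSketch:
    # Accepting keyword arguments in __init__ is what lets
    # `--config_kwargs "{'headless': True}"` reach the config.
    def __init__(self, headless: bool = False, **kwargs):
        self.headless = headless
        self.extra = kwargs

cfg = ExperimentConfigSketch(**{"headless": True})
print(cfg.headless)  # True
```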

Actually, as far as I understand, I am running the checkpoint without visualization, so I do not understand why rendering is needed or would cause issues. @Lucaweihs, would you be able to comment on that?

Also, @jordis-ai2, I tried your previous suggestion, which didn't work. Are there some simple diagnostic commands I could run that would help us pinpoint the problem?

jordis-ai2 commented 1 year ago

Hi @dtch1997,

I assume there was no error message when you ran my last suggestion (despite no window showing up). I'd try looking at ~/.ai2thor/log/unity.log for more information, since it seems to me that the problem isn't directly related to AllenAct.

In any case, my general advice is to start by running a standalone AI2-THOR Controller and attempting to step with some actions (like "RotateRight" or "MoveAhead") while observing the returned frame (controller.last_event.frame). The first steps guide is a good starting point to check that your setup is correct in this sense. I would perhaps also try headless mode before moving to a full AllenAct experiment.

Let me know if that helps.