allenai / allenact

An open source framework for research in Embodied-AI from AI2.
https://www.allenact.org

Code is getting stuck after a few iterations #346

Open suvaansh opened 2 years ago

suvaansh commented 2 years ago

I am facing the following issues:

  1. The code seems to get stuck after running a few iterations.
  2. It prints a lot of output that I am not able to interpret.

I have attached a screenshot of the terminal output for reference. Nothing is being written to the TensorBoard files that are created, probably because the code is getting stuck.

P.S. I am running on a headless server. Please let me know if I need to make some changes in the configs for that.

System information:

command: CUDA_VISIBLE_DEVICES=4,5,6,7,8,9,10,11 PYTHONPATH=. python allenact/main.py -o storage/objectnav-robothor-rgb-clip-rn50 -b projects/objectnav_baselines/experiments/robothor/clip objectnav_robothor_rgb_clipresnet50gru_ddppo

gpus/cpus: machine has 88 cpus and 12 gpus. I am using 8 gpus.

linux: Linux version 5.4.0-99-generic (buildd@lcy02-amd64-045) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #112~18.04.1-Ubuntu SMP Thu Feb 3 14:09:57 UTC 2022

pytorch: '1.8.1+cu111'

[screenshot of terminal output]

Lucaweihs commented 2 years ago

Hi @suvaansh,

It looks to me like things are freezing during initialization of the AI2-THOR environment. Can you confirm that the following code runs for you? Note that you'll need to specify YOUR_X_DISPLAY_STRING in the code below. Given your list of visible CUDA devices, and assuming you used the sudo ai2thor-xorg start command, I suspect that "0.4" would work.

from ai2thor.controller import Controller

c = Controller(
  commit_id="bad5bc2b250615cb766ffb45d455c211329af17e",
  x_display=YOUR_X_DISPLAY_STRING
)
c.step("MoveAhead")

print(c.last_event.metadata["agent"])

# Prints
# {'name': 'agent', 'position': {'x': 0.25, 'y': 0.9009992480278015, 'z': -1.25}, 'rotation': {'x': -0.0, 'y': 90.0, 'z': 0.0}, 'cameraHorizon': -0.0, 'isStanding': True, 'inHighFrictionArea': False}
suvaansh commented 2 years ago

[screenshot of the script output]

Hi @Lucaweihs, sorry for the late response. The given script works on the machine, but the original bug still persists. I have also attached a screenshot of the output.

apoorvkh commented 2 years ago

I have transferred this issue to allenai/allenact, because the robothor-objectnav branch in allenai/embodied-clip (that you are training from) is exactly equivalent to allenact v0.5.0.

jordis-ai2 commented 2 years ago

Hi @suvaansh,

Just for a sanity check, can you try running with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7?

suvaansh commented 2 years ago

Hi @jordis-ai2 , I tried running with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. The code still freezes as before. I also tried using a single GPU, and that doesn't work either.

I forgot to mention one thing: I changed the default value of headless to True on this line. Otherwise it was throwing an error because the machine is headless. Please let me know if this could be the issue.

jordis-ai2 commented 2 years ago

May I also ask you to share the output of nvidia-smi and ps aux | grep Xorg after sudo ai2thor-xorg start?

suvaansh commented 2 years ago

This command gives the error sudo: ai2thor-xorg: command not found, so I used the startx.py script to start Xorg instead.

The requested outputs are attached: [screenshots of nvidia-smi and ps aux | grep Xorg]

jordis-ai2 commented 2 years ago

That looks good 👍 I would try starting training without headless mode, if that's an option, and would use a subset of the available x_displays in

https://github.com/allenai/allenact/blob/474fb84789fcdd3917fbc006365653971994e93f/projects/objectnav_baselines/experiments/objectnav_thor_base.py#L237

In my experience, the display and CUDA device orders are usually aligned, so my first try would be to select x_displays = x_displays[4:].
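
In code form, a minimal sketch of that change (the helper name and import path are assumptions about how the list at the linked line is built, not copied from the file):

from allenact_plugins.ithor_plugin.ithor_util import get_open_x_displays

# Sketch only: keep just the displays that line up with CUDA_VISIBLE_DEVICES=4,...,11,
# assuming displays and GPUs are ordered the same way.
x_displays = get_open_x_displays()  # e.g. ["0.0", "0.1", ..., "0.11"]
x_displays = x_displays[4:]         # drop the first four, matching GPUs 4-11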

Let me know if that helps.

brandontrabucco commented 2 years ago

Hi allenact team,

I'm having a similar problem when attempting to train an embodied-clip model using allenact.

I'm running the following command:

allenact -o rearrange_out -m 1 -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py

The code seems to get stuck when initializing the AI2-THOR controller. Below is the stack trace after I use CTRL+C to interrupt the script once it has gotten stuck; it shows the FIFO server blocking while trying to receive the first message.

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 302, in _task_sampling_loop_worker
    sp_vector_sampled_tasks = SingleProcessVectorSampledTasks(
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 843, in __init__
    self._vector_task_generators: List[Generator] = self._create_generators(
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1050, in _create_generators
    if next(generators[-1]) != "started":
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 905, in _task_sampling_loop_generator_fn
    task_sampler = make_sampler_fn(**sampler_fn_args)
  File "/home/ubuntu/embclip-rearrangement/baseline_configs/one_phase/one_phase_rgb_base.py", line 76, in make_sampler_fn
    return RearrangeTaskSampler.from_fixed_dataset(
  File "/home/ubuntu/embclip-rearrangement/rearrange/tasks.py", line 877, in from_fixed_dataset
    return cls(
  File "/home/ubuntu/embclip-rearrangement/rearrange/tasks.py", line 837, in __init__
    self.unshuffle_env = RearrangeTHOREnvironment(**rearrange_env_kwargs)
  File "/home/ubuntu/embclip-rearrangement/rearrange/environment.py", line 246, in __init__
    self.controller = self.create_controller()
  File "/home/ubuntu/embclip-rearrangement/rearrange/environment.py", line 260, in create_controller
    controller = ai2thor.controller.Controller(
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/ai2thor/controller.py", line 492, in __init__
    self.start(
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/ai2thor/controller.py", line 1299, in start
    self.last_event = self.server.receive()
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/ai2thor/fifo_server.py", line 182, in receive
    metadata, files = self._recv_message()
  File "/home/ubuntu/anaconda3/envs/embclip-rearrange/lib/python3.8/site-packages/ai2thor/fifo_server.py", line 103, in _recv_message
    self.server_pipe = open(self.server_pipe_path, "rb")
KeyboardInterrupt

I've started an X server using the method described above, and there is an Xorg process visible in nvidia-smi and in htop with arguments similar to those shown in the ps aux | grep Xorg output above.

Also, I've run the following snippet, and I get the same printed output that you said to expect:

from ai2thor.controller import Controller

c = Controller(
  commit_id="bad5bc2b250615cb766ffb45d455c211329af17e",
  x_display=YOUR_X_DISPLAY_STRING
)
c.step("MoveAhead")

print(c.last_event.metadata["agent"])

# Prints
# {'name': 'agent', 'position': {'x': 0.25, 'y': 0.9009992480278015, 'z': -1.25}, 'rotation': {'x': -0.0, 'y': 90.0, 'z': 0.0}, 'cameraHorizon': -0.0, 'isStanding': True, 'inHighFrictionArea': False}

Thanks for the help!

Lucaweihs commented 2 years ago

Hi @brandontrabucco,

Thanks for the bug report. I've seen this happen before for rearrangement: it can freeze on some machines when starting many AI2-THOR processes. Can you try lowering the number of processes you're starting during training and checking whether that fixes the issue? You can do this by editing the IL_PIPELINE_TYPE = "40proc" constant in the one_phase_rgb_clipresnet50_dagger.py file. This constant is passed to the il_training_params function in baseline_configs/one_phase/one_phase_rgb_il_base.py, so you'll also want to edit that function to include a new elif block, e.g. something like:

    elif label == "debug":
        lr = 3e-4
        num_train_processes = 1
        num_steps = 64
        dagger_steps = min(int(1e6), training_steps // 10)
        bc_tf1_steps = min(int(1e5), training_steps // 10)
        update_repeats = 3
        num_mini_batch = 1
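
With that block in place, the corresponding one-line change in one_phase_rgb_clipresnet50_dagger.py would then be along the lines of:

IL_PIPELINE_TYPE = "debug"  # was "40proc"; selects the new elif branch above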

Note that the existing code will try to use all of your GPUs by default (which will cause an error when training with only a single process as the above block does), so you'd also need to edit the GPU count (num_gpus) in the machine_params function of baseline_configs/rearrange_base.py.
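
For reference, a minimal sketch of that kind of edit (the variable name and the surrounding structure of machine_params are assumed here, not copied from rearrange_base.py):

import torch

# Sketch only: inside machine_params, cap the GPU count at 1 so it matches the
# single-process "debug" pipeline above, falling back to CPU if no GPU is visible.
num_gpus = min(1, torch.cuda.device_count()) if torch.cuda.is_available() else 0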