allenai / allenact

An open source framework for research in Embodied-AI from AI2.
https://www.allenact.org

Training got stuck after creating Vector Sample Tasks #309

Closed npmhung closed 2 years ago

npmhung commented 3 years ago

As mentioned in the title, when I tried to run the following command:

python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnetgru_ddppo.py

the training process just got stuck at the following step forever:

image

This never happens on my personal desktop.

I couldn't figure out what the potential problem is.

Server configuration: Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz, 48 CPU(s), 2x Tesla K80.

Lucaweihs commented 3 years ago

Hi @npmhung,

Just to double check, can you try starting AI2-THOR instances on each of your x-displays and confirming that everything works as expected:

from ai2thor.controller import Controller
c = Controller(x_display="0.0")
c.step("RotateRight")
print(f"For display 0.0, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()

c = Controller(x_display="0.1")
c.step("RotateRight")
print(f"For display 0.1, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()  # shut down the second instance as well

The above should print success messages for both displays. If that doesn't work then you should double check that you have started the x-display (this can be done by running sudo ai2thor-xorg start).

If the above isn't the problem, can you try reducing the number of training processes to 1 and seeing if training still hangs?

One last thing: assuming your questions were answered for issue #308, can you close it?

npmhung commented 3 years ago

I got the following message:

For display 0.0, action was successful: True.
For display 0.1, action was successful: True.

npmhung commented 3 years ago

Yes, it still hangs with 1 process.

Lucaweihs commented 3 years ago

Ok great. In the image you linked, there is a key/value pair with the key "env_args". Can you copy that dictionary into a new variable and try the following:

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

Lucaweihs commented 3 years ago

I did a bit of debugging and it seems that AI2-THOR does not like it when you use odd integers in the height/width of the window. Can you change the window size from 300x225 to 304x228 and try again?

npmhung commented 3 years ago

Ok, I will try that.

Also, FYI, the line "env = RoboThorEnvironment(**env_args)" hangs on my server.

npmhung commented 3 years ago

I tried as you suggested. However, changing the resolution didn't help at all in my case.

Lucaweihs commented 3 years ago

Can you double check that this is also the case with env = RoboThorEnvironment(**env_args), i.e.

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env_args["height"] = 228
env_args["width"] = 304
env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

npmhung commented 3 years ago

This line "env = RoboThorEnvironment(**env_args)" still hangs on my side.

Lucaweihs commented 3 years ago

The env = RoboThorEnvironment(**env_args) call definitely seems to be the issue. Can you confirm that:

env = RoboThorEnvironment()
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

works?

If so, could you try the following:

new_env_args = {}
for key, val in env_args.items():
    print(f"Trying to add key {key} with value {val}")
    new_env_args[key] = val
    env = RoboThorEnvironment(**new_env_args)
    env.step(action="RotateRight")
    assert env.last_event.metadata['lastActionSuccess']

    print(f"Env successfully started with env_args == {new_env_args}")
    env.stop()

This should eventually hang and tell us what is causing the issue.
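
As a side note, the same key-by-key search can be automated so that a hang is reported rather than blocking the terminal. The following is only a minimal sketch and not part of AllenAct or AI2-THOR: `first_hanging_key`, `make_env`, and `timeout_s` are hypothetical names, and it assumes that a call which does not return within the timeout counts as hung.

```python
import threading


def first_hanging_key(make_env, env_args, timeout_s=60.0):
    """Add keys one at a time and return the first key after which
    make_env(**kwargs) fails to return within timeout_s seconds.
    Returns None if every attempt finishes in time."""
    trial_args = {}
    for key, val in env_args.items():
        trial_args[key] = val
        done = threading.Event()
        errors = []

        # Default arguments snapshot the current state for this attempt.
        def attempt(kwargs=dict(trial_args), done=done, errors=errors):
            try:
                make_env(**kwargs)
            except Exception as exc:  # a crash is not a hang; report it separately
                errors.append(exc)
            finally:
                done.set()

        # Daemon thread: a hung attempt will not keep the interpreter alive.
        threading.Thread(target=attempt, daemon=True).start()
        if not done.wait(timeout_s):
            return key
        if errors:
            raise errors[0]
    return None
```

Here `make_env` would be a small function that builds `RoboThorEnvironment(**kwargs)`, takes one step, and stops the environment. Threads cannot be killed, so a hung attempt lingers until the interpreter exits; running each attempt in a separate process would likely be more robust.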

npmhung commented 3 years ago

I ran your code, and the key commit_id is causing the issue. The code only finished successfully after I removed that key.

Lucaweihs commented 3 years ago

I see. This commit id determines which AI2-THOR build is used. It's unlikely but possible that this file was corrupted while downloading. Can you try just starting an AI2-THOR controller with this commit id:

from ai2thor.controller import Controller
Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

and seeing if that hangs for you?

If so, I'd suggest deleting the

thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.lock

directory/file within the ~/.ai2thor/releases directory and then trying to run Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e") again (it should download a ~500 MB file before starting).

npmhung commented 3 years ago

It couldn't start downloading and got stuck, even after I deleted those two files.

Could it be that the server's firewall is preventing me from downloading?

npmhung commented 3 years ago

I can try to upload those files from my desktop if that could help?

Lucaweihs commented 3 years ago

It's possible that the firewall is causing issues. Here's a potential workaround:

cd ~/.ai2thor/releases
wget http://s3-us-west-2.amazonaws.com/ai2-thor-public/builds/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
unzip thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip -d thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e

This should manually download the files and unzip them into the correct location. If you can't download the zip file with wget then you'll probably have to download it locally and upload it.

Lucaweihs commented 3 years ago

You might also want to check the md5 hash of the zip file to make sure it was downloaded correctly:

$ md5sum thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
dfc4dc0f7bfdb2254221ae35fc712363  thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
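
If md5sum is not available on the machine, the same check can be done from Python. This is a minimal sketch using only the standard hashlib module; `md5_of_file` is a hypothetical helper name, and the file is read in chunks so the roughly 500 MB zip is never held in memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the build above, `md5_of_file("thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip")` should equal `"dfc4dc0f7bfdb2254221ae35fc712363"`.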

npmhung commented 3 years ago

I downloaded the file and checked the md5 hash. The data is correct, but the Controller still couldn't load it.

Lucaweihs commented 3 years ago

This is quite odd, especially as you're able to successfully run other builds. Can you try running

cd ~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e  -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300

and then logging into the server from another terminal window and checking that

nvidia-smi

shows around 34 MiB of memory being used by the AI2-THOR Unity process? Here's what it looks like for me:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4624      G   /usr/lib/xorg/Xorg                 58MiB |
|    0   N/A  N/A     14800      G   ...b766ffb45d455c211329af17e       34MiB |

The process named ...b766ffb45d455c211329af17e is the Unity process.

@ekolve - would you have any ideas what might be causing the issue?

ekolve commented 3 years ago

One thing you could try checking is whether there are any other Python processes running that may have ai2thor loaded. There are a few places where we take exclusive locks to ensure that only one process downloads a release, as well as when we prune releases. The block of code that gets locked is very small and should never fail or hang, but it's worth checking. If there are other processes, try killing them. Also, could you report what version of ai2thor you are running by running this:

import ai2thor
print(ai2thor.__version__)
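
Regarding the exclusive locks mentioned above, one rough way to see whether some other process is holding a lock file is to try to take the lock without blocking. This is only a sketch under a loud assumption: the thread does not say which locking primitive ai2thor uses, so the code assumes an flock-style advisory lock on Linux, and `lock_is_held` is a hypothetical helper name. If the library uses a different mechanism, this check will not detect it.

```python
import fcntl
import os


def lock_is_held(lock_path):
    """Return True if an exclusive flock on lock_path is currently held
    through another open file description (assumes flock-style locking)."""
    if not os.path.exists(lock_path):
        return False
    with open(lock_path, "rb") as f:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # We acquired the lock, so nobody else held it; release it again.
        fcntl.flock(f, fcntl.LOCK_UN)
        return False
```

The path to pass in would be the .lock file under ~/.ai2thor/releases mentioned earlier in the thread.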

npmhung commented 3 years ago

@Lucaweihs The process started by the DISPLAY command shows up in nvidia-smi, just as in the output you showed me.

@ekolve I'm using ai2thor 3.3.5

Lucaweihs commented 3 years ago

This is a bit of an extreme measure, but could you:

  1. Delete the ~/.ai2thor directory.
  2. Delete and then recreate the virtual environment you're using, passing the --no-cache-dir option when you pip install ai2thor.
  3. Run
    from ai2thor.controller import Controller
    Controller(x_display="0.0", commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

    and check whether it hangs.

My best guess (suggested by @jordis-ai2) is that the initial call to use a window display with height 225 is being cached somewhere and is causing the problem. If the above still hangs, can you try running

from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="dd25cb479958e915e2ed1282062345b0f81dc4e2")

to see if that also hangs?

ekolve commented 3 years ago

In addition to @Lucaweihs's instructions, if it does hang, press CTRL-C to interrupt the process. You should then get a Python stack trace showing where the process was hung.
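
If pressing CTRL-C is awkward (for example when the script runs in the background), Python's standard faulthandler module can dump the stack automatically after a timeout. This is a minimal sketch; `run_with_hang_dump` is a hypothetical helper name and the timeout value is up to you.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout_s, log_file=sys.stderr):
    """Call fn(); if it has not returned after timeout_s seconds,
    write the tracebacks of all threads to log_file and keep waiting."""
    faulthandler.dump_traceback_later(timeout_s, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog whether fn returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

For example, wrapping the Controller(...) call from above in a function and passing it as `fn` with `timeout_s=120` should print where the call is stuck after two minutes.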

Lucaweihs commented 2 years ago

I'm going to close this issue. Please feel free to reopen if you're still having trouble.