IndexError while evaluating baseline social nav

RAraghavarora commented 8 months ago

Habitat-Lab and Habitat-Sim versions

Habitat-Lab: v0.3.0

Habitat-Sim: v0.3.0

Docs and Tutorials

Did you read the docs? https://aihabitat.org/docs/habitat-lab/

Yes

Did you check out the tutorials? https://aihabitat.org/tutorial/2020/

Yes

❓ Questions and Help

I trained the social_nav baseline from habitat_baselines with the following command:

srun python -u -m habitat_baselines.run --config-name=social_nav/social_nav.yaml

It trained for 3 days before the job reached the time limit. TensorBoard shows only time series and scalars, with no images.

To evaluate it, I ran the following:

srun python -u -m habitat_baselines.run --config-name=social_nav/social_nav.yaml habitat_baselines.evaluate=True habitat_baselines.eval_ckpt_path_dir=data/checkpoints/latest.pth habitat_baselines.eval.should_load_ckpt=True

I get the following error:

Error executing job with overrides: ['habitat_baselines.evaluate=True', 'habitat_baselines.eval_ckpt_path_dir=data/checkpoints/latest.pth', 'habitat_baselines.eval.should_load_ckpt=True']
Traceback (most recent call last):
  File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 31, in main
    execute_exp(cfg, "eval" if cfg.habitat_baselines.evaluate else "train")
  File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 62, in execute_exp
    trainer.eval()
  File "habitat-lab/habitat-baselines/habitat_baselines/common/base_trainer.py", line 129, in eval
    self._eval_checkpoint(
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 889, in _eval_checkpoint
    evaluator.evaluate_agent(
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/habitat_evaluator.py", line 89, in evaluate_agent
    rgb_frames: List[List[np.ndarray]] = [
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/habitat_evaluator.py", line 92, in <listcomp>
    {k: v[env_idx] for k, v in batch.items()}, {}
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/habitat_evaluator.py", line 92, in <dictcomp>
    {k: v[env_idx] for k, v in batch.items()}, {}
IndexError: index 12 is out of bounds for dimension 0 with size 12

RAraghavarora commented 8 months ago

When I try to train the baseline with any other config, I get the following traceback:

Error executing job with overrides: []
Traceback (most recent call last):
  File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 31, in main
    execute_exp(cfg, "eval" if cfg.habitat_baselines.evaluate else "train")
  File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 60, in execute_exp
    trainer.train()
  File "miniconda3/envs/habitat/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 664, in train
    self._init_train(resume_state)
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 245, in _init_train
    self._init_envs()
  File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 143, in _init_envs
    self.envs = env_factory.construct_envs(
  File "habitat-lab/habitat-baselines/habitat_baselines/common/habitat_env_factory.py", line 111, in construct_envs
    envs = vector_env_cls(
  File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 207, in __init__
    self.observation_spaces = [
  File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 208, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 110, in __call__
    res = self.read_fn()
  File "habitat-lab/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Jinghan11 commented 7 months ago

Hi

Did you solve this problem? I encountered the same IndexError when evaluating the training result of social_nav.

RAraghavarora commented 7 months ago

@Jinghan11 No, I still haven't found a solution 😞

jimmytyyang commented 7 months ago

Hi @RAraghavarora and @Jinghan11! Thank you so much for the question. We recently updated the README for training and evaluating social nav skills, and we have released the checkpoint. The error "IndexError: index 12 is out of bounds for dimension 0 with size 12" seems to be caused by evaluating a checkpoint with more than 1 GPU and more than 1 ntasks-per-node. Could you please reduce both to 1 and try again?

#SBATCH --gres gpu:1
#SBATCH --cpus-per-task 10
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --mem-per-cpu=6GB

Thank you, and please let us know if you have any further questions.
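
A quick way to confirm that an evaluation job really runs with one task and one GPU is to inspect the environment from inside the job. The following is a minimal sketch, not part of Habitat: the helper name `eval_launch_warnings` is hypothetical, and it assumes Slurm exports `SLURM_NTASKS` and sets `CUDA_VISIBLE_DEVICES` for the allocated GPUs, which may not hold on every cluster.

```python
import os
from typing import List, Mapping, Optional


def eval_launch_warnings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return warnings if the job appears to use more than 1 task or 1 GPU.

    Hypothetical helper for illustration; missing variables are treated
    as "unknown" and produce no warning.
    """
    env = os.environ if env is None else env
    warnings: List[str] = []

    # Assumption: Slurm exports the total task count as SLURM_NTASKS.
    ntasks = env.get("SLURM_NTASKS")
    if ntasks is not None and ntasks.isdigit() and int(ntasks) > 1:
        warnings.append(f"SLURM_NTASKS={ntasks}; evaluation expects 1 task")

    # Assumption: visible GPUs are listed as a comma-separated string.
    visible = [g for g in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if g.strip()]
    if len(visible) > 1:
        warnings.append(f"{len(visible)} GPUs visible; evaluation expects 1 GPU")

    return warnings


if __name__ == "__main__":
    for message in eval_launch_warnings():
        print("WARNING:", message)
```

Running this at the start of the job (before `habitat_baselines.run`) prints nothing when the allocation matches the settings above.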

Zeying-Gong commented 6 months ago

I also encountered this issue. After several attempts, I fixed it by simply changing the value of num_environments from 18 to 12 in habitat-lab/habitat-baselines/habitat_baselines/config/social_nav/social_nav.yaml. The surface-level problem seems to be in habitat-lab/habitat-baselines/habitat_baselines/common/habitat_env_factory.py, which reduces the number of environments to match the number of scenes. This is evident from the initial output message:

There are less scenes (12) than environments (18). Reducing the number of environments to be the number of scenes
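
The mismatch can be illustrated with a small standalone sketch. This is not the Habitat code: the function names and the observation keys (`depth`, `goal`) are made up for illustration, and it assumes the evaluator loops over the configured environment count while the batch holds only one row per environment that was actually created. The thread does not confirm that this is the exact mechanism.

```python
from typing import Dict, List


def effective_num_envs(configured_envs: int, num_scenes: int) -> int:
    """Mimic the clamp in the log message: never more environments than scenes."""
    return min(configured_envs, num_scenes)


def split_batch(batch: Dict[str, List[float]], num_envs: int) -> List[Dict[str, float]]:
    """Build one observation dict per environment, like the failing comprehension."""
    return [{k: v[env_idx] for k, v in batch.items()} for env_idx in range(num_envs)]


configured_envs = 18  # original num_environments in social_nav.yaml
num_scenes = 12       # scene count reported in the log message
actual_envs = effective_num_envs(configured_envs, num_scenes)

# The batch has one row per environment that really exists (12, not 18).
batch = {"depth": [0.0] * actual_envs, "goal": [1.0] * actual_envs}

try:
    split_batch(batch, configured_envs)  # indexes 0..17, fails at index 12
    failed = False
except IndexError:
    failed = True

per_env = split_batch(batch, actual_envs)  # indexes 0..11, succeeds
print(failed, len(per_env))
```

Setting num_environments to 12 makes the configured count equal to the actual count, so the two can no longer disagree. The same value can presumably also be passed as a command-line override (habitat_baselines.num_environments=12), assuming the key sits under habitat_baselines as the other overrides in this thread do.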

I noticed that the same change has been made officially in the main branch. Thank you for pointing this out, @jimmytyyang. The checkpoint also works for me.
