Open RAraghavarora opened 8 months ago
When I try training the baseline for any other config, I get the following error traceback:
Error executing job with overrides: []
Traceback (most recent call last):
File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 31, in main
execute_exp(cfg, "eval" if cfg.habitat_baselines.evaluate else "train")
File "habitat-lab/habitat-baselines/habitat_baselines/run.py", line 60, in execute_exp
trainer.train()
File "miniconda3/envs/habitat/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 664, in train
self._init_train(resume_state)
File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 245, in _init_train
self._init_envs()
File "habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 143, in _init_envs
self.envs = env_factory.construct_envs(
File "habitat-lab/habitat-baselines/habitat_baselines/common/habitat_env_factory.py", line 111, in construct_envs
envs = vector_env_cls(
File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 207, in __init__
self.observation_spaces = [
File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 208, in <listcomp>
read_fn() for read_fn in self._connection_read_fns
File "habitat-lab/habitat-lab/habitat/core/vector_env.py", line 110, in __call__
res = self.read_fn()
File "habitat-lab/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
buf = self.recv_bytes()
File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "miniconda3/envs/habitat/lib/python3.9/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Hi
Did you solve this problem? I encountered the same IndexError when evaluating the training result of social_nav.
@Jinghan11 No, I still haven't found a solution 😞
Hi @RAraghavarora and @Jinghan11! Thank you so much for the question. We recently updated the README for training and evaluating social nav skills, and we have released the checkpoint. The error "IndexError: index 12 is out of bounds for dimension 0 with size 12" seems to come from evaluating a checkpoint with more than one GPU and more than one task per node (ntasks-per-node). Could you please reduce both to 1 and try again?
#SBATCH --gres gpu:1
#SBATCH --cpus-per-task 10
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --mem-per-cpu=6GB
Thank you, and please let us know if you have any further questions.
I also encountered this issue. After several attempts, I fixed it by changing the value of num_environments from 18 to 12 in habitat-lab/habitat-baselines/habitat_baselines/config/social_nav/social_nav.yaml. The surface-level problem seems to be in habitat-lab/habitat-baselines/habitat_baselines/common/habitat_env_factory.py, which adjusts the number of environments to match the number of scenes. This is evident from the initial output message:
There are less scenes (12) than environments (18). Reducing the number of environments to be the number of scenes
I see that the same change has since been made officially in the main branch. Thank you for pointing this out, @jimmytyyang; the checkpoint also works for me.
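For readers who want to see how the environment-count adjustment described in this comment could lead to an out-of-bounds index, here is a minimal sketch in plain Python. It is not the code from habitat_env_factory.py: the helper name clamp_num_environments and the per-environment buffer are hypothetical, and the idea that some code keeps using the configured count (18) after only 12 environments were created is an assumption, not a confirmed diagnosis.

```python
def clamp_num_environments(num_environments: int, num_scenes: int) -> int:
    """Hypothetical helper: return how many environments actually get created."""
    if num_scenes < num_environments:
        # Same wording as the log message quoted in this thread.
        print(
            f"There are less scenes ({num_scenes}) than environments "
            f"({num_environments}). Reducing the number of environments "
            "to be the number of scenes"
        )
        return num_scenes
    return num_environments


configured = 18  # num_environments value reported in this thread
actual = clamp_num_environments(configured, num_scenes=12)

# Hypothetical buffer holding one slot per environment that was actually created.
per_env_buffer = [0.0] * actual

# Assumption: code that still loops over the configured count would step past
# the end of the buffer at the first index that is >= its length.
first_bad_index = next(i for i in range(configured) if i >= len(per_env_buffer))
```

Under these assumptions the first invalid index equals the number of scenes, which is consistent with the "index 12 ... with size 12" wording of the reported error. Setting num_environments to a value no larger than the number of scenes avoids the mismatch in this sketch.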
Habitat-Lab and Habitat-Sim versions
Habitat-Lab: v0.3.0
Habitat-Sim: v0.3.0
Habitat is under active development, and we advise users to restrict themselves to stable releases. Are you using the latest release versions of Habitat-Lab and Habitat-Sim? Your question may already be addressed in the latest versions. We may also not be able to help with problems in earlier versions because they sometimes lack the more verbose logging needed for debugging.
Master branch contains 'bleeding edge' code and should be used at your own risk.
Docs and Tutorials
Did you read the docs? https://aihabitat.org/docs/habitat-lab/
Yes
Did you check out the tutorials? https://aihabitat.org/tutorial/2020/
Yes
Perhaps your question is answered there. If not, carry on!
❓ Questions and Help
I trained the habitat_baseline for social_nav with the following command:
srun python -u -m habitat_baselines.run --config-name=social_nav/social_nav.yaml
It trained for 3 days before the job reached the time limit. TensorBoard only shows time_series and scalars (and no images).
When trying to evaluate it, I run the following:
srun python -u -m habitat_baselines.run --config-name=social_nav/social_nav.yaml habitat_baselines.evaluate=True habitat_baselines.eval_ckpt_path_dir=data/checkpoints/latest.pth habitat_baselines.eval.should_load_ckpt=True
I get the following error: