Why does VQA model evaluation fall into an infinite loop?

TopCoder2K commented 2 years ago

Habitat-Lab and Habitat-Sim versions

Habitat-Lab: v0.2.1 (stable)

Habitat-Sim: v0.2.1

❓ Questions and Help

Why does habitat_baselines.utils.common.poll_checkpoint_folder return None once it has gone through all the checkpoints? When None is returned, BaseTrainer.eval() falls into an infinite loop. Is this the expected behavior? Or should habitat_baselines/config/eqa/il_vqa.yaml contain EVAL_CKPT_PATH_DIR: "data/eqa/vqa/checkpoints/epoch_50.ckpt" instead of EVAL_CKPT_PATH_DIR: "data/eqa/vqa/checkpoints/"?

UPD 01/28/2022: Hmm, it seems I've understood the idea: we want to evaluate checkpoints as they are created! But maybe it's worth pointing this out in the README, because I was at a loss when I first came across the infinite loop. I would close the issue, but the question below is still unresolved.
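
For readers who hit the same loop, here is a minimal sketch of the polling pattern described above. It is not the Habitat-Lab source: the names next_checkpoint, eval_all and max_idle_polls are hypothetical, and sorting checkpoints by modification time is an assumption. It only shows why an evaluator that waits for new checkpoints never returns on its own once the folder stops changing.

```python
import glob
import os
import time


def next_checkpoint(checkpoint_folder, previous_ckpt_ind):
    """Return the checkpoint that follows previous_ckpt_ind, or None if it does not exist yet."""
    paths = [
        p
        for p in glob.glob(os.path.join(checkpoint_folder, "*"))
        if os.path.isfile(p)
    ]
    paths.sort(key=os.path.getmtime)  # oldest first (assumed ordering)
    next_ind = previous_ckpt_ind + 1
    if next_ind < len(paths):
        return paths[next_ind]
    return None  # every existing checkpoint has already been handed out


def eval_all(checkpoint_folder, evaluate, poll_interval=2.0, max_idle_polls=None):
    """Evaluate checkpoints as they appear.

    With max_idle_polls=None the loop waits forever for new files, which is
    the behavior reported in this issue. The bound is a hypothetical addition.
    """
    prev_ckpt_ind = -1
    idle_polls = 0
    evaluated = []
    while True:
        ckpt = next_checkpoint(checkpoint_folder, prev_ckpt_ind)
        if ckpt is None:
            idle_polls += 1
            if max_idle_polls is not None and idle_polls >= max_idle_polls:
                return evaluated
            time.sleep(poll_interval)
            continue
        idle_polls = 0
        prev_ckpt_ind += 1
        evaluate(ckpt)
        evaluated.append(ckpt)
```

Calling eval_all(folder, fn, max_idle_polls=1) stops after the first empty poll. Leaving max_idle_polls unset keeps waiting, which suits evaluation that runs alongside training.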

TopCoder2K commented 2 years ago

Also, I have an error when evaluating the nav module with python -u habitat_baselines/run.py --exp-config habitat_baselines/config/eqa/il_pacman_nav.yaml --run-type eval:

  File "/home/svyatoslav/anaconda3/envs/habitat/lib/python3.6/site-packages/torch-1.10.0-py3.6-linux-x86_64.egg/torch/utils/tensorboard/summary.py", line 490, in make_video
    clip.write_gif(filename, verbose=False, progress_bar=False)
TypeError: write_gif() got an unexpected keyword argument 'verbose'

It seems to be because of

moviepy                   2.0.0.dev2               pypi_0    pypi
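
One way to confirm that the installed moviepy is the cause is to check whether its write_gif still accepts the verbose keyword that torch's summary.py passes. The sketch below uses only the standard inspect module. The helper accepts_keyword is hypothetical, and the moviepy import path in the comment is an assumption that may differ between moviepy versions.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


# Possible usage (import path is an assumption, adjust to your moviepy version):
#   from moviepy.editor import VideoClip
#   print(accepts_keyword(VideoClip.write_gif, "verbose"))
```

If the check returns False for "verbose", the TypeError above is expected with that moviepy build.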

When I installed moviepy=1.0.1, the error seemed to disappear. But now I'm getting another error:

Traceback (most recent call last):
  File "habitat_baselines/run.py", line 85, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 81, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 66, in execute_exp
    trainer.eval()
  File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/common/base_trainer.py", line 129, in eval
    checkpoint_index=prev_ckpt_ind,
  File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/il/trainers/pacman_trainer.py", line 423, in _eval_checkpoint
    config.IL.NAV.max_controller_actions,
  File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/il/data/nav_data.py", line 237, in get_hierarchical_features_till_spawn
    raw_img_feats[target_pos_idx].copy()
IndexError: index 94 is out of bounds for axis 0 with size 93
I0124 10:37:28.954592 10896 Simulator.cpp:54] Deconstructing Simulator

How can I fix it?

dhruvbatra commented 2 years ago

CC: @mukulkhanna

mathfac commented 2 years ago

@TopCoder2K

The EQA IL part is contributed and maintained by @mukulkhanna. It may be slightly out of date with respect to recent changes and library versions, because we have no Matterport3D scenes on our Continuous Integration machines to test this part of the code.

UPD 01/28/2022: Hmm, it seems I've understood the idea: we want to evaluate checkpoints as they are created! But maybe it's worth pointing this out in the README, because I was at a loss when I first came across the infinite loop. I would close the issue, but the question below is still unresolved.

Correct, that is the most common use case: the validation curve has to be produced during training, but without pausing the training itself. Would you mind sending a PR with the clarification you would like to see? Thank you!
Freezing moviepy=1.0.1 in the dependencies sounds like a good idea, and it would be great to have it as a PR.
For some reason, len(raw_img_feats) is less than len(actions) or backtrack_steps < 0. You will probably need to add extra logging to find out which of these is causing the error.
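
The extra logging suggested above could look like the following sketch. The function describe_index_problem and its parameters are hypothetical. It does not reproduce the logic of get_hierarchical_features_till_spawn and only reports which of the suspected conditions holds for a given episode.

```python
import logging

logger = logging.getLogger("nav_data_debug")  # hypothetical logger name


def describe_index_problem(num_feats, num_actions, backtrack_steps, target_pos_idx):
    """Log and return a list of the suspicious conditions that hold."""
    problems = []
    if num_feats < num_actions:
        problems.append(
            "len(raw_img_feats)={} is less than len(actions)={}".format(
                num_feats, num_actions
            )
        )
    if backtrack_steps < 0:
        problems.append("backtrack_steps={} is negative".format(backtrack_steps))
    if not 0 <= target_pos_idx < num_feats:
        problems.append(
            "target_pos_idx={} is out of bounds for {} features".format(
                target_pos_idx, num_feats
            )
        )
    for message in problems:
        logger.warning(message)
    return problems
```

Calling it just before the failing raw_img_feats[target_pos_idx] lookup, with the lengths taken from the actual arrays, would show which assumption is violated for the episode that crashes.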

TopCoder2K commented 2 years ago

@mathfac Thank you for the detailed response!

Hmm, the README.md looks really good as it is, so maybe it's enough for this explanation to stay here in the issues. Or maybe it's worth adding a footnote to the 'Eval' section. What do you think?
I also want to ask whether you have run into the problem of missing Cython, pkgconfig and h5py. I had to install them before pip install -r requirements.txt would finish without errors. Should they be added to the requirements as well? And should I commit this to a separate branch (such as il_fixes), or can I just commit to main and open a PR?
I'll try to find the cause, as I need this functionality in my research :) If you have any other ideas, please share them!
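
A quick preflight check for the packages mentioned above can be sketched with the standard importlib module. The helper missing_modules is hypothetical, and the import names Cython, pkgconfig and h5py are assumed to match the package names on PyPI.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Possible usage before running `pip install -r requirements.txt`:
#   print(missing_modules(["Cython", "pkgconfig", "h5py"]))
```

An empty list means all three are already importable in the current environment.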

facebookresearch / habitat-lab

Why does VQA model evaluation fall into an infinite loop? #796
