Jbwasse2 / XGX

https://xgxvisnav.github.io/

Training Issues and Training Resources #3

yusirhhh opened 7 months ago

yusirhhh commented 7 months ago

Thank you for your great work. I'm interested in reproducing your results. I ran:

python run.py --exp-config ./configs/experiments/XGX.yaml --run-type train 

However, I encountered the following error:

"File "/mnt/data/mmyu/objectNav/XGX/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 596, in _update_agent self.rollouts, self.config.LMN_LOSS_IGNORE File "/mnt/data/mmyu/objectNav/XGX/pirlnav/algos/ppo.py", line 110, in update for batch in data_generator: File "/mnt/data/mmyu/objectNav/XGX/habitat-lab/habitat_baselines/common/rollout_storage.py", line 207, in recurrent_generator num_environments, num_mini_batch AssertionError:  Trainer requires the number of environments (1) to be greater than or equal to the number of trainer mini batches (2).

Could you provide some suggestions?

I noticed in your paper that you used 64 GPUs to train the model, but I only have a server with 8 A6000 GPUs. Can I still train the model with 8 A6000 GPUs?

Jbwasse2 commented 7 months ago

Regarding the error

In XGX.yaml, try changing

NUM_ENVIRONMENTS: 1

to

NUM_ENVIRONMENTS: 2
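
For context, the assertion comes from recurrent_generator in habitat_baselines/common/rollout_storage.py: PPO splits the rollouts from the parallel environments into mini-batch groups, so there must be at least one environment per mini-batch. Below is a minimal, self-contained sketch of just that check, paraphrased from your error message rather than copied from habitat-lab:

# Simplified sketch of the constraint behind the AssertionError; the real
# recurrent_generator also builds and yields the recurrent mini-batches.
def check_mini_batch_config(num_environments: int, num_mini_batch: int) -> None:
    assert num_environments >= num_mini_batch, (
        "Trainer requires the number of environments ({}) to be greater "
        "than or equal to the number of trainer mini batches ({}).".format(
            num_environments, num_mini_batch
        )
    )

check_mini_batch_config(num_environments=2, num_mini_batch=2)  # passes after the fix
try:
    check_mini_batch_config(num_environments=1, num_mini_batch=2)  # the failing setup
except AssertionError as err:
    print(err)

Alternatively, if you want to keep NUM_ENVIRONMENTS: 1, you could instead lower the mini-batch count so it does not exceed the number of environments (RL.PPO.num_mini_batch, assuming this repo follows habitat-baselines' config layout).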

Regarding your question

It should still train, but it will be much slower.

yusirhhh commented 7 months ago

Thank you for your response. Could you please provide instructions for training the model from scratch, covering both the imitation learning (IL) stage and the reinforcement learning (RL) fine-tuning stage? I aim to reproduce your results.

Regarding the training time: how long did training take on the 64 GPUs?

Jbwasse2 commented 6 months ago

The details of IL pretraining and fine-tuning with RL can be found in habitat-web and PIRLNav, respectively.

It took me about 24 hours to train.
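
For your 8-GPU setup, a back-of-envelope estimate, assuming near-linear scaling and comparable per-GPU throughput (both assumptions; distributed PPO scaling is typically sublinear due to synchronization overhead):

# Hypothetical estimate only, not a measured number.
hours_on_64_gpus = 24
scale_factor = 64 / 8
print(f"~{hours_on_64_gpus * scale_factor:.0f} hours on 8 GPUs")  # ~192 hours (~8 days)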