GengzeZhou / NavGPT-2

[ECCV 2024] Official implementation of NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
MIT License

An error occurred while running `run_r2r_xl.sh`. #6

Closed: yds3 closed this issue 3 weeks ago

yds3 commented 1 month ago

Hello, your paper is excellent, but when I ran the script, the code encountered an error, as shown in the attached image. Could you please advise on how to resolve this issue? [Attached image: 微信图片_20241015123201]

GengzeZhou commented 1 month ago

Thanks for raising this issue. Did you encounter the problem after the first iteration or after a few iterations?

yds3 commented 1 month ago

Thank you for your reply. I followed the instructions in the readme file and skipped Stage 1: Visual Instruction Tuning of NavGPT-2, proceeding directly to Stage 2: Policy Finetuning of NavGPT-2. However, when I ran run_r2r_xl.sh, I encountered the error right at the start, which I believe occurred during the first iteration.

Additionally, could I trouble you to let me know the memory size of a single A100 GPU? When I tried to run run_r2r_xxl.sh on a 16GB GPU, I received an out-of-memory error.

GengzeZhou commented 1 month ago

I will debug and fix the problem later. It may have been introduced when I refined the codebase for release.

Here are some hints for resolving the issue. The problem can occur when the model weights are loaded in float16 but training runs in float32: the weights can overflow after the first update. Checking the model's weights right after loading, or casting them directly to full precision, might help. Otherwise, check the input data and make sure the input features do not contain NaN values.
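As a rough illustration of the checks suggested above, here is a minimal PyTorch sketch. It is not code from the NavGPT-2 repository: the helper name `report_weight_issues` is hypothetical, and the small `nn.Linear` only stands in for the loaded model.

```python
import torch
from torch import nn


def report_weight_issues(model: nn.Module) -> dict:
    """Count parameters per dtype and list parameters holding NaN/Inf values."""
    dtypes = {}
    non_finite = []
    for name, param in model.named_parameters():
        key = str(param.dtype)
        dtypes[key] = dtypes.get(key, 0) + 1
        # Cast to float32 for the check so it behaves the same for any dtype.
        if not torch.isfinite(param.detach().float()).all():
            non_finite.append(name)
    return {"dtypes": dtypes, "non_finite": non_finite}


# Stand-in for the loaded model (assumption); replace with the real one.
model = nn.Linear(4, 2).half()

report = report_weight_issues(model)
print(report)

# Cast all floating-point parameters and buffers to full precision.
model = model.float()
```

The same `torch.isfinite(...).all()` test can be applied to an input feature tensor before it is passed to the model, to rule out NaN values in the data.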

For training the xxl model we used an 80GB A100 GPU, but it can also be done on a single 40GB A100 by setting accumulate_gradient_steps or by simply removing the very long trajectories from the training set.
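The second option, dropping very long trajectories, could look roughly like the sketch below. It assumes each training episode is a dict with a `path` list of viewpoints. The function name `drop_long_trajectories` and the `max_steps` threshold are hypothetical, not part of the repository.

```python
def drop_long_trajectories(episodes, max_steps):
    """Keep only episodes whose path has at most max_steps viewpoints."""
    return [ep for ep in episodes if len(ep["path"]) <= max_steps]


# Toy data standing in for the training set (assumption).
episodes = [
    {"path_id": 1, "path": ["a", "b", "c"]},
    {"path_id": 2, "path": ["a", "b", "c", "d", "e", "f", "g", "h"]},
]

kept = drop_long_trajectories(episodes, max_steps=6)
print([ep["path_id"] for ep in kept])
```

A suitable threshold depends on the available GPU memory and would need to be found by experiment.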

yds3 commented 1 month ago

Thank you very much for your reply.