yds3 closed this 3 weeks ago
Thanks for raising this. Did you encounter the problem on the first iteration, or only after a few iterations?
Thank you for your reply. I followed the instructions in the readme file and skipped Stage 1: Visual Instruction Tuning of NavGPT-2, proceeding directly to Stage 2: Policy Finetuning of NavGPT-2. However, when I ran run_r2r_xl.sh, I encountered the error right at the start, which I believe occurred during the first iteration.
Additionally, could I trouble you to let me know the memory size of a single A100 GPU? When I tried to run run_r2r_xxl.sh on a 16GB GPU, I received an out-of-memory error.
I will debug and fix the problem later. It may have been introduced when I refactored the codebase for the release.
Here are some hints for solving the issue. The problem can occur when the model weights are loaded in float16 but training runs in float32; the weights can then overflow after the first update. Checking the model weights right after loading them, or casting them directly to full precision, might help. Otherwise, check the input data and make sure the input features contain no NaN values.
To train the xxl model we used an 80GB A100 GPU, but it can also be done on a single 40GB A100 by setting accumulate_gradient_steps, or simply by removing the very long trajectories from the training set.
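The checks described above could be sketched roughly as follows. This is a minimal illustration, assuming a standard PyTorch `nn.Module`; the helper names `report_bad_params` and `has_nan` are hypothetical and not part of the NavGPT-2 codebase, and the small `nn.Linear` only stands in for the real model.

```python
import torch
import torch.nn as nn


def report_bad_params(model: nn.Module) -> list[str]:
    """Return names of parameters that are not float32 or hold NaN/Inf values."""
    bad = []
    for name, param in model.named_parameters():
        wrong_dtype = param.dtype != torch.float32
        # Upcast before the finiteness test so the check works for any float dtype.
        non_finite = not bool(torch.isfinite(param.detach().float()).all())
        if wrong_dtype or non_finite:
            bad.append(name)
    return bad


def has_nan(features: torch.Tensor) -> bool:
    """True if any element of the input features is NaN."""
    return bool(torch.isnan(features).any())


# Stand-in for a model whose checkpoint was loaded in half precision.
model = nn.Linear(4, 2).half()
print(report_bad_params(model))  # lists the float16 parameters

# Cast every parameter and buffer to full precision before training.
model = model.float()
print(report_bad_params(model))  # should now be empty

# Check a batch of input features for NaN values.
features = torch.tensor([[0.1, float("nan"), 0.3, 0.4]])
print(has_nan(features))
```

Running the parameter check once right after loading the checkpoint, and the feature check on the first batch, should narrow down whether the weights or the inputs are responsible.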
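Removing the very long trajectories could look something like the sketch below. It assumes each training item is a dict whose `path` field lists the viewpoints of the trajectory; the function name `drop_long_trajectories`, the `max_steps` threshold, and the sample items are all hypothetical, so adjust them to the actual annotation format.

```python
def drop_long_trajectories(items, max_steps):
    """Keep only the items whose trajectory has at most max_steps viewpoints."""
    return [item for item in items if len(item["path"]) <= max_steps]


# Made-up items for illustration only; real annotations carry more fields.
train_items = [
    {"path_id": 1, "path": ["a", "b", "c"]},
    {"path_id": 2, "path": ["a", "b", "c", "d", "e", "f", "g", "h"]},
    {"path_id": 3, "path": ["a", "b", "c", "d", "e"]},
]

filtered = drop_long_trajectories(train_items, max_steps=5)
print([item["path_id"] for item in filtered])  # [1, 3]
```

The right threshold depends on how much memory the longest remaining trajectory needs, so it is worth checking how many items a given cutoff removes before training.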
Thank you very much for your reply.
Hello, your paper is excellent, but when I ran the script, the code encountered an error as shown in the attached image. Could you please advise on how to resolve this issue?