GengzeZhou / NavGPT-2

[ECCV 2024] Official implementation of NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
MIT License

How to use two GPUs to train the Navigation Policy Network? #2

Open TLvCo opened 3 months ago

TLvCo commented 3 months ago

Because training the Navigation Policy Network on a single 4090 GPU takes too long, we plan to parallelize training across multiple GPUs. We therefore modified the parameter in the `run_r2r_xl.sh` file as follows: `ngpus=2  # old value: 1`


However, both processes appear to be stuck waiting for a signal to start training, as shown in the log below:

[INFO] Freezing the Q-Former.
[INFO] Model loaded, using 2 GPUs
[INFO] Total parameters: 1473468964
[INFO] Trainable parameters: 62724132
[INFO] Removed T5 decoder and LM head to save memory during training.
[INFO] Freezing the Q-Former.
[INFO] Model loaded, using 2 GPUs
[INFO] Total parameters: 1473468964
[INFO] Trainable parameters: 62724132
Optimizer: adamW
Optimizer: adamW

[INFO] Listener training starts, start iteration: 0

[INFO] Listener training starts, start iteration: 0
/home/tl/NavGPT-2/map_nav_src/models/lavis/models/blip2_models/blip2.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  return torch.cuda.amp.autocast(dtype=dtype)
/home/tl/NavGPT-2/map_nav_src/models/lavis/models/blip2_models/blip2.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  return torch.cuda.amp.autocast(dtype=dtype)
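A common cause of this kind of hang (an assumption here, since the repo's launcher code isn't shown in the issue) is that each process blocks inside `torch.distributed.init_process_group` waiting for the other ranks to join the rendezvous: if `WORLD_SIZE` does not match the number of launched processes, or the two processes disagree on `MASTER_ADDR`/`MASTER_PORT`, initialization never completes. A quick sanity check is to print, in each process, the environment variables that PyTorch's `env://` init method reads:

```python
import os

# Diagnostic sketch: report the rendezvous variables that
# torch.distributed's env:// init method consumes. If WORLD_SIZE does
# not equal the number of launched processes, init_process_group
# blocks forever waiting for the missing ranks.
def distributed_env():
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return {k: os.environ.get(k, "<unset>") for k in keys}

if __name__ == "__main__":
    for key, value in distributed_env().items():
        print(f"{key}={value}")
```

If these print `<unset>` (or the same `RANK` in both processes), the script is spawning processes without setting up the per-rank environment, which matches the symptom of both workers loading the full model and then stalling.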

How should I modify the code to enable parallel training across multiple GPUs?
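Not the maintainers' answer, just a sketch of the usual fix: when a PyTorch script expects one process per GPU, it is typically launched with `torchrun`, which spawns the workers and sets a distinct `RANK`/`LOCAL_RANK` plus a consistent `WORLD_SIZE` and master address for each of them. The entry-point path and flags below are hypothetical; the real ones are whatever `run_r2r_xl.sh` invokes.

```shell
# Hypothetical launch command; substitute the actual entry point and
# arguments from run_r2r_xl.sh. torchrun spawns one process per GPU
# and exports RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT, which init_process_group's env:// rendezvous requires.
torchrun --nproc_per_node=2 --master_port=29500 \
    map_nav_src/r2r/main_nav.py \
    --world_size 2   # must match --nproc_per_node
```

Raising `ngpus` alone only tells the script how many processes to expect; without a launcher (or `torch.multiprocessing.spawn`) providing the per-rank environment, both processes wait at the rendezvous indefinitely, which is consistent with the log above.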