Open becauseofAI opened 9 months ago
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigptv2_finetune.yaml
I suppose --nproc-per-node specifies the number of GPUs on a machine, right? Do you know where (in what file) I can specify which GPUs to use, i.e. something like device = torch.device('cuda:0,1')?
You can specify the GPUs through export CUDA_VISIBLE_DEVICES=0,1.
This is my script:
export CUDA_VISIBLE_DEVICES=0,1
torchrun \
--master-port 21116 \
--nproc-per-node=2 \
--nnodes=1 \
train.py \
--cfg-path train_configs/minigptv2_finetune.yaml
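As a rough sketch of how the two flags interact (assuming torchrun's standard behavior of launching one worker process per --nproc-per-node and exporting a per-process LOCAL_RANK environment variable): CUDA_VISIBLE_DEVICES=0,1 restricts which physical GPUs are visible, and each worker then binds to the visible device matching its local rank. The helper name below is hypothetical, not from the project's code:

```python
import os

# torchrun sets LOCAL_RANK for each worker it spawns. With
# CUDA_VISIBLE_DEVICES=0,1 and --nproc-per-node=2, the workers get
# LOCAL_RANK=0 and LOCAL_RANK=1, and each one uses cuda:<LOCAL_RANK>
# (indices are relative to the visible devices, not the physical ones).
def device_for_worker(env=os.environ):
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# In a real train.py one would typically call
# torch.cuda.set_device(local_rank); here we only show the mapping.
print(device_for_worker({"LOCAL_RANK": "1"}))  # cuda:1
```

So there is usually no file to edit for device selection: the launcher plus CUDA_VISIBLE_DEVICES decide which GPUs each process sees.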
Thank you!
Hello, I would like to ask: what does the --master-port parameter mean?
Does this project support multi-machine, multi-GPU training?
I tried doing this, but it got stuck.
Could you add support for this and provide training scripts?