Open becauseofAI opened 9 months ago
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigptv2_finetune.yaml
I suppose --nproc-per-node specifies the number of GPUs on a machine, right? Do you know where (in what file) I can specify which GPUs to use, i.e. something like device = torch.device('cuda:0,1')?
You can specify the GPUs through export CUDA_VISIBLE_DEVICES=0,1.
This is my script:
export CUDA_VISIBLE_DEVICES=0,1
torchrun \
--master-port 21116 \
--nproc-per-node=2 \
--nnodes=1 \
train.py \
--cfg-path train_configs/minigptv2_finetune.yaml
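As a rough sketch of how the two flags interact (assuming torchrun's standard behavior of launching one worker process per --nproc-per-node and exporting a per-process LOCAL_RANK environment variable): CUDA_VISIBLE_DEVICES=0,1 restricts which physical GPUs are visible, and each worker then binds to the visible device matching its local rank. The helper name below is hypothetical, not from the project's code:

```python
import os

# torchrun sets LOCAL_RANK for each worker it spawns. With
# CUDA_VISIBLE_DEVICES=0,1 and --nproc-per-node=2, the workers get
# LOCAL_RANK=0 and LOCAL_RANK=1, and each one uses cuda:<LOCAL_RANK>
# (indices are relative to the visible devices, not the physical ones).
def device_for_worker(env=os.environ):
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# In a real train.py one would typically call
# torch.cuda.set_device(local_rank); here we only show the mapping.
print(device_for_worker({"LOCAL_RANK": "1"}))  # cuda:1
```

So there is usually no file to edit for device selection: the launcher plus CUDA_VISIBLE_DEVICES decide which GPUs each process sees.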
Thank you!
Hello, I would like to ask: what does the --master-port parameter mean?
Does this project support multi-machine, multi-GPU training?
I tried doing this, but it got stuck.
Could you add support for this and provide training scripts?