lliee1 opened 7 months ago
Thanks for your feedback. Training with DataParallel can sometimes be annoying; it is better to use DistributedDataParallel, as recommended by the official documentation. You may git pull the latest version and use DistributedDataParallel with the following command:
torchrun --nproc_per_node=2 --master_port=4321 pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml --launcher pytorch
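For context, that command spawns one worker process per GPU; below is a minimal sketch of the per-process DistributedDataParallel setup that such a launch expects (illustrative code only, not pyiqa's actual training script):

```python
# Minimal per-worker DDP setup under torchrun (illustrative, not pyiqa/train.py itself).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Each process builds the model on its own GPU, so no cross-device copies occur
    model = torch.nn.Linear(512, 1).cuda(local_rank)  # stand-in for the actual CLIP-IQA model
    model = DDP(model, device_ids=[local_rank])

    # ... build the dataloader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```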
In that case, please set num_gpu: 1 in the yml config file, since each worker process only manages a single GPU.
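The corresponding change in the config is just this one line (excerpt; the field name is the same num_gpu key from your yaml):

```yaml
# options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml (excerpt)
num_gpu: 1  # keep at 1 when launching with torchrun
```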
Thank you for your reply!
Since I'm currently working on something else, I'll leave a comment if I run into any problems.
Hi, I appreciate your repo. I've been using the CLIP-IQA model in your repo for study purposes. It worked well in a single-GPU setting when I followed your simple training script.
I want to use a distributed setting (single node, num_gpu=2). I tried simply changing the argument num_gpu: 1 -> 2 in the yaml file, and I encountered device mismatch errors in the LayerNorm and PromptLearner forward passes.
What solutions for a distributed setting can I try?
Additionally, I have included the command I used at the bottom: "python pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml"
Much appreciated, lliee