chaofengc / IQA-PyTorch

👁️ 🖼️ 🔥PyTorch Toolbox for Image Quality Assessment, including LPIPS, FID, NIQE, NRQM(Ma), MUSIQ, TOPIQ, NIMA, DBCNN, BRISQUE, PI and more...
https://iqa-pytorch.readthedocs.io/

About distributed training (CLIP-IQA) #155

Open lliee1 opened 1 month ago

lliee1 commented 1 month ago

Hi, I appreciate your repo. I've been using the CLIP-IQA model in your repo for study purposes. It worked well in a single-GPU setting when I followed your simple training scripts.

I want to use a distributed setting (single node, num_gpu=2). I tried simply changing the argument num_gpu: 1 -> 2 in the YAML file, and I encountered some device-mismatch errors in the LayerNorm and PromptLearner forward passes.

What solutions can I try for the distributed setting?

Additionally, here is the command I used:

python pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml

Much appreciated, lliee

chaofengc commented 1 month ago

Thanks for your feedback. Training with DataParallel can sometimes be annoying. It is better to use DistributedDataParallel, as recommended by the official documentation.

You may git pull the latest version and use DistributedDataParallel with the following command:

torchrun --nproc_per_node=2 --master_port=4321 pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml --launcher pytorch

In that case, please set num_gpu: 1 in the yml config file.
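
For reference, with `--launcher pytorch` each process started by torchrun builds its own full copy of the model and wraps it in DistributedDataParallel, so every submodule (including CLIP's LayerNorm layers and the PromptLearner) lives on a single device within each process. This is why the device-mismatch errors seen under DataParallel go away. Below is a minimal, generic sketch of that launch pattern, not the repo's actual trainer code; `build_model()` is a placeholder:

```python
# Generic DDP setup sketch; build_model() is a hypothetical placeholder.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)       # placeholder model constructor
    model = DDP(model, device_ids=[local_rank])  # one full replica per GPU

    # ... training loop: each process consumes its own shard of the data,
    # and DDP averages gradients across processes after backward().

if __name__ == "__main__":
    main()
```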

lliee1 commented 1 month ago

Thank you for your reply!

Since I'm currently working on something, I'll leave a comment if I run into any problems.