deepglugs / deep_imagen

scripts for running and training imagen-pytorch
About multi-GPU on a big dataset. #9

Closed (zhaobingbingbing closed 1 year ago)

zhaobingbingbing commented 1 year ago

Hi, thanks for your work. I am confused about multi-GPU training on a subset of LAION, about 7M. When I specify the GPU IDs, as in 'CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 imagen.py --train ', only the first GPU is used and training is very slow: one epoch would take 200 hours. When I use 'accelerate launch imagen.py', the data is processed, but training gets stuck in the first epoch. In both cases, GPU-Util is 0%. On a small dataset, however, both training and GPU-Util are normal. The problem seems to be in the DataGenerator class in data_generator.py. Have you had similar problems, or can you give some suggestions? Thanks again.
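
A GPU-Util of 0% on the large dataset but not on the small one suggests the GPUs may be waiting on data rather than computing. One way to check this is to time how long each batch takes to arrive, with no model involved. The snippet below is only a minimal sketch using the standard library: `time_batches` and `slow_batches` are hypothetical names, not part of deep_imagen, and `slow_batches` is a stand-in for whatever iterable the real DataGenerator feeds to training.

```python
import time


def time_batches(batches, max_batches=10):
    """Return the seconds spent waiting for each of the first max_batches items."""
    waits = []
    iterator = iter(batches)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # only the fetch is timed, no training step
        except StopIteration:
            break  # the source ran out before max_batches
        waits.append(time.perf_counter() - start)
    return waits


def slow_batches(n, delay=0.01):
    """Hypothetical stand-in for a data loader that takes `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield [i]


if __name__ == "__main__":
    waits = time_batches(slow_batches(5), max_batches=3)
    print(f"fetched {len(waits)} batches, slowest wait {max(waits):.3f}s")
```

If the per-batch wait on the 7M subset is far larger than on the small dataset, or the first batch never arrives, that would point at data loading rather than at the GPU selection.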