I found that the code uses DataParallel instead of DistributedDataParallel. This results in uneven memory distribution: GPU 0 is full while the other GPUs sit at only about half usage. I'm training on 3090s, and the batch size for each GPU maxes out at 2.
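For reference, here is a minimal sketch of the standard DistributedDataParallel setup I switched to (the `nn.Linear` model is just a stand-in, not this repo's code; launch via `torchrun`):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for the real model. Each process owns exactly one GPU,
# so memory use is balanced across devices (unlike DataParallel,
# where GPU 0 also holds the gathered outputs).
model = nn.Linear(128, 128).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Data loading side, for a dataset object from the repo:
# sampler = torch.utils.data.distributed.DistributedSampler(dataset)
# loader = torch.utils.data.DataLoader(dataset, batch_size=2, sampler=sampler)
# Note batch_size here is per GPU, not global.
```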
When I modify the code to use DistributedDataParallel (roughly as sketched above), the memory of each GPU is almost full (the per-GPU batch size is still 2), and training does become faster. However, I would like to increase the batch size to 4, and that does not seem to fit. I suspect the emahelper in the code takes up a lot of memory, since it keeps and updates the EMA model's parameters on the GPU. Is there any way to solve this problem?
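One workaround I'm considering is keeping the EMA shadow weights on the CPU, so the averaged copy costs no GPU memory and only the `.cpu()` transfer happens per update. A rough sketch of the idea (`CPUEMAHelper` is a hypothetical name of mine, not the repo's actual emahelper):

```python
import torch

class CPUEMAHelper:
    """Hypothetical EMA helper whose shadow weights live on the CPU,
    so the averaged copy uses no GPU memory."""

    def __init__(self, model, mu=0.999):
        self.mu = mu
        # Shadow parameters are stored as CPU tensors.
        self.shadow = {
            name: p.detach().cpu().clone()
            for name, p in model.named_parameters() if p.requires_grad
        }

    @torch.no_grad()
    def update(self, model):
        for name, p in model.named_parameters():
            if p.requires_grad:
                # Copy the live parameter to CPU, then blend it in:
                # shadow = mu * shadow + (1 - mu) * param
                self.shadow[name].mul_(self.mu).add_(
                    p.detach().cpu(), alpha=1.0 - self.mu
                )

    @torch.no_grad()
    def copy_to(self, model):
        # Load the averaged weights back onto the model's device for eval.
        for name, p in model.named_parameters():
            if p.requires_grad:
                p.copy_(self.shadow[name].to(p.device))
```

The trade-off is an extra GPU-to-CPU copy on every update step, so it exchanges some speed for the freed GPU memory.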