XiangLi1999 / Diffusion-LM


Train on Multi GPU #20

Open henrydylan opened 2 years ago

henrydylan commented 2 years ago

Hi Lisa, I have successfully run your training script, but during training it turns out that GPU memory is not enough. The cluster I am using has multiple GPUs, but the training script does not seem to make use of them. Is there a way to train Diffusion-LM on multiple GPUs? Thank you very much if you know how; I am just a newbie in this field.

henrydylan commented 2 years ago

By the way, this is the traceback from directly running the training script:

Traceback (most recent call last):
  File "scripts/train.py", line 208, in <module>
    main()
  File "scripts/train.py", line 143, in main
    TrainLoop(
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 100, in __init__
    self.ema_params = [
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 101, in <listcomp>
    copy.deepcopy(self.master_params) for _ in range(len(self.ema_rate))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/site-packages/torch/nn/parameter.py", line 32, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 475.48 MiB already allocated; 3.81 MiB free; 522.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
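(Side note: the allocator option mentioned at the end of the error message can be set before training starts. A minimal sketch in Python, with 128 MiB as a purely illustrative split size:)

    import os

    # Set the CUDA caching allocator option suggested by the OOM message.
    # The variable is read when the allocator initializes, so set it before
    # any CUDA tensor is created; 128 (MiB) is just an illustrative value.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch  # safe to import and use CUDA after the variable is set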

henrydylan commented 2 years ago

Sorry... my bad. It seems someone else had been taking up all the GPU memory without my noticing. Now I can run the code! But I would still like to know the answer to this question anyway...

Junyi42 commented 2 years ago

Hi, I was also trying to train the model on multiple GPUs a few weeks ago. I just followed the setup in the iDDPM repo; you can do it simply by modifying scripts/run_train.py:

on line 100: f"python scripts/train.py " \

change it to: f"mpiexec -n 2 python scripts/train.py " \

Here, 2 is the number of GPUs you want to parallelize across. It works fine on my cluster; hope this helps. A rough sketch of the edit in context is below.
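For context, a minimal sketch (not the exact file contents) of what the command-building part of scripts/run_train.py might look like after the change; the flag values below are placeholders, and everything after the train.py prefix should stay exactly as it is in the original script:

    import os

    NUM_GPUS = 2  # number of MPI ranks; the iDDPM-style training code runs one rank per GPU

    # Prefix the original command with mpiexec and keep all of the original flags after it.
    COMMANDLINE = (
        f"mpiexec -n {NUM_GPUS} python scripts/train.py "
        f"--lr 1e-4 --batch_size 64 "  # placeholder flags; keep the ones from the original file
    )

    os.system(COMMANDLINE)  # the original script executes the assembled command in a similar way

The value passed to -n should not exceed the number of visible GPUs; as far as I remember, the iDDPM-style dist_util assigns each MPI rank to a GPU by its rank index.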