Alexkerl opened 5 months ago
https://huggingface.co/docs/diffusers/training/distributed_inference
The above tutorial may be helpful for distributed running, but when I try to run this program on four 2080 Ti GPUs (4 × 12 GB), I still hit an out-of-memory error.
Try switching the dtype to torch.bfloat16. However, bfloat16 seems to fall back to CPU mode on the 2080 Ti, which leads to lower speed.
Besides, you could refer to the official documentation on reducing memory usage: https://huggingface.co/docs/diffusers/main/en/optimization/memory
I set CUDA_VISIBLE_DEVICES=0,1,2,3, but it still only computes on a single GPU.