drboog / Shifted_Diffusion

Code for Shifted Diffusion for Text-to-image Generation (CVPR 2023)

how much memory do we need for training? #5

Closed · ilovecv closed this issue 1 year ago

ilovecv commented 1 year ago

Hi,

May I know your machine configuration for training? And how to do distributed training with multiple nodes? Thanks!

BTW, a center crop needs to be added to the CLIP image preprocessing after this line: https://github.com/drboog/Shifted_Diffusion/blob/main/train.py#L353
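
For concreteness, a minimal sketch of the kind of center crop meant here, assuming a torchvision-style pipeline (the actual preprocessing in train.py may differ):

```python
import torchvision.transforms as T

# CLIP's visual encoder expects square inputs (e.g. 224x224); resizing the
# shorter side alone leaves a non-square image, so a center crop is applied
# after the resize.
clip_image_size = 224  # assumed CLIP input resolution

clip_preprocess = T.Compose([
    T.Resize(clip_image_size, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(clip_image_size),  # the missing step mentioned above
    T.ToTensor(),
    T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                std=(0.26862954, 0.26130258, 0.27577711)),  # CLIP normalization constants
])
```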

drboog commented 1 year ago

Hi, our pre-trained model was trained on 80GB A100 GPUs. If you run into out-of-memory problems, you can try using a smaller transformer, a smaller T5 model (or another text encoder), or a smaller batch size.
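
For illustration, a minimal sketch of the "smaller T5" option using Hugging Face transformers; the checkpoint name here is an assumption, not the repo default:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Smaller T5 checkpoints trade some text-embedding quality for a much smaller
# memory footprint: t5-11b (used for the released model) needs tens of GB on
# its own, while t5-large fits on a single consumer GPU.
text_encoder_name = "t5-large"  # illustrative choice, not the repo default

tokenizer = T5Tokenizer.from_pretrained(text_encoder_name)
text_encoder = T5EncoderModel.from_pretrained(text_encoder_name).eval().requires_grad_(False)

tokens = tokenizer(["a corgi wearing sunglasses"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state  # (batch, seq_len, d_model)
```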

Previously, our multi-node training was based on some internal company libraries; this repo was rewritten by me on top of Accelerate. I'm not 100 percent sure whether it works multi-node, but I think you would need to set up the config files and run the training script on all the machines.
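
As a sketch only, a multi-node launch with Accelerate might look like the following; the flag values are assumptions and the arguments to train.py are omitted:

```bash
# Run the same command on every machine, changing only --machine_rank
# (0 on the main machine, 1 on the next, ...). Machine 0 must be reachable
# at $MAIN_IP:29500, and --num_processes is the total GPU count across nodes.
accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 16 \
    --machine_rank 0 \
    --main_process_ip "$MAIN_IP" \
    --main_process_port 29500 \
    train.py
```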

Thanks for mentioning it, I'll add that.

ilovecv commented 1 year ago

Thank you very much for your reply. For your multi-node training, did you extract the text and image embeddings offline and load them during training, or compute them on the fly as in this repo?

drboog commented 1 year ago

They are extracted on the fly.
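
To make "on the fly" concrete, a minimal sketch of per-batch extraction inside the training loop; the encoder and batch names are illustrative, not the repo's actual variables:

```python
import torch

def encode_batch_on_the_fly(clip_model, text_encoder, images, tokens):
    """Extract embeddings per batch instead of loading cached ones from disk.

    `clip_model` and `text_encoder` are assumed to be frozen pretrained
    encoders (e.g. OpenAI CLIP and a T5 encoder); `images` and `tokens`
    come straight from the dataloader batch.
    """
    with torch.no_grad():
        image_emb = clip_model.encode_image(images)          # target CLIP image embedding
        text_emb = text_encoder(**tokens).last_hidden_state  # text conditioning
    return image_emb, text_emb
```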

ilovecv commented 1 year ago

Thank you very much for your quick reply. In the paper, it is mentioned that the training set is a combination of CC3M, CC12M, filtered LAION-5B, and some internal datasets. How did you filter LAION-5B? I am wondering if you have any suggestions on which public datasets we could use to train a good prior model.

drboog commented 1 year ago

The dataset filtering was not implemented by me, so I'm not sure about some of the details. But I think they filtered out images with sensitive content, image-text pairs with low CLIP similarity (i.e. poorly matched pairs), and images whose size or resolution is too small.
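
A rough sketch of that kind of CLIP-similarity and resolution filtering, with illustrative thresholds; this is not the pipeline actually used for the paper:

```python
import torch

# Thresholds are illustrative; the sensitive-content step (a separate NSFW
# classifier) is omitted here entirely.
MIN_SIDE = 256        # drop images whose shorter side is too small
MIN_CLIP_SIM = 0.28   # drop pairs with low image-text CLIP similarity

def keep_pair(clip_model, clip_preprocess, clip_tokenize, image, caption):
    """Return True if a (PIL image, caption) pair passes the basic filters."""
    if min(image.size) < MIN_SIDE:  # PIL: image.size == (width, height)
        return False
    with torch.no_grad():
        img_emb = clip_model.encode_image(clip_preprocess(image).unsqueeze(0))
        txt_emb = clip_model.encode_text(clip_tokenize([caption]))
    return torch.cosine_similarity(img_emb, txt_emb).item() >= MIN_CLIP_SIM
```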

ilovecv commented 1 year ago

Thank you very much! For the text encoder, may I know why you used t5-11b instead of t5-xxl? Are there any reasons behind it?

drboog commented 1 year ago

No, there is no specific reason; we just wanted to set a good baseline. We didn't investigate what kind of text encoder (different size, architecture, pre-training objective, ...) would lead to better results on this task.

ilovecv commented 1 year ago

Thanks!