ilovecv closed this issue 1 year ago
Hi, our pre-trained model was trained on 80GB A100 GPUs. If you run into out-of-memory problems, you can try using a smaller transformer, a smaller T5 model (or another text encoder), or a smaller batch size.
Previously, our multi-node training was based on some company-internal libraries; this repo was re-written by me based on accelerate. I'm not 100% sure whether it works on multiple nodes, but I think you should set up the config files and run the training script on all the machines.
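For reference, a multi-node launch with accelerate would look roughly like the sketch below (I haven't verified this against this repo's train.py arguments; the node count, IP, and port are placeholders):

```bash
# Run the same command on every node, changing only --machine_rank
# (0 on the main node, 1 on the next node, ...).
# Example: 2 nodes x 8 GPUs = 16 processes; IP/port are placeholders.
accelerate launch \
  --multi_gpu \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  train.py
```

Alternatively, you can run `accelerate config` on each machine and answer the multi-node prompts there, then just call `accelerate launch train.py`.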
Thanks for mentioning it, I'll add that.
Thank you very much for your reply. For your multi-node training, did you extract the text and image embeddings offline and load them during training, or compute them on the fly as this repo does?
They are extracted on the fly.
Thank you very much for your quick reply. In the paper, the training set is described as a combination of CC3M, CC12M, filtered LAION-5B, and some internal datasets. How did you filter LAION-5B? I am also wondering if you have any suggestions on which public datasets we can use to train a good prior model.
The dataset filtering was not implemented by me, so I'm not sure about the details. But I think they filtered out images with sensitive information, image-text pairs that are poorly matched (i.e. have low CLIP similarity), and images whose size or resolution is too small.
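Just to illustrate the kind of filtering I mean (this is not the actual pipeline; the model choice and thresholds below are made up):

```python
# Illustrative filter, not the authors' pipeline: keep pairs whose image is
# large enough and whose CLIP image-text similarity exceeds a threshold.
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

MIN_SIZE = 256          # made-up resolution threshold
MIN_SIMILARITY = 0.25   # made-up CLIP similarity threshold

def keep_pair(image_path: str, caption: str) -> bool:
    image = Image.open(image_path).convert("RGB")
    if min(image.size) < MIN_SIZE:
        return False
    with torch.no_grad():
        image_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        text_feat = model.encode_text(clip.tokenize([caption], truncate=True).to(device))
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        similarity = (image_feat @ text_feat.T).item()
    return similarity >= MIN_SIMILARITY
```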
Thank you very much! For the text encoder, may I know why you used t5-11b instead of t5-xxl? Is there any reason behind it?
No, there is no specific reason. We just wanted to set a good baseline. We didn't investigate what kind of text encoder (different size, architecture, pre-training objective, ...) leads to better results on this task.
Thanks!
Hi,
May I know your machine configuration for training? And how do you do distributed training with multiple nodes? Thanks!
BTW, a center crop needs to be added to the CLIP image preprocessing after this line: https://github.com/drboog/Shifted_Diffusion/blob/main/train.py#L353
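Something like the standard CLIP preprocessing, e.g. (variable names here are just illustrative, not the ones used in train.py):

```python
# Sketch of CLIP-style preprocessing with the missing center crop added.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

clip_image_size = 224  # e.g. for ViT-B/32; adjust to the CLIP model used

clip_preprocess = transforms.Compose([
    transforms.Resize(clip_image_size, interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(clip_image_size),  # the step mentioned above
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
```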