drboog / ProFusion

Code for "Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach"
Apache License 2.0

Implementation details #7

Open bonlime opened 1 year ago

bonlime commented 1 year ago

First, thanks for a very interesting paper. Looking through your code I see that you use `_build_causal_attention_mask` and pass the resulting attention mask to the text encoder during training, which indeed seems to make sense. But none of the official 🤗 diffusers examples provide attention masks to the text encoder during training (in the Textual Inversion (TI) and DreamBooth training scripts). Why is that? I also see your comment `# the causal mask is important, don't forget it if you try to customize the code`. Do you think passing this mask could improve the results for vanilla TI training as well? As a side note, in my custom TI training, passing an `attention_mask` for padded tokens significantly improves the convergence speed and final results, but now I think the causal mask is needed in addition to that.
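
For reference, here is a minimal sketch of what I mean by passing the padding mask, assuming the `transformers` `CLIPTextModel` that the diffusers scripts load. This is illustrative only, not the exact training code:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

batch = tokenizer(
    ["a photo of a dog"],
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

# attention_mask is 1 for real tokens and 0 for padding; passing it stops the
# encoder from attending to padded positions. The causal mask is applied
# internally by CLIPTextModel regardless of this argument.
text_emb = text_encoder(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,
).last_hidden_state
```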

Also, in the paper you mention segmenting people/faces and inpainting them, but I can't see this in train.py. Have you tried just applying this mask to the predictions instead of using inpainting? It seems like a faster and potentially better approach. See this for details: https://github.com/cloneofsimo/lora/discussions/96
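
For example, something along these lines (a rough sketch of the masked-loss idea from the linked discussion; `mask` is a hypothetical foreground mask resized to latent resolution, and this is not tested against your code):

```python
import torch.nn.functional as F

# noisy_latents, noise: (B, 4, h, w); mask: (B, 1, h, w) with 1 on the
# subject and 0 on the background, downsampled to latent resolution.
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

loss = F.mse_loss(noise_pred, noise, reduction="none")
# Average the loss over the subject region only, so the background never
# contributes a training signal and no inpainting is needed.
loss = (loss * mask).sum() / (mask.sum() * loss.shape[1]).clamp(min=1.0)
```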

drboog commented 1 year ago
  1. You don't find an attention mask in the diffusers examples because it is already implemented inside the OpenCLIP text encoder. Please check the original text encoder code in the huggingface transformers repo. In this work, we do something inside the text encoder (after the embedding layer, before the transformer), so I explicitly write the mask out, because someone may forget about it when trying to implement similar ideas (see the sketch after this list).
  2. As I mentioned in the paper, the inpainting-related data augmentation is not used during pre-training; it is only used in fine-tuning.
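
To illustrate point 1: if you bypass `CLIPTextModel`'s forward and run its sub-modules yourself, you have to rebuild the causal mask. A minimal sketch follows; the helper mirrors what transformers builds internally, and `edit_embeddings` is a hypothetical stand-in for the customization step:

```python
import torch

def build_causal_attention_mask(bsz, seq_len, dtype, device):
    # Additive mask: 0 on and below the diagonal, a large negative value
    # above it, so each token attends only to itself and earlier tokens.
    mask = torch.full((bsz, seq_len, seq_len), torch.finfo(dtype).min, device=device)
    mask.triu_(1)
    return mask.unsqueeze(1)  # (bsz, 1, seq_len, seq_len)

text_model = text_encoder.text_model                # CLIPTextTransformer
hidden = text_model.embeddings(input_ids=input_ids)
hidden = edit_embeddings(hidden)                    # hypothetical customization step
causal = build_causal_attention_mask(
    hidden.shape[0], hidden.shape[1], hidden.dtype, hidden.device
)
hidden = text_model.encoder(
    inputs_embeds=hidden,
    causal_attention_mask=causal,  # easy to forget; results degrade silently without it
).last_hidden_state
hidden = text_model.final_layer_norm(hidden)
```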
bonlime commented 1 year ago

@drboog Could you please answer another question? In your codebase everything seems to be prepared to train (and use) down_block + mid_block residual prediction together with prompt prediction. Have you actually tried training the residuals? Does it work? Or did you just leave it in to publish another paper?
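
In case it helps others reading this thread, I assume such residuals would be consumed through the standard diffusers `UNet2DConditionModel` keyword arguments (the same hooks ControlNet uses). A guess, with `residual_net` purely hypothetical:

```python
# down_res: tuple of tensors matching the UNet down-block output shapes,
# mid_res: a tensor matching the mid-block output shape.
down_res, mid_res = residual_net(noisy_latents, timesteps, text_emb)

noise_pred = unet(
    noisy_latents,
    timesteps,
    encoder_hidden_states=text_emb,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```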