bonlime opened 1 year ago
@drboog Could you please answer another question? In your codebase everything seems to be prepared to train (and use) `down_block` + `mid_block` residual prediction together with prompt prediction. Have you tried actually training the residuals? Does it work? Or did you just leave it to publish another paper?
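To make sure we are talking about the same thing, here is roughly what I imagine by "training the residuals" — just my guess at the setup, not your code. I'm assuming the ControlNet-style `down_block_additional_residuals` / `mid_block_additional_residual` kwargs of diffusers' `UNet2DConditionModel`, and `residual_predictor` is a hypothetical module:

```python
import torch.nn.functional as F

def training_step(unet, residual_predictor, noisy_latents, timesteps,
                  encoder_hidden_states, target_noise, cond):
    # Hypothetical predictor that outputs one residual per UNet down block
    # plus one for the mid block, conditioned on e.g. reference-image features.
    down_residuals, mid_residual = residual_predictor(cond, timesteps)

    # diffusers' UNet2DConditionModel accepts ControlNet-style residual kwargs,
    # so the predicted residuals can be injected directly into the UNet.
    noise_pred = unet(
        noisy_latents,
        timesteps,
        encoder_hidden_states=encoder_hidden_states,
        down_block_additional_residuals=down_residuals,
        mid_block_additional_residual=mid_residual,
    ).sample

    # Train the predictor jointly with the usual noise-prediction objective.
    return F.mse_loss(noise_pred, target_noise)
```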
First, thanks for a very interesting paper. Looking through your code I see that you use `_build_causal_attention_mask` and pass the resulting causal mask to the text encoder during training, which indeed seems to make sense. But none of the official 🤗 diffusers examples provide an attention mask to the text encoder during training (neither the TI nor the DB training scripts) — why is that? I also see your comment `# the causal mask is important, don't forget it if you try to customize the code`. Do you think passing this mask could improve the results for vanilla TI training as well? As a side note, in my custom TI training, passing `attention_mask` for padded tokens significantly improves the convergence speed and final results, but now I think that in addition to that the causal mask is needed (a rough sketch of my current setup is at the end of this comment).

Also, in the paper you mention segmenting people/faces and inpainting them, but I can't see this in `train.py`. Have you tried just applying this mask to the predictions instead of using inpainting? It seems like a faster and potentially better approach (sketched below). See this for details: https://github.com/cloneofsimo/lora/discussions/96
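What I mean by "applying the mask to the predictions" (the idea from the linked lora discussion) is roughly the following — just a sketch, assuming `mask` is a `[B, 1, H, W]` person/face segmentation in pixel space and `noise_pred` / `target_noise` are the usual UNet prediction and target in latent space:

```python
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred, target_noise, mask):
    # SD latents are 8x downsampled relative to pixels, so resize the
    # pixel-space mask to the latent spatial size.
    latent_mask = F.interpolate(mask, size=noise_pred.shape[-2:], mode="nearest")

    loss = F.mse_loss(noise_pred, target_noise, reduction="none")

    # Penalize only the segmented (person/face) region, instead of
    # inpainting the background beforehand.
    denom = latent_mask.expand_as(loss).sum().clamp(min=1.0)
    return (loss * latent_mask).sum() / denom
```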
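And coming back to the attention-mask question, this is roughly my current TI setup (padding mask only) plus what I understand the causal mask to be — a sketch, with version-dependent details omitted. If I understand correctly, the stock `CLIPTextModel` forward builds the causal mask internally, so it only needs to be passed explicitly in custom code paths like yours that bypass that forward:

```python
import torch

tok = tokenizer(prompts, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt")

# Padding mask: 1 for real tokens, 0 for padding. This alone already
# speeds up convergence noticeably in my experiments.
encoder_out = text_encoder(tok.input_ids, attention_mask=tok.attention_mask)
prompt_embeds = encoder_out[0]

# Roughly what _build_causal_attention_mask produces: an additive mask with
# -inf strictly above the diagonal, so token i cannot attend to tokens j > i.
seq_len = tok.input_ids.shape[-1]
causal_mask = torch.full((seq_len, seq_len), float("-inf")).triu(1)
```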
for padded tokens significantly improves the convergence speed and final results, but now I think that in addition to that causal mask is needed.Also in the paper you mention segmenting peoples / faces and inpainting then, but I can't see this in the train.py. Also have you tried just applying this mask to predictions, instead of using inpaining? it seems as faster and also potentially better approach. See this for details: https://github.com/cloneofsimo/lora/discussions/96