Doubiiu / DynamiCrafter

[ECCV 2024, Oral] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
https://doubiiu.github.io/projects/DynamiCrafter/
Apache License 2.0

Classifier-free Guidance Training #8

Closed rob-hen closed 8 months ago

rob-hen commented 8 months ago

Hi, once again, your work is really great, thanks for sharing all this and providing support.

In Section 4.1 of your paper you describe using multi-condition classifier-free guidance.

I could not find any information in the paper about specific training for that purpose. So my question is: during training, did you randomly replace the input image to the CLIP image encoder in the "Dual-stream image injection" block with a zero image? If so, can you provide the drop probabilities?

Doubiiu commented 8 months ago

Hi. Thanks for your interest. We treat multi-condition CFG as a common trick in diffusion-based image/video generation and editing with multiple conditions, such as Gen-1 and Instruct-Pix2Pix. Following Instruct-Pix2Pix, for the dual cross-attention we randomly drop the image only (zero image) for 5% of samples, drop the text only (empty string) for 5% of samples, and drop both for 5% of samples. We did not drop the image conditional latent in VDG. The details can be found in Instruct-Pix2Pix and its supplement.
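For concreteness, here is a minimal sketch of that per-sample dropping strategy. The helper name, the tensor/string conventions, and the way the "zero image" and empty caption are produced are assumptions for illustration, not the repository's actual code; the concatenated frame latent used for VDG is deliberately left untouched, as stated above.

```python
import torch

def drop_conditions_for_cfg(image, captions, p=0.05):
    """Hypothetical helper: per-sample condition dropping for CFG training.

    With probability p drop only the image (replace with a zero image fed to
    the CLIP image encoder), with probability p drop only the text (replace
    with an empty string), and with probability p drop both. The frame latent
    concatenated with the noise (VDG) is never dropped.

    image:    (B, C, H, W) tensor of conditioning frames
    captions: list of B caption strings
    """
    image = image.clone()
    captions = list(captions)
    r = torch.rand(image.shape[0])

    for i in range(image.shape[0]):
        if r[i] < p:             # drop image only
            image[i].zero_()
        elif r[i] < 2 * p:       # drop text only
            captions[i] = ""
        elif r[i] < 3 * p:       # drop both
            image[i].zero_()
            captions[i] = ""

    return image, captions
```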

rob-hen commented 8 months ago

That clarifies my question. Thanks a lot!

rob-hen commented 8 months ago

Do I understand it correctly that you were already doing the dropping in training phase 1, when you train the text-to-image model? Or did you start with it in phase 2 (where you train the T2V model together with the frame conditioning)?

So in training phase 1 you were dropping the image with 5% probability and the text with 5% probability, and in phases 2 and 3 you were dropping both with 5% probability?

Doubiiu commented 8 months ago

We adopt the same dropping strategy ("we randomly drop the image only (zero image) for 5% of samples, drop the text only (empty string) for 5% of samples, and drop both for 5% of samples for the dual cross-attention") in all training phases. "We did not drop the image conditional latent in VDG" means the frame latent concatenated with the noise is never dropped.
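At sampling time, training with these dropped conditions enables an Instruct-Pix2Pix-style multi-condition CFG combination (the formulation referenced in Section 4.1). The sketch below is an assumption about what that combination can look like; the scale names `s_img` / `s_txt` and their values are illustrative, and it requires three forward passes of the denoiser per step.

```python
def multi_cond_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    """Hedged sketch of Instruct-Pix2Pix-style multi-condition CFG.

    eps_uncond: noise prediction with both image and text dropped
    eps_img:    noise prediction with image only (text dropped)
    eps_full:   noise prediction with both image and text conditions
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```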

rob-hen commented 8 months ago

Thanks a lot!