Doubiiu / DynamiCrafter

[ECCV 2024] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Apache License 2.0

How did you train Query Transformer weights? #45

Open Gasso21 opened 3 months ago

Gasso21 commented 3 months ago

Hello, I have a question from reading your paper. In it, you mention the use of a Query Transformer with Learnable Latent Vectors. Upon closer examination, it appears that the Query Transformer consists of a block of Perceiver Attention and feed-forward (FF) layers repeated 4 times, applied to the learnable latent vectors.
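
To make my reading of the architecture concrete, here is a minimal sketch of what I understand the Query Transformer to be: learnable latent queries cross-attending to image tokens through 4 repeated (Perceiver Attention, FF) blocks. All dimensions, names, and the use of `nn.MultiheadAttention` are my own assumptions for illustration, not the actual DynamiCrafter code:

```python
import torch
import torch.nn as nn

class QueryTransformerSketch(nn.Module):
    """Hypothetical re-implementation of a Perceiver-style query transformer.

    Sizes and layer choices are illustrative assumptions only.
    """
    def __init__(self, dim=64, num_latents=16, depth=4, heads=4):
        super().__init__()
        # Learnable latent vectors: a single shared set of query tokens.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # "Repeated 4 times": depth independent (Perceiver Attn, FF) blocks.
        self.layers = nn.ModuleList([
            nn.ModuleList([
                nn.MultiheadAttention(dim, heads, batch_first=True),
                nn.Sequential(nn.LayerNorm(dim),
                              nn.Linear(dim, dim * 4),
                              nn.GELU(),
                              nn.Linear(dim * 4, dim)),
            ])
            for _ in range(depth)
        ])

    def forward(self, image_tokens):
        # image_tokens: (B, T, dim), e.g. CLIP patch embeddings.
        b = image_tokens.shape[0]
        x = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn, ff in self.layers:
            # Latents query the image tokens (cross-attention) + residual.
            out, _ = attn(x, image_tokens, image_tokens)
            x = x + out
            x = x + ff(x)
        return x  # (B, num_latents, dim) context tokens for the denoiser

model = QueryTransformerSketch()
tokens = torch.randn(2, 257, 64)  # e.g. 257 CLIP ViT patch tokens
ctx = model(tokens)
print(ctx.shape)
```

Under this reading, the only new parameters beyond the attention/FF weights are the latent queries themselves, which is what prompts my question about how they were trained.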

My two hypotheses are:

  1. The Query Transformer was trained separately to enhance the image's details, after the image passes through FrozenOpenCLIPImageEmbedderV2 and then the image_proj_stage_config module.
  2. Its weights were fine-tuned jointly with the Spatial Attn layers (with the Temporal Attn layers frozen), without any separate training stage.
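
If my second hypothesis is right, I imagine the trainable/frozen split would look something like the toy sketch below. The module names here are purely illustrative stand-ins, not the actual U-Net layer names in this repo:

```python
import torch.nn as nn

# Toy stand-in for one denoiser block; names are illustrative only.
class ToyBlock(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.spatial_attn = nn.Linear(dim, dim)   # stands in for Spatial Attn
        self.temporal_attn = nn.Linear(dim, dim)  # stands in for Temporal Attn

unet = nn.Sequential(ToyBlock(), ToyBlock())

# Hypothesis 2: freeze temporal layers; fine-tune spatial layers jointly
# with the query transformer (query transformer omitted here).
for name, p in unet.named_parameters():
    p.requires_grad = "temporal_attn" not in name

trainable = [n for n, p in unet.named_parameters() if p.requires_grad]
print(trainable)
```

That is, the optimizer would only ever see the spatial-attention and query-transformer parameters, and the standard diffusion denoising loss would update them end to end.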

If the first assumption is correct, I would like to know how you computed the loss against the input image. If the second is correct, I'm interested in understanding how you conducted the joint training. Could you please explain in detail how the Query Transformer's weights were trained?