Doubiiu / DynamiCrafter

[ECCV 2024] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Apache License 2.0

Questions about training #5

Closed rob-hen closed 6 months ago

rob-hen commented 7 months ago

Hi, first of all, this is really great work. Thanks for releasing the code.

I have a few questions about the training.

  1. In the first stage of training, the query transformer P is trained on SD2.1, so I understand all weights of SD2.1 are fixed and only P is trained. Do you also train lambda = tanh(alpha) at that stage?
  2. The number of queries in P depends on the number of frames. When you train with one query using a T2I model and then move to the next stage to train on a T2V model, do you initialize the F (=16) queries by repeating the one query that you trained on the T2I model (see the sketch after this list)?
  3. SD2.1 is trained at 768x768 resolution using v-prediction as the loss. So, do you train P in the first stage at 768x768 with v-prediction? If yes, I suppose you used the LAION dataset to train it (as WebVid10M does not offer that resolution)?
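To make question 2 concrete, here is roughly what I have in mind (a toy sketch; the dimensions and names are my own assumptions, not taken from the repo):

```python
import torch
import torch.nn as nn

d = 1024      # query embedding dim (my assumption for illustration)
n_tok = 16    # tokens per frame query (my assumption)
F = 16        # number of frames in the T2V stage

# query learned in the T2I stage (one frame's worth of queries)
t2i_query = nn.Parameter(torch.randn(1, n_tok, d))

# T2V stage: F per-frame queries, initialized by repeating the T2I query
t2v_queries = nn.Parameter(t2i_query.detach().clone().repeat(F, 1, 1))
print(t2v_queries.shape)  # torch.Size([16, 16, 1024])
```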

You mention later that, to avoid learning shortcuts, you randomly select a video frame as the image condition. Two questions about that:

  1. For the initial training stage on the T2I model, do you use the exact frame, and only randomly select a frame in the last training stage?
  2. Is the frame you use for conditioning always contained in the input video sequence x_t? That is, you trained with 16 frames; was the conditioning frame always contained in these 16 frames (see the toy sketch after this list)?
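For question 2 of this list, a minimal sketch of what I mean by the conditioning frame being contained in the clip (shapes are placeholders):

```python
import torch

# one training clip of 16 frames, shape (F, C, H, W); dimensions are placeholders
x = torch.randn(16, 3, 256, 256)

# randomly pick a frame from the same clip as the image condition,
# so the condition is always one of the 16 input frames
idx = torch.randint(0, x.shape[0], (1,)).item()
cond_image = x[idx]  # (C, H, W)
```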
Doubiiu commented 7 months ago

Hi. Thanks for your interest.

  1. No. In this stage, we keep lambda=1 (see the sketch at the end of these answers).

  2. Yes.

  3. No. Stable Diffusion 2.1-base is trained on 512x512 with eps (instead of v-prediction), we adopt that in the first stage and only use the WebVid10M dataset for training.

  4. Almost. We adopt random selection in the stages after the initial training stage on the T2I model.

  5. Yes.
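For reference regarding answer 1: a rough, single-head sketch of how the gated image cross-attention can be wired, with lambda = tanh(alpha) and the image projections W'_K / W'_V. It is illustrative only (dimensions and names are placeholders), not our exact implementation:

```python
import math
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Single-head sketch; dimensions and names are illustrative only."""
    def __init__(self, d=320, d_ctx=1024):
        super().__init__()
        self.to_q = nn.Linear(d, d, bias=False)
        # text projections (from the pretrained SD U-Net)
        self.to_k_txt = nn.Linear(d_ctx, d, bias=False)
        self.to_v_txt = nn.Linear(d_ctx, d, bias=False)
        # new image projections W'_K and W'_V, trained together with P
        self.to_k_img = nn.Linear(d_ctx, d, bias=False)
        self.to_v_img = nn.Linear(d_ctx, d, bias=False)
        # lambda = tanh(alpha); alpha is a learnable scalar
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, txt_ctx, img_ctx, fix_lambda_to_one=True):
        q = self.to_q(x)
        scale = 1.0 / math.sqrt(q.shape[-1])
        attn_txt = (q @ self.to_k_txt(txt_ctx).transpose(-1, -2) * scale).softmax(dim=-1)
        attn_img = (q @ self.to_k_img(img_ctx).transpose(-1, -2) * scale).softmax(dim=-1)
        # first training stage: the gate is kept fixed at 1 (answer 1 above)
        lam = 1.0 if fix_lambda_to_one else torch.tanh(self.alpha)
        return attn_txt @ self.to_v_txt(txt_ctx) + lam * (attn_img @ self.to_v_img(img_ctx))
```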

rob-hen commented 7 months ago

Great, thanks for all the answers. Sorry for the late reply (I was on vacation).

I have three more questions:

  1. Just to confirm: in the initial training stage, do you train P with SD 2.1-base on WebVid at 256x256?
  2. In the second training stage, where you train P together with the spatial layers of the T2V model, do you also train the initial convolution and the first ResNet block of the U-Net? If the image conditioning via P is injected into the spatial transformer blocks, it only affects later layers, not the initial convolution and the first ResNet block. So do you train only the spatial ResNet blocks (after the first spatial transformer), the spatial transformers, and the final convolution of the U-Net?
  3. In the first stage, I suppose you train P together with the projection matrices W'_K and W'_V that form the image keys and image values, correct?
Doubiiu commented 6 months ago

6. No. We train it at resolution 512x512: in the initial training stage we keep image SDv2.1 fixed and train only P, and image SDv2.1 works at a resolution of 512x512.

7-1. Yes. The mentioned conv is also regarded as a spatial layer. 7-2. No, sorry for the confusion. "Spatial layers" means all spatial layers in VideoCrafter (the same as the image SDv2.1 U-Net's layers, in contrast to the temporal layers VideoCrafter adds; VideoCrafter is built on image SDv2.1 by adding temporal layers). Since in this second training stage we have equipped VideoCrafter (the T2V model) with our P (trained on image SDv2.1), we would like to fuse them by finetuning (see the schematic sketch after answer 8).

8. Exactly.
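To illustrate answer 7 schematically, here is one hypothetical way to set up the stage-2 trainable parameters. The name-based filter is an assumption for illustration, not our actual code:

```python
import torch.nn as nn

def stage2_trainable_params(unet: nn.Module):
    """Hypothetical helper: keep the temporal layers VideoCrafter adds on top of
    image SDv2.1 frozen, and train every spatial layer (conv_in, ResBlocks,
    spatial transformers, conv_out). The "temporal" substring filter is an
    assumption about the parameter naming, not the repo's actual keys."""
    for name, p in unet.named_parameters():
        p.requires_grad_("temporal" not in name)
    return [p for p in unet.parameters() if p.requires_grad]
```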