Hi. Thanks for your interest.
No. In this stage, we keep lambda=1.
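To make that concrete: the lambda here is the tanh gate from the question below (lambda = tanh(alpha)). A minimal sketch with hypothetical names; the residual blend is an assumption, not our exact code:

```python
import torch
import torch.nn as nn

class TanhGate(nn.Module):
    """Scalar gate lambda = tanh(alpha); a sketch, not the exact implementation."""

    def __init__(self, fix_lambda=1.0):
        super().__init__()
        # alpha is learnable; tanh keeps lambda in (-1, 1).
        self.alpha = nn.Parameter(torch.zeros(()))
        # In the first stage the gate is bypassed by fixing lambda = 1;
        # pass fix_lambda=None later to make the gate learnable.
        self.fix_lambda = fix_lambda

    def forward(self, base, branch):
        lam = torch.tanh(self.alpha) if self.fix_lambda is None else self.fix_lambda
        # How the branch joins the base features is an assumption here;
        # a simple residual blend is shown.
        return base + lam * branch
```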
Yes.
No. Stable Diffusion 2.1-base is trained on 512x512 with eps-prediction (instead of v-prediction); we adopt that in the first stage and use only the WebVid10M dataset for training.
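For reference, the two parameterizations differ only in the regression target. A minimal sketch under the standard definitions (x_t = alpha_t * x0 + sigma_t * noise), not code from this repo:

```python
def diffusion_target(x0, noise, alpha_t, sigma_t, parameterization="eps"):
    """Regression target for the denoiser, given x_t = alpha_t * x0 + sigma_t * noise."""
    if parameterization == "eps":
        # SD 2.1-base (512x512) predicts the noise epsilon directly.
        return noise
    if parameterization == "v":
        # SD 2.1 at 768x768 uses v-prediction (Salimans & Ho, 2022):
        # v = alpha_t * eps - sigma_t * x0.
        return alpha_t * noise - sigma_t * x0
    raise ValueError(f"unknown parameterization: {parameterization}")
```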
Almost. In fact, we adopt random selection in the stages after the initial training stage on the T2I model.
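A minimal sketch of that random selection at batch-construction time (hypothetical names; the actual dataloader may differ):

```python
import torch

def pick_image_condition(video, random_frame=True):
    """Pick the conditioning image from a clip of shape (T, C, H, W).

    A random frame (rather than always frame 0) keeps the model from
    learning the short-cut "copy the condition into the first frame".
    """
    t = torch.randint(video.shape[0], (1,)).item() if random_frame else 0
    return video[t]
```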
Yes.
Great, thanks for all the answers. Sorry for the late reply (I was on vacation).
I have three more questions:
6. No. We train it at a resolution of 512x512: in the initial training stage we keep image SDv2.1 fixed and train only P, and image SDv2.1 operates at 512x512.
7-1. Yes. The mentioned conv layers are also regarded as spatial layers.
7-2. No, sorry for the confusion. "Spatial layers" means all spatial layers in VideoCrafter, i.e. the same layers as the image SDv2.1 U-Net's, in contrast to the temporal layers VideoCrafter adds (VideoCrafter is built on image SDv2.1 by adding temporal layers). Since in this second training stage we have equipped VideoCrafter (the T2V model) with our P (trained on image SDv2.1), we fuse them by finetuning; see the sketch below.
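As an illustration of that split (a sketch assuming the temporal modules can be identified by name, which may not match the real code):

```python
def split_spatial_temporal(unet):
    """Partition U-Net parameters into spatial layers (inherited from image
    SDv2.1, conv layers included) and temporal layers (added by VideoCrafter).

    The name-based predicate is an assumption; adjust it to the real
    module naming in the codebase.
    """
    spatial, temporal = [], []
    for name, param in unet.named_parameters():
        (temporal if "temporal" in name else spatial).append(param)
    return spatial, temporal

# Second-stage fusion: fine-tune the spatial layers jointly with P, e.g.
# optimizer = torch.optim.AdamW(spatial + list(P.parameters()), lr=1e-5)
```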
8. Exactly.
Hi, first of all, this is really great work. Thanks for releasing the code.
I have a few questions about the training.
lambda = tanh(alpha)
You mention later that, to avoid learning short-cuts, you randomly select a video frame as the image condition. Two questions about that: