HVision-NKU / StoryDiffusion


Paper implementations details #107

Open armored-guitar opened 1 month ago

armored-guitar commented 1 month ago

Hi. Thank you for your great work! I am trying to reproduce your code, and I would appreciate your help with a few details:

- Do you use consistent self-attention for video training?
- The architecture figure on page 6 says the images (2×H×W×3) are compressed into a semantic space of shape 2×N×C. What is N: 257 (the CLIP output) or 1 (a linear projection)?
- What is the sequence length for the motion transformer? If it is F×N, what is N?

Looking forward to your answer.

Z-YuPeng commented 1 month ago

Hi, we extracted the intermediate features of the CLIP image encoder as image tokens. To reduce the computational load during replication, I suggest you obtain 16 tokens based on the pre-trained IP-Adapter model.
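For reference, a minimal sketch of what obtaining such image tokens could look like with Hugging Face `transformers`; the checkpoint name, the pooling, and the single-linear-layer projection to 16 tokens are assumptions for illustration, not the released training code:

```python
# Sketch of extracting CLIP image tokens and reducing them to 16 tokens.
# Checkpoint name and the projection layer are assumptions, not the released code.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"   # assumed image encoder checkpoint
processor = CLIPImageProcessor.from_pretrained(name)
encoder = CLIPVisionModel.from_pretrained(name)

image = Image.new("RGB", (224, 224))             # stand-in for a real key frame
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = encoder(pixel_values)

image_tokens = out.last_hidden_state             # [1, 257, 1280]: cls token + 256 patch tokens

# IP-Adapter-style reduction to a small, fixed number of tokens (e.g. 16).
# A single linear layer stands in here for the pre-trained projection.
num_tokens, token_dim = 16, 768
proj = torch.nn.Linear(encoder.config.hidden_size, num_tokens * token_dim)
tokens_16 = proj(out.pooler_output).view(1, num_tokens, token_dim)   # [1, 16, 768]
```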

armored-guitar commented 1 month ago

@Z-YuPeng Thank you for your answer! Could you please clarify a few more things about your implementation?

So, you have batch_size × 2 × n_tokens × channels, interpolate between the two sets of tokens to get batch_size × n_frames × n_tokens × channels, then reshape it to batch_size × (n_frames · n_tokens) × channels and apply 2D positional encoding, right? How do you then feed these n_frames × n_tokens tokens to the model? Do you do it by concatenating the n_tokens of each frame?
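To make sure we mean the same thing, here is a rough sketch of the tensor shapes I have in mind (my own guess, not your code; the values of `n_frames`, `channels`, and the linear interpolation are assumptions):

```python
import torch

batch_size, n_tokens, channels, n_frames = 2, 16, 768, 16

key_tokens = torch.randn(batch_size, 2, n_tokens, channels)   # tokens of the two key images
start, end = key_tokens[:, 0], key_tokens[:, 1]                # [B, N, C] each

# Linear interpolation between the two token sets along the frame axis.
alphas = torch.linspace(0, 1, n_frames).view(1, n_frames, 1, 1)
frames = start.unsqueeze(1) * (1 - alphas) + end.unsqueeze(1) * alphas   # [B, F, N, C]

# Flatten frames and tokens into one sequence before adding 2D positional encoding.
sequence = frames.reshape(batch_size, n_frames * n_tokens, channels)     # [B, F*N, C]
```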

In your computations, did you use all 257 tokens?

And about the token size: in the implementation details you say that the transformer token dimension is 1024, which is not consistent with the CLIP-H/14 hidden size. Can you please explain this?

And what about consistent self-attention during video generation?

Z-YuPeng commented 1 month ago

The transformer prediction is carried out solely along the frame dimension. The dimensional inconsistency arises because the linear projection increases the dimension, which is then remapped back to 768 at the output. Subsequently, these N×F tokens are individually inserted into the cross-attention of each corresponding frame.
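In other words, roughly the following shape flow (a sketch to illustrate the idea; the layer sizes, depth, and input dimension here are illustrative assumptions, not the exact released architecture):

```python
import torch
import torch.nn as nn

B, F, N = 2, 16, 16                      # batch, frames, image tokens per frame
in_dim, model_dim, out_dim = 768, 1024, 768

tokens = torch.randn(B, F, N, in_dim)    # interpolated image tokens

proj_in = nn.Linear(in_dim, model_dim)   # projection up to the transformer dimension
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True),
    num_layers=4,
)
proj_out = nn.Linear(model_dim, out_dim)  # remap back to 768 for cross-attention

x = proj_in(tokens)                                        # [B, F, N, model_dim]
x = x.permute(0, 2, 1, 3).reshape(B * N, F, model_dim)     # attend only over frames
x = temporal(x)                                            # [B*N, F, model_dim]
x = x.reshape(B, N, F, model_dim).permute(0, 2, 1, 3)      # back to [B, F, N, model_dim]
out = proj_out(x)                  # [B, F, N, 768], inserted per frame into cross-attention
```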

adolkhan commented 1 month ago

Hey @Z-YuPeng !

Thank you very much for your informative answers!

I would like to know a bit more about the CLIP features. Did you use CLIPVisionModelWithProjection, or just CLIPVisionModel?

If the latter was used, how did you extract the intermediate features? Do you mean the last hidden state output, which is 257 tokens, or something else?

You also mentioned that you do predictions solely along the frame dimension, and I didn't quite get that part: do you mean that you predict N×F tokens (and use 2D positional encoding to encode them), or do you concatenate the N tokens with the embeddings in some way and make predictions only along the F dimension?

The last hidden state of CLIPVisionModel is 257×1280, but in the paper you mention that the hidden_dim of the transformer is 1024. I get the part where the 1024-dim tokens are mapped to 768 just before the cross-attention layer, but I don't quite understand what the input shape to the transformer should be.

If you used CLIPVisionModelWithProjection instead, what did you mean by the CLIP intermediate features?
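For clarity, these are the two variants I am asking about (the checkpoint name is just my guess):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel, CLIPVisionModelWithProjection

name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"   # assumed checkpoint
processor = CLIPImageProcessor.from_pretrained(name)
pixel_values = processor(images=Image.new("RGB", (224, 224)), return_tensors="pt").pixel_values

with torch.no_grad():
    # Option 1: per-patch hidden states, shape [1, 257, 1280] for CLIP-H/14.
    hidden = CLIPVisionModel.from_pretrained(name)(pixel_values).last_hidden_state
    # Option 2: a single projected embedding, shape [1, 1024] for this checkpoint.
    embeds = CLIPVisionModelWithProjection.from_pretrained(name)(pixel_values).image_embeds
```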

Thank you so much!