Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.
Apache License 2.0

Is there any bug in text2video generation mode? #4

Closed howardgriffin closed 8 months ago

howardgriffin commented 9 months ago

When using `args.extras=78` (i.e., text2video generation mode), I noticed that this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/train.py#L221 uses the pooled text embeddings (`[batch, 768]`) instead of the full text embeddings (`[batch, 77, 768]`), which is incompatible with this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/models/latte.py#L241

As a result, I got this error: `RuntimeError: mat1 and mat2 shapes cannot be multiplied (5x768 and 59136x1152)`
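For reference, the mismatch can be reproduced in isolation. This is a minimal sketch, assuming a caption projection that flattens the token-level embeddings (the `proj` module below is a stand-in for illustration, not Latte's actual embedder); the shapes come from the error message above (`59136 = 77 * 768`):

```python
import torch
import torch.nn as nn

batch = 5
seq_len, embed_dim, hidden = 77, 768, 1152

# A projection expecting the flattened full text embeddings,
# so in_features = 77 * 768 = 59136 (hypothetical stand-in module).
proj = nn.Linear(seq_len * embed_dim, hidden)

pooled = torch.randn(batch, embed_dim)         # what train.py passes: [5, 768]
full = torch.randn(batch, seq_len, embed_dim)  # what latte.py expects: [5, 77, 768]

# Passing the pooled embeddings reproduces the reported error.
err = None
try:
    proj(pooled)
except RuntimeError as e:
    err = str(e)  # "mat1 and mat2 shapes cannot be multiplied (5x768 and 59136x1152)"
print(err)

# Passing the full embeddings (flattened) matches the layer's input size.
out = proj(full.flatten(1))
print(out.shape)  # torch.Size([5, 1152])
```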

howardgriffin commented 9 months ago

Should I use the full text embeddings instead of the pooled text embeddings?

maxin-cn commented 9 months ago

Should I use the full text embeddings instead of the pooled text embeddings?

The text-to-video code in this repository is inconsistent with the text-to-video method described in the paper. Please wait two or three days; I will update the text-to-video code.

howardgriffin commented 9 months ago

Waiting for your great work!

maxin-cn commented 8 months ago

I have updated the text-to-video sampling code and its checkpoint. You can run `bash sample/t2v.sh` to generate videos. It is hard to provide T2V training code due to data storage constraints, but I think it is easy to modify the T2V sampling code into its training counterpart for your own dataset.