ExponentialML / Text-To-Video-Finetuning

Finetune ModelScope's Text To Video model using Diffusers 🧨
MIT License

Transformer2D initializing #82

Open johnmullan opened 1 year ago

johnmullan commented 1 year ago

More of a question really, but do you know why `num_attention_heads` and `attention_head_dim` are swapped relative to diffusers when initialising the Transformer2D blocks?

https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/79e13d17167f66f424a8acad88e83fc76d6d210d/models/unet_3d_blocks.py#L286C17-L286C35

The argument order is the reverse of what `unet_2d_blocks.py` in diffusers uses: https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872

JCBrouwer commented 1 year ago

Diffusers defines it in terms of the number of attention heads:

```python
num_attention_heads,
out_channels // num_attention_heads,
in_channels=out_channels,
```

This repo uses the number of channels per head:

```python
in_channels // attn_num_head_channels,
attn_num_head_channels,
in_channels=in_channels,
```

Given that `in_channels == out_channels` for these blocks, the two parameterisations are identical.
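To make the equivalence concrete, here is a small sketch with hypothetical numbers (a channel count of 320 and 64 channels per head are assumed for illustration; they are not taken from the linked code):

```python
# Assumed example values, not from the linked source:
channels = 320        # in_channels == out_channels for these blocks
head_channels = 64    # channels per attention head (attn_num_head_channels)

# Diffusers style: pass num_attention_heads first, head dim second.
num_attention_heads = channels // head_channels
diffusers_args = (num_attention_heads, channels // num_attention_heads)

# This repo's style: derive the head count from channels-per-head.
repo_args = (channels // head_channels, head_channels)

# Both produce the same (num_heads, head_dim) pair: (5, 64).
assert diffusers_args == repo_args
print(diffusers_args)
```

Since `channels // (channels // head_channels) == head_channels` whenever `head_channels` divides `channels` evenly, the two conventions construct identical attention blocks.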