ExponentialML / Text-To-Video-Finetuning

Finetune ModelScope's Text To Video model using Diffusers 🧨
MIT License

Transformer2D initializing #82

Open johnmullan opened 1 year ago

johnmullan commented 1 year ago

More of a question really, but do you know why `num_attention_heads` and `attention_head_dim` are swapped relative to diffusers when initialising the Transformer2D blocks?

https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/79e13d17167f66f424a8acad88e83fc76d6d210d/models/unet_3d_blocks.py#L286C17-L286C35

The argument order is the reverse of what `unet_2d_blocks.py` in diffusers uses: https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872

JCBrouwer commented 1 year ago

Diffusers defines it in terms of the number of attention heads:

```python
num_attention_heads,
out_channels // num_attention_heads,
in_channels=out_channels,
```

This repo uses the number of channels per head:

```python
in_channels // attn_num_head_channels,
attn_num_head_channels,
in_channels=in_channels,
```

Given that `in_channels == out_channels` for these blocks, the two parameterisations are identical.
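To make the equivalence concrete, here is a small sketch with hypothetical numbers (a channel count of 320 and 64 channels per head are assumed for illustration; they are not taken from the linked code):

```python
# Assumed example values, not from the linked source:
channels = 320        # in_channels == out_channels for these blocks
head_channels = 64    # channels per attention head (attn_num_head_channels)

# Diffusers style: pass num_attention_heads first, head dim second.
num_attention_heads = channels // head_channels
diffusers_args = (num_attention_heads, channels // num_attention_heads)

# This repo's style: derive the head count from channels-per-head.
repo_args = (channels // head_channels, head_channels)

# Both produce the same (num_heads, head_dim) pair: (5, 64).
assert diffusers_args == repo_args
print(diffusers_args)
```

Since `channels // (channels // head_channels) == head_channels` whenever `head_channels` divides `channels` evenly, the two conventions construct identical attention blocks.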