johnmullan opened this issue 1 year ago
Diffusers initialises its Transformer2D blocks in terms of the number of attention heads:

```python
num_attention_heads,
out_channels // num_attention_heads,
in_channels=out_channels,
```
This repo uses the number of channels per head:

```python
in_channels // attn_num_head_channels,
attn_num_head_channels,
in_channels=in_channels,
```
Given that in_channels == out_channels, these two are identical.
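For reference, here's a quick sanity check (a minimal sketch with hypothetical numbers, not taken from either repo) showing that the two call patterns collapse to the same `(num_attention_heads, attention_head_dim)` pair when `in_channels == out_channels`:

```python
# Hypothetical config values for illustration only; neither repo pins these here.
in_channels = out_channels = 320   # assumed block width
num_attention_heads = 8            # diffusers-style config value
attn_num_head_channels = 40        # this repo's config value (320 // 8)

# diffusers: Transformer2DModel(num_attention_heads, out_channels // num_attention_heads, ...)
diffusers_args = (num_attention_heads, out_channels // num_attention_heads)

# this repo: Transformer2DModel(in_channels // attn_num_head_channels, attn_num_head_channels, ...)
repo_args = (in_channels // attn_num_head_channels, attn_num_head_channels)

assert diffusers_args == repo_args == (8, 40)
# Either way, the transformer's inner dimension is heads * head_dim == in_channels.
assert diffusers_args[0] * diffusers_args[1] == in_channels
```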
More of a question really, but do you know why the `num_attention_heads` and `attention_head_dim` arguments are derived the opposite way round when initialising Transformer2D blocks?
https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/79e13d17167f66f424a8acad88e83fc76d6d210d/models/unet_3d_blocks.py#L286C17-L286C35
It is the opposite in diffusers' `unet_2d_blocks.py`: https://github.com/huggingface/diffusers/blob/5439e917cacc885c0ac39dda1b8af12258e6e16d/src/diffusers/models/unet_2d_blocks.py#L872