yashkant opened 1 month ago
I agree with that. Since positional info is already encoded by the pretrained text encoders, introducing a different positional embedding might confuse the learning, I guess? What confuses me is why dim=3 is used for the txt_ids and img_ids. dim=2 is already enough for image token sequences. Is it for ease of extending to the video or 3D domain?
I think dim=3 is originally meant for (txt, img_h, img_w). Given that the txt_ids are all 0s (if that is indeed to avoid duplicate positional embeddings), the first dim may be reserved to encode other axes, such as time t or a third spatial dim.
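To make the layout concrete, here is a minimal sketch of how the two id tensors in `sampling.py` are constructed. The shapes (`bs`, `h`, `w`, `seq_len`) are placeholder values for illustration; the key point is that each token carries a 3-dim id where image tokens fill dims 1 and 2 with their (row, col) meshgrid position and text tokens are all zeros:

```python
import torch

# Placeholder sizes for illustration: batch, latent height/width
# (after 2x2 patchification), and number of text tokens.
bs, h, w, seq_len = 1, 32, 32, 256

# Image ids: 3 values per token -> (unused axis 0, row, col).
img_ids = torch.zeros(h, w, 3)
img_ids[..., 1] = img_ids[..., 1] + torch.arange(h)[:, None]  # row index
img_ids[..., 2] = img_ids[..., 2] + torch.arange(w)[None, :]  # col index
img_ids = img_ids.reshape(1, h * w, 3).repeat(bs, 1, 1)

# Text ids: all zeros, leaving the first dim (and the other two)
# free for future axes such as time or a third spatial dimension.
txt_ids = torch.zeros(bs, seq_len, 3)
```

Note that dim 0 is zero for both text and image tokens here, which is what suggests it is reserved rather than actively used.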
hi authors, thanks a lot for releasing the code!
i noticed that the position ids for text embeddings are set to zeros at this line: https://github.com/black-forest-labs/flux/blob/c23ae247225daba30fbd56058d247cc1b1fc20a3/src/flux/sampling.py#L51
i am wondering why this is the case, especially compared to the image position ids being a 2D meshgrid here: https://github.com/black-forest-labs/flux/blob/c23ae247225daba30fbd56058d247cc1b1fc20a3/src/flux/sampling.py#L42
is it because the t5 model already adds positional encoding to the text, perhaps?
thanks!
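One way to see why zero ids are harmless for the text tokens: with rotary embeddings, a position id of 0 gives rotation angles of 0, i.e. the identity rotation, so the text features pass through unchanged and only T5's own positional information remains. This is a hypothetical minimal RoPE frequency sketch (the `rope` helper and its dims are illustrative, not the FLUX implementation):

```python
import torch

# Illustrative RoPE angle computation: position id times per-channel
# inverse frequencies. At pos=0 every angle is 0, so cos=1 and sin=0.
def rope(pos, dim=8, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos * freqs
    return torch.cos(angles), torch.sin(angles)

cos0, sin0 = rope(torch.tensor(0.0))
# cos0 is all ones and sin0 all zeros: an identity rotation,
# so zero txt_ids leave the text embeddings untouched.
```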