yashkant opened 1 month ago
I agree with that. Since positional info is already encoded by the pretrained text encoders, introducing a different positional embedding might confuse the learning, I guess? What confuses me is why dim=3 is used for the txt_ids and img_ids. dim=2 is already enough for image token sequences. Is it for ease of extending to the video or 3D domain?
I think dim=3 is originally meant for (txt, img_h, img_w). Given that the txt_ids are all 0s (if that is indeed to avoid duplicate positional embeddings), the first dim may be reserved to encode other axes, such as time t or a third spatial dim.
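To make the layout concrete, here is a minimal sketch of how the two id tensors in `sampling.py` are constructed. The shapes (`bs`, `h`, `w`, `seq_len`) are placeholder values for illustration; the key point is that each token carries a 3-dim id where image tokens fill dims 1 and 2 with their (row, col) meshgrid position and text tokens are all zeros:

```python
import torch

# Placeholder sizes for illustration: batch, latent height/width
# (after 2x2 patchification), and number of text tokens.
bs, h, w, seq_len = 1, 32, 32, 256

# Image ids: 3 values per token -> (unused axis 0, row, col).
img_ids = torch.zeros(h, w, 3)
img_ids[..., 1] = img_ids[..., 1] + torch.arange(h)[:, None]  # row index
img_ids[..., 2] = img_ids[..., 2] + torch.arange(w)[None, :]  # col index
img_ids = img_ids.reshape(1, h * w, 3).repeat(bs, 1, 1)

# Text ids: all zeros, leaving the first dim (and the other two)
# free for future axes such as time or a third spatial dimension.
txt_ids = torch.zeros(bs, seq_len, 3)
```

Note that dim 0 is zero for both text and image tokens here, which is what suggests it is reserved rather than actively used.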
hi authors, thanks a lot for releasing the code!
i noticed that the position ids for text embeddings are set to zeros at this line: https://github.com/black-forest-labs/flux/blob/c23ae247225daba30fbd56058d247cc1b1fc20a3/src/flux/sampling.py#L51
i am wondering why this is the case, especially compared to the image position ids being a 2D meshgrid here: https://github.com/black-forest-labs/flux/blob/c23ae247225daba30fbd56058d247cc1b1fc20a3/src/flux/sampling.py#L42
is it because the t5 model already adds positional encoding to the text, perhaps?
thanks!
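One way to see why zero ids are harmless for the text tokens: with rotary embeddings, a position id of 0 gives rotation angles of 0, i.e. the identity rotation, so the text features pass through unchanged and only T5's own positional information remains. This is a hypothetical minimal RoPE frequency sketch (the `rope` helper and its dims are illustrative, not the FLUX implementation):

```python
import torch

# Illustrative RoPE angle computation: position id times per-channel
# inverse frequencies. At pos=0 every angle is 0, so cos=1 and sin=0.
def rope(pos, dim=8, theta=10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos * freqs
    return torch.cos(angles), torch.sin(angles)

cos0, sin0 = rope(torch.tensor(0.0))
# cos0 is all ones and sin0 all zeros: an identity rotation,
# so zero txt_ids leave the text embeddings untouched.
```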