Closed · zhuole1025 closed this 2 weeks ago
@AmericanPresidentJimmyCarter
Fixed in #908 . I didn't notice that the position of the text tokens had been swapped relative to SD3/AuraFlow, and it wasn't obvious in diffusers because they just call the image and text streams "hidden_states" and "encoder_hidden_states" respectively.
Hi! While reading your code, I found that the attention mask is concatenated as mask = [image_mask, text_mask]. However, the order is reversed for the attention computation, e.g., q = [q_text, q_image]. I am not sure whether this will cause bugs. https://github.com/bghira/SimpleTuner/blob/cea2457ab063f6dedb9e697830ae68a96be90641/helpers/models/flux/transformer.py#L314
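A minimal sketch of the mismatch being reported, using hypothetical tensor sizes rather than the actual Flux transformer code: if the joint q/k/v sequence is ordered [text, image] but the mask is concatenated as [image_mask, text_mask], the mask entries no longer line up with the tokens they are meant to cover.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration only, not the real Flux config.
text_len, image_len, dim = 4, 8, 16
seq_len = text_len + image_len

# Joint sequence ordered [text, image], mirroring q = cat([q_text, q_image]).
q = torch.randn(1, 1, seq_len, dim)
k = torch.randn(1, 1, seq_len, dim)
v = torch.randn(1, 1, seq_len, dim)

# Suppose the last two text tokens are padding and should be masked out.
text_mask = torch.tensor([True, True, False, False])
image_mask = torch.ones(image_len, dtype=torch.bool)

# Buggy ordering: positions 0..7 of the mask describe image tokens,
# but positions 0..3 of q/k/v are text tokens.
wrong_mask = torch.cat([image_mask, text_mask])

# Correct ordering matches the token concatenation order of q/k/v.
right_mask = torch.cat([text_mask, image_mask])

# Boolean attn_mask: True means "attend". With the correct mask, the
# padded text keys are excluded for every query position.
out = F.scaled_dot_product_attention(
    q, k, v, attn_mask=right_mask[None, None, None, :]
)
```

With the wrong ordering, the attention would silently mask two valid image tokens while attending to the padded text tokens, which is exactly the kind of bug that doesn't crash but degrades training.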