bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

Bug in Flux with Masked Attention #904

Closed: zhuole1025 closed 2 weeks ago

zhuole1025 commented 3 weeks ago

Hi! While reading your code, I found that the masked attention concatenates mask = [image_mask, text_mask]. However, the order is reversed for the attention computation, e.g., q = [q_text, q_image]. I am not sure whether this causes a bug. https://github.com/bghira/SimpleTuner/blob/cea2457ab063f6dedb9e697830ae68a96be90641/helpers/models/flux/transformer.py#L314
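
For illustration, here is a minimal sketch of the mismatch being described (the tensor names and sizes are made up for the example, not taken from the SimpleTuner code): because the q/k/v streams are concatenated text-first, a mask built image-first lines up with the wrong token positions.

```python
import torch

# Hypothetical sequence lengths, for illustration only.
txt_len, img_len = 2, 4

# True = attend, False = masked out; suppose the last text token is padding.
text_mask = torch.tensor([True, False])
image_mask = torch.ones(img_len, dtype=torch.bool)

# The attention inputs are concatenated text-first, e.g. q = cat([q_text, q_image]),
# so positions 0..1 are text tokens and positions 2..5 are image tokens.
wrong = torch.cat([image_mask, text_mask])  # [T, T, T, T, T, F] -> masks an *image* token
right = torch.cat([text_mask, image_mask])  # [T, F, T, T, T, T] -> masks the padded text token
```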

bghira commented 3 weeks ago

@AmericanPresidentJimmyCarter

AmericanPresidentJimmyCarter commented 2 weeks ago

Fixed in #908 . I didn't notice that the location of the text tokens had been swapped relative to sd3/auraflow, and it wasn't obvious in diffusers because they just call the image and text streams "hidden_states" and "encoder_hidden_states" respectively.
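
A sketch of why the generic naming made the swap easy to miss (the concatenation orders below follow the description in this comment; treat them as an assumption rather than verified diffusers source):

```python
import torch

def sd3_style_concat(hidden_states, encoder_hidden_states):
    # sd3/auraflow ordering: image stream first, then text.
    return torch.cat([hidden_states, encoder_hidden_states], dim=1)

def flux_style_concat(hidden_states, encoder_hidden_states):
    # flux ordering: text stream first, then image. Both functions take the
    # same two arguments with the same names, so nothing in the signature
    # signals that the token layout has changed between architectures.
    return torch.cat([encoder_hidden_states, hidden_states], dim=1)
```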