No QK Norm? How it compares to Chameleon?

Hi, thanks for your brilliant work, release of the paper, weights(as far as i understood, there's more to be released!), and code. I'm very thrilled by your achievements in omni-modal field, it really starts to feel like future with open-source releases like that.

Your approach is pretty similar to Chameleon, but throughout their paper, they mention that Chameleon suffered from instability in training because of presence of image tokens. They had to use QK Norm to prevent it from collapsing. Looking through your paper i can't really find any mentions of QK norm. Does that mean that your approach doesn't uses QK Norm? Did it suffered from instabillity like Chameleon's config without QK Norm did? If not, why?

In general, i don't fully understand how it differs from Chameleon, and still being better than it. Meta done a few follow-up papers where they tried to explore how they could build up on Chameleon's ideas without it's drawbacks (instability, poor performance compared to specialized single-modal models), like incorporating diffusion into transformers (transfusion, you mentioned that in the paper), and other one where they tried mixture-of-experts, with specialized experts for each modality (MoMa). And then your paper comes out basically saying "Chameleon-like approach is fine, and in fact it's better than others".

I probably miss something. This would be awesome if you could explain it to me. Thank you for what you're doing.

baaivision / Emu3

No QK Norm? How it compares to Chameleon? #19