baaivision / Emu3

Next-Token Prediction is All You Need

No QK Norm? How does it compare to Chameleon? #19

Open DEBIHOOD opened 1 week ago

DEBIHOOD commented 1 week ago

Hi, thanks for your brilliant work, the release of the paper, the weights (as far as I understood, there's more to be released!), and the code. I'm very thrilled by your achievements in the omni-modal field; it really starts to feel like the future with open-source releases like this.

Your approach is pretty similar to Chameleon's, but throughout their paper they mention that Chameleon suffered from training instability because of the presence of image tokens. They had to use QK Norm to prevent it from collapsing. Looking through your paper, I can't find any mention of QK Norm. Does that mean your approach doesn't use QK Norm? Did training suffer from instability the way Chameleon's configuration without QK Norm did? If not, why?
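For anyone else reading: QK Norm just means normalizing the query and key vectors before the attention dot product, which bounds the attention logits and keeps the softmax from saturating as activations grow during training. A minimal illustrative sketch in PyTorch (my own toy module, not Emu3's or Chameleon's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with QK Norm: LayerNorm is applied to the
    queries and keys per head before the dot product, so attention logits
    stay bounded even if residual-stream activations grow."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # The QK-Norm layers: normalize each head's q and k vectors.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # <- the QK-Norm step
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, C))
```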

In general, I don't fully understand how it differs from Chameleon while still being better. Meta published a few follow-up papers exploring how to build on Chameleon's ideas without its drawbacks (instability, poor performance compared to specialized single-modal models), such as incorporating diffusion into transformers (Transfusion, which you mentioned in the paper) and a mixture-of-experts with specialized experts for each modality (MoMa). And then your paper comes out basically saying "the Chameleon-like approach is fine, and in fact it's better than the others".
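For concreteness, here is my (possibly wrong) reading of MoMa's modality-aware routing as a toy PyTorch sketch; the class and names are made up for illustration, not Meta's code:

```python
import torch
import torch.nn as nn

class ModalitySplitFFN(nn.Module):
    """Toy version of modality-aware expert routing: text tokens are only
    processed by the text FFN expert, image tokens only by the image one."""

    def __init__(self, dim: int):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.image_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_image: (batch, seq) bool mask of image positions
        out = torch.empty_like(x)
        out[~is_image] = self.text_expert(x[~is_image])
        out[is_image] = self.image_expert(x[is_image])
        return out
```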

I'm probably missing something. It would be awesome if you could explain it to me. Thank you for what you're doing.

thaoshibe commented 4 days ago

(This is my biased opinion, so please correct me if I'm wrong.) Emu3 is indeed in the Chameleon family, but Emu3 was trained with a better vision encoder/decoder (e.g., SBER-MoVQGAN), a larger-scale dataset (e.g., including videos), and followed up with post-training (e.g., DPO).
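Conceptually it's the same recipe either way: quantize images into discrete tokens and train one autoregressive transformer over the mixed stream with plain next-token prediction. A rough sketch of that objective, where `vq_tokenizer` and `lm` are hypothetical placeholders (an MoVQGAN-style discrete tokenizer and a decoder-only LM that returns logits), not Emu3's real API:

```python
import torch
import torch.nn.functional as F

def multimodal_nll(lm, vq_tokenizer, text_ids, image,
                   text_vocab_size, boi_id, eoi_id):
    # 1) Quantize the image into a flat sequence of codebook indices, offset
    #    past the text vocabulary so the two ID spaces don't collide.
    image_ids = vq_tokenizer.encode(image).flatten() + text_vocab_size
    # 2) Interleave into one stream: <text> <boi> <image tokens> <eoi>.
    seq = torch.cat([text_ids,
                     text_ids.new_tensor([boi_id]),
                     image_ids,
                     text_ids.new_tensor([eoi_id])])
    # 3) Ordinary next-token prediction over the whole sequence: one
    #    cross-entropy loss covers both modalities.
    logits = lm(seq[:-1].unsqueeze(0))[0]  # (T-1, vocab)
    return F.cross_entropy(logits, seq[1:])
```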

I am not trying to defend Emu3 (I'm not one of the authors), but while I agree that Emu3 is indeed "another Chameleon", I don't quite agree that "your paper comes out basically saying 'the Chameleon-like approach is fine, and in fact it's better than the others'".

Chameleon hasn't even released its image-generation capability. A public "Chameleon" like Emu3 is super helpful for follow-up research on multimodal models ^^.