TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Apache License 2.0

Is it possible to also align magvit2 with a text encoder, as what you do in seed? #8

Closed StarCycle closed 1 week ago

StarCycle commented 1 week ago

Hi @yxgeee @RobertLuo1 @ShiFengyuan1999 ,

Suppose we use MAGVIT2 for image understanding, i.e., in an architecture like LLaVA. It would be easier for the LLM to understand the encoded embeddings if they were aligned with a text encoder, as you do in SEED:

[image]

And LaVIT:

[image]

Theoretically, is it possible? Would you like to do so?

Best, StarCycle

eisneim commented 1 week ago

@StarCycle I'm trying to do the same thing: integrating MAGVIT2 with a multimodal LLM like LLaVA. Here is my understanding:

StarCycle commented 1 week ago

Hi @eisneim,

It's of course possible, but it usually takes significant computing resources. You need to increase the vocab size and expand the embedding matrix of the LLM. The larger the codebook, the more effort it takes to learn the new image embeddings.

Take Chameleon as an example: its codebook size is 8192, so 8192 new embeddings are added to the embedding matrix. To learn these embeddings, Meta spent 856481 A100 hours training Chameleon 7B. Even after doing this, Chameleon is still not the most powerful MLLM (I haven't checked their paper in detail; is Chameleon 34B better than LLaVA 1.6 34B? Please let me know if I am wrong...). In the case of MAGVIT2, the codebook size is 262k, which is much more difficult to learn.
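
For concreteness, here is a minimal sketch of what that vocabulary expansion looks like with a HuggingFace-style causal LM (the checkpoint name and codebook size below are placeholders for illustration, not Chameleon's actual training setup):

```python
# Sketch of expanding an LLM's vocabulary with visual codebook tokens.
# The base model and codebook size are placeholders for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base LLM
codebook_size = 8192                     # e.g. Chameleon; Open-MAGVIT2 would need 2**18 = 262144

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new text token per visual code, e.g. "<img_0>" ... "<img_8191>".
image_tokens = [f"<img_{i}>" for i in range(codebook_size)]
tokenizer.add_tokens(image_tokens, special_tokens=True)

# Grow the embedding matrix (and tied LM head) to the new vocab size.
# These new rows start randomly initialized and have to be learned from
# scratch, which is where the huge training cost comes from.
model.resize_token_embeddings(len(tokenizer))
```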

Why? If we use a continuous encoder in an MLLM, similar images should have similar embeddings. But if we use a discrete encoder and only take the ids of the patch tokens (i.e., an image may be represented by 16*16 ids), will similar images have similar ids? I am afraid we would need to learn such similarity from scratch. @yxgeee Please let me know if I am wrong here...
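
To make this concern concrete, here is a toy sketch assuming MAGVIT2-style lookup-free quantization, where the token id is just the sign pattern of the 18-dimensional feature (the feature values are random, purely for illustration):

```python
import torch

# Toy illustration: with lookup-free quantization, each of the 18 channels
# is binarized by its sign and the bit pattern is the token id.
def lfq_id(feat: torch.Tensor) -> int:
    """feat: (18,) continuous feature at one spatial position."""
    bits = (feat > 0).long()                   # 18 sign bits
    weights = 2 ** torch.arange(feat.numel())  # interpret the bit pattern as an integer id
    return int((bits * weights).sum())

torch.manual_seed(0)
f1 = torch.randn(18) * 0.1        # a feature with many channels close to zero
f2 = f1 + torch.randn(18) * 0.05  # a very similar feature (small perturbation)

print(torch.cosine_similarity(f1, f2, dim=0))  # close to 1: continuous embeddings are similar
print(lfq_id(f1), lfq_id(f2))                  # ids can differ a lot: sign flips change the bits
```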

So is there a shortcut to multimodal image understanding? In their code, an image is first encoded into a continuous tensor of shape [18, h, w], and then quantized into a binary tensor of shape [18, h, w]. We could first try feeding the continuous tensor to the LLM, as LaVIT does for image understanding. There are then two ways to tokenize it:

  • Split the continuous tensor into h*w tokens, each with a dimension of 18.
  • Split the continuous tensor into 18 tokens, each with a dimension of h*w.

To achieve a higher tokenizer compression ratio, how about taking the second approach? (A rough sketch of both options follows below.)
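
Here is a rough PyTorch sketch of both options, assuming an encoder output of shape [18, h, w] with h = w = 16 and a placeholder LLM hidden size:

```python
import torch

# Sketch of the two tokenization options, assuming the encoder output
# is a continuous tensor of shape [18, h, w] (here h = w = 16).
z = torch.randn(18, 16, 16)

# Option 1: h*w tokens, each of dimension 18 (one token per spatial position).
tokens_v1 = z.flatten(1).transpose(0, 1)   # -> [256, 18]

# Option 2: 18 tokens, each of dimension h*w (one token per channel),
# which gives a much shorter sequence for the LLM.
tokens_v2 = z.flatten(1)                   # -> [18, 256]

# Either way, a learnable projection maps the token dimension to the
# LLM's hidden size (hidden_size is a placeholder here).
hidden_size = 4096
proj_v1 = torch.nn.Linear(18, hidden_size)
proj_v2 = torch.nn.Linear(16 * 16, hidden_size)
print(proj_v1(tokens_v1).shape, proj_v2(tokens_v2).shape)  # [256, 4096], [18, 4096]
```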

Please add me on wechat (id: StarRingSpace) if you want to further discuss it ^_^

Best, StarCycle

yxgeee commented 1 week ago

Hi @StarCycle, thank you for your interest in our SEED. If the tokenizer is already aligned with text, it can indeed simplify the subsequent alignment with the LLM. Theoretically, it is feasible to align with text while training the tokenizer; however, you need to carefully tune the hyper-parameters to make sure each loss converges as expected.
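
For illustration, the training objective would look roughly like the sketch below: the usual reconstruction and quantization losses plus a contrastive image-text alignment term. The specific loss terms and weights are placeholders, not the SEED or Open-MAGVIT2 recipe.

```python
import torch
import torch.nn.functional as F

# Rough sketch of a multi-objective tokenizer training step: reconstruction
# and quantization losses as usual, plus a CLIP-style contrastive alignment
# term against text-encoder embeddings. Weights are illustrative placeholders.
def tokenizer_loss(recon, target, quant_loss, img_emb, txt_emb,
                   w_recon=1.0, w_quant=0.25, w_align=0.1, temperature=0.07):
    recon_loss = F.l1_loss(recon, target)

    # Symmetric contrastive loss between pooled image features and the
    # text embeddings of the paired captions.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    align_loss = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

    # Each term needs careful weighting so that none of them stalls
    # or dominates during training.
    return w_recon * recon_loss + w_quant * quant_loss + w_align * align_loss
```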

We don't plan to do this at the moment, as we aim to further advance the development of native multimodal models, beginning with autoregressive visual generation. Once the feasibility of native multimodal models is established, aligning with text might no longer be necessary.

StarCycle commented 1 week ago

I see, thanks!

yxgeee commented 1 week ago

Hi @StarCycle ,

"will similar images have similar ids? I am afraid we need to learn such similarity from scratch."

"So is there a shortcut to multimodal image understanding?"