TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Apache License 2.0

Is it possible to also align magvit2 with a text encoder, as what you do in seed? #8

Closed StarCycle closed 1 week ago

StarCycle commented 1 week ago

Hi @yxgeee @RobertLuo1 @ShiFengyuan1999 ,

Suppose we use MAGVIT2 for image understanding, i.e., in an architecture like LLaVA. It would be easier for the LLM to understand the encoded embeddings if they were aligned with a text encoder, as you do in SEED:

[image]

And LaVIT:

[image]

Theoretically, is it possible? Would you like to do so?

Best, StarCycle

eisneim commented 1 week ago

@StarCycle I'm trying to do the same thing: integrating MAGVIT2 with a multimodal LLM like LLaVA. Here is my understanding:

StarCycle commented 1 week ago

Hi @eisneim,

It's of course possible, but it usually takes significant computing resources. You need to increase the vocab size and expand the embedding matrix of the LLM. The larger the codebook, the more effort it takes to learn the new image embeddings.

Take Chameleon as an example: its codebook size is 8192, so 8192 new embeddings are added to the embedding matrix. To learn these embeddings, Meta spent 856481 A100 hours training Chameleon 7B. Even after doing this, Chameleon is still not the most powerful MLLM (I haven't checked their paper in detail; is Chameleon 34B better than LLaVA 1.6 34B? Please let me know if I am wrong...). In the case of MAGVIT2, the codebook size is 262k, which is much more difficult to learn.
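
For concreteness, here is a minimal sketch of what that vocabulary expansion looks like with a HuggingFace-style causal LM (the checkpoint name and codebook size below are placeholders for illustration, not Chameleon's actual training setup):

```python
# Sketch of expanding an LLM's vocabulary with visual codebook tokens.
# The base model and codebook size are placeholders for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base LLM
codebook_size = 8192                     # e.g. Chameleon; Open-MAGVIT2 would need 2**18 = 262144

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new text token per visual code, e.g. "<img_0>" ... "<img_8191>".
image_tokens = [f"<img_{i}>" for i in range(codebook_size)]
tokenizer.add_tokens(image_tokens, special_tokens=True)

# Grow the embedding matrix (and tied LM head) to the new vocab size.
# These new rows start randomly initialized and have to be learned from
# scratch, which is where the huge training cost comes from.
model.resize_token_embeddings(len(tokenizer))
```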

Why? If we use a continuous encoder in an MLLM, similar images should have similar embeddings. But if we use a discrete encoder and only take the ids of the patch tokens (i.e., an image may be represented by 16*16 ids), will similar images have similar ids? I am afraid we would need to learn such similarity from scratch. @yxgeee Please let me know if I am wrong here...
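
To make this concern concrete, here is a toy sketch assuming MAGVIT2-style lookup-free quantization, where the token id is just the sign pattern of the 18-dimensional feature (the feature values are random, purely for illustration):

```python
import torch

# Toy illustration: with lookup-free quantization, each of the 18 channels
# is binarized by its sign and the bit pattern is the token id.
def lfq_id(feat: torch.Tensor) -> int:
    """feat: (18,) continuous feature at one spatial position."""
    bits = (feat > 0).long()                   # 18 sign bits
    weights = 2 ** torch.arange(feat.numel())  # interpret the bit pattern as an integer id
    return int((bits * weights).sum())

torch.manual_seed(0)
f1 = torch.randn(18) * 0.1        # a feature with many channels close to zero
f2 = f1 + torch.randn(18) * 0.05  # a very similar feature (small perturbation)

print(torch.cosine_similarity(f1, f2, dim=0))  # close to 1: continuous embeddings are similar
print(lfq_id(f1), lfq_id(f2))                  # ids can differ a lot: sign flips change the bits
```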

So is there a shortcut to multimodal image understanding? In their code, an image is first encoded into a continuous tensor of shape [18, h, w], and then quantized into a binary tensor of shape [18, h, w]. We could first try feeding the continuous tensor to the LLM, as LaVIT does for image understanding. There are then two ways to tokenize it:

  • Split the continuous tensor into h*w tokens, each with a dimension of 18.
  • Split the continuous tensor into 18 tokens, each with a dimension of h*w.

To achieve a higher tokenizer compression ratio, how about taking the second approach? (A rough sketch of both options follows below.)
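
Here is a rough PyTorch sketch of both options, assuming an encoder output of shape [18, h, w] with h = w = 16 and a placeholder LLM hidden size:

```python
import torch

# Sketch of the two tokenization options, assuming the encoder output
# is a continuous tensor of shape [18, h, w] (here h = w = 16).
z = torch.randn(18, 16, 16)

# Option 1: h*w tokens, each of dimension 18 (one token per spatial position).
tokens_v1 = z.flatten(1).transpose(0, 1)   # -> [256, 18]

# Option 2: 18 tokens, each of dimension h*w (one token per channel),
# which gives a much shorter sequence for the LLM.
tokens_v2 = z.flatten(1)                   # -> [18, 256]

# Either way, a learnable projection maps the token dimension to the
# LLM's hidden size (hidden_size is a placeholder here).
hidden_size = 4096
proj_v1 = torch.nn.Linear(18, hidden_size)
proj_v2 = torch.nn.Linear(16 * 16, hidden_size)
print(proj_v1(tokens_v1).shape, proj_v2(tokens_v2).shape)  # [256, 4096], [18, 4096]
```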

Please add me on wechat (id: StarRingSpace) if you want to further discuss it ^_^

Best, StarCycle

yxgeee commented 1 week ago

Hi @StarCycle, thank you for your interest in our SEED. If the tokenizer is already aligned with text, it can indeed simplify the subsequent alignment with the LLM. Theoretically, it is feasible to align with text while training the tokenizer; however, you need to carefully tune the hyper-parameters to make sure each loss converges as expected.
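
For illustration, the training objective would look roughly like the sketch below: the usual reconstruction and quantization losses plus a contrastive image-text alignment term. The specific loss terms and weights are placeholders, not the SEED or Open-MAGVIT2 recipe.

```python
import torch
import torch.nn.functional as F

# Rough sketch of a multi-objective tokenizer training step: reconstruction
# and quantization losses as usual, plus a CLIP-style contrastive alignment
# term against text-encoder embeddings. Weights are illustrative placeholders.
def tokenizer_loss(recon, target, quant_loss, img_emb, txt_emb,
                   w_recon=1.0, w_quant=0.25, w_align=0.1, temperature=0.07):
    recon_loss = F.l1_loss(recon, target)

    # Symmetric contrastive loss between pooled image features and the
    # text embeddings of the paired captions.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    align_loss = 0.5 * (F.cross_entropy(logits, labels) +
                        F.cross_entropy(logits.t(), labels))

    # Each term needs careful weighting so that none of them stalls
    # or dominates during training.
    return w_recon * recon_loss + w_quant * quant_loss + w_align * align_loss
```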

We don't plan to do this at the moment, as we aim to further advance the development of native multimodal models, beginning with autoregressive visual generation. Once the feasibility of native multimodal models is established, aligning with text might no longer be necessary.

StarCycle commented 1 week ago

I see, thanks!

yxgeee commented 1 week ago

Hi @StarCycle ,

"will similar images have similar ids? I am afraid we need to learn such similarity from scratch."

"So is there a shortcut to multimodal image understanding?"