TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation

Will you consider using web images to train a more robust version? #2

Closed MonolithFoundation closed 3 months ago

MonolithFoundation commented 3 months ago

From the original MAGVIT2 paper, they got consistently good results with larger web-image training data.

ShiFengyuan1999 commented 3 months ago

Thanks for your interest in our work. In the initial version, we use ImageNet for a fair comparison with other VQ methods. Larger training data (e.g., Open Images and the LAION dataset) should yield better reconstruction performance. We are planning to scale up the data size and will update the results in the near future.

MonolithFoundation commented 3 months ago

Thanks, that looks awesome. In the meantime, I would like to ask: is it possible to use MAGVIT2 for understanding?

yxgeee commented 3 months ago

Hi @MonolithFoundation , we haven't tried it, but we anticipate some potential difficulties. The MAGVIT2 tokenizer captures lower-level semantics than CLIP-series encoders, as it is trained with a reconstruction objective. This might result in inferior performance when used directly for understanding tasks. However, there should be solutions, which are also crucial for properly unifying understanding and generation in the next generation of multimodal foundation models.

MonolithFoundation commented 3 months ago

@yxgeee Thanks for the feedback.

I also agree there should be solutions; that is also what I am aiming for in unifying understanding and generation.

Currently I would like to give it a brute-force try, but I found that the MAGVIT2 encoder just outputs Lookup-Free Quantization features, so the features are binary (-1 or 1). I want to ask for some help with:

  1. How to deal with this: what would be the proper way to feed it into an LLM's conditional embedding?
  2. What other ways are there to make use of MAGVIT2, currently the best image tokenizer?
yxgeee commented 3 months ago

Hi @MonolithFoundation ,

  1. As done in their original paper, the binary features produced by the MAGVIT2 tokenizer are only used to provide token indices, and the embedding of each index needs to be learned together with the LLM (known as the visual vocabulary). Specifically, each 18-dim binary feature represents an index in the range [0, $2^{18}$). If there are $16 \times 16$ tokens for one image, 256 tokens would be used to represent it in the LLM, where the indices are produced by the MAGVIT2 encoder and the corresponding vocabulary embeddings are learned in the LLM (a minimal sketch of this conversion follows the list).
  2. We are also exploring it. We can discuss it further if we have better solutions.
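
A minimal sketch of this bit-to-index conversion in PyTorch; the function name and example shapes are illustrative, not the repo's API:

```python
import torch

def binary_to_index(binary_feats: torch.Tensor) -> torch.Tensor:
    """Map {-1, 1} features of shape (..., 18) to integer indices in [0, 2**18)."""
    bits = (binary_feats > 0).long()             # -1 -> 0, +1 -> 1
    weights = 2 ** torch.arange(bits.shape[-1])  # [1, 2, 4, ..., 2**17]
    return (bits * weights).sum(dim=-1)          # one integer per token

# e.g., a 16x16 grid of 18-dim binary features -> 256 token indices for the LLM
feats = torch.randint(0, 2, (16, 16, 18)) * 2 - 1  # random {-1, 1} stand-in features
indices = binary_to_index(feats)                   # shape (16, 16), values in [0, 2**18)
```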
MonolithFoundation commented 3 months ago

@yxgeee Thanks for the pointers.

From previous work such as SEED and LaVIT, it looks like they directly concatenate the codebook indices with .. etc. and embed them into the sentence directly, without adding a conditional embedding.

Also, in LLaVA, they just use -200 to represent the image token ID and actually fill the embedding with the feature map.

So what would be a proper way to send codebook indices to the LLM? The way SEED does it? If we send the indices into the LLM, is the embedding necessary?

eisneim commented 3 months ago

@MonolithFoundation Check out Chameleon from Meta; the code is open-sourced on GitHub. I have tried early-fusion training with image patches (no CLIP encoder) mixed with text tokens, like Fuyu-8B. It is very hard to train and becomes unstable as the model parameter count increases. Meta's Chameleon, on the other hand, uses a vision tokenizer and some other tricks, and has shown that this approach has higher potential.

yxgeee commented 3 months ago

@MonolithFoundation Yes, you can refer to SEED. The ... are exactly the indices I mentioned. The difference is that SEED uses a visual vocabulary of 8192 while MAGVIT2 uses $2^{18}$. You can also refer to Chameleon's code (mentioned by @eisneim ); I think it should be similar. However, SEED provides full training code while Chameleon does not.
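
For concreteness, here is a minimal sketch of that indices-into-the-LLM interface, assuming an extended vocabulary; the vocabulary sizes and boundary-token names below are placeholders, not SEED's or Chameleon's actual ones:

```python
from typing import List

TEXT_VOCAB_SIZE = 32000          # e.g., a LLaMA-style tokenizer (assumption)
VISUAL_VOCAB_SIZE = 2 ** 18      # MAGVIT2 LFQ codebook size
BOI_ID = TEXT_VOCAB_SIZE         # hypothetical begin-of-image token
EOI_ID = TEXT_VOCAB_SIZE + 1     # hypothetical end-of-image token
VISUAL_OFFSET = TEXT_VOCAB_SIZE + 2

def build_sequence(text_ids: List[int], visual_indices: List[int]) -> List[int]:
    """Interleave text token ids with offset visual indices, wrapped in boundary tokens.

    The LLM's embedding table is enlarged to TEXT_VOCAB_SIZE + 2 + VISUAL_VOCAB_SIZE rows,
    so the embeddings of the visual entries are learned jointly with the LLM.
    """
    image_ids = [VISUAL_OFFSET + idx for idx in visual_indices]
    return text_ids + [BOI_ID] + image_ids + [EOI_ID]

# Usage: 256 visual indices (a 16x16 grid) appended after a short text prompt
seq = build_sequence(text_ids=[1, 15, 278], visual_indices=list(range(256)))
print(len(seq))  # 3 + 1 + 256 + 1 = 261
```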

MonolithFoundation commented 3 months ago

@yxgeee From what I can tell from the SEED inference code, it just encodes the raw .. with the text tokenizer, without feeding in a conditional embedding.

Is the same approach applicable to MAGVIT2?

Also, I just realized that MAGVIT2 produces too many tokens at the moment: about 18x17x26 when my input resolution is around 512.

That is too many for understanding.

RobertLuo1 commented 3 months ago

Hi @MonolithFoundation , may I ask what image size you feed into the tokenizer? We also release the training code for the transformer stage. As you can see here, https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/models/cond_transformer.py#L99, in the transformer training stage each input image is tokenized into H' x W' x 1, where the 1 denotes the index. The 18-bit code is converted into a single integer, as you can check at https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/modules/vqvae/lookup_free_quantize.py#L260.
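
A schematic of the shapes involved, assuming the 16x spatial downsampling implied by the 16x16-token example earlier in this thread (the factor and the stand-in tensors are assumptions, not the repo's actual encoder code):

```python
import torch

B, H, W = 1, 256, 256
downsample = 16                                   # assumed spatial reduction of the encoder
Hp, Wp = H // downsample, W // downsample         # H' x W' token grid

bits = torch.randint(0, 2, (B, Hp, Wp, 18)) * 2 - 1        # stand-in for LFQ output in {-1, 1}
weights = 2 ** torch.arange(18)
indices = ((bits > 0).long() * weights).sum(-1)            # (B, H', W') integer codes, i.e. H' x W' x 1
flat = indices.view(B, -1)                                 # (B, H'*W') sequence for the transformer stage
print(flat.shape)                                          # torch.Size([1, 256])
```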

MonolithFoundation commented 3 months ago

Hi, my input size is 672 x 672.

Is that too big for MAGVIT2?
