Closed: MonolithFoundation closed this issue 4 months ago
Thanks for your interest in our work. In the initial version, we use ImageNet for a fair comparison with other VQ methods. Larger training data (e.g., Open-Images and the LAION dataset) should yield better reconstruction performance. We are planning to scale up the data size and update the results in the near future.
Thanks, looks awesome. In the meantime, I'd like to ask: is it possible to use MAGVIT for understanding?
Hi @MonolithFoundation , We haven't tried it, but we anticipate some potential difficulties. The MAGVIT2 tokenizer captures lower-level semantics than CLIP-series encoders, since it is trained with a reconstruction objective. This might result in inferior performance when it is used directly for understanding tasks. However, there should be solutions, which are also crucial for properly unifying understanding and generation in the next generation of multimodal foundation models.
@yxgeee thanks for the feedback.
I also agree there should be solutions; that is also what I am working toward for unifying understanding and generation.
For now I would just give it a brute-force try, but I found that the MAGVIT2 encoder outputs Lookup-Free Quantization features, so the features are binary (either -1 or 1). I would like to ask for some help with:
Hi @MonolithFoundation ,
@yxgeee thanks for the pointers.
From previous work such as SEED and LaVIT, it looks like they directly concatenate the codebook indices with
Also, in LLaVA, they just use -200 to represent the image token id, and actually fill the embeddings with the feature map.
So, what would be a proper way to send codebook indices to an LLM, as SEED does? If we send the indices into the LLM, is the embedding still necessary?
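For illustration, here is a minimal PyTorch sketch of the LLaVA-style mechanism described above (not LLaVA's actual code; the function name and toy sizes are made up): a placeholder id in the text sequence is removed at the embedding level and the image features are spliced in at that position.

```python
import torch
import torch.nn as nn

IMAGE_TOKEN_INDEX = -200  # placeholder id in the text token stream, as described above

def splice_image_features(input_ids: torch.Tensor,
                          image_features: torch.Tensor,
                          embed_tokens: nn.Embedding) -> torch.Tensor:
    """input_ids: (seq,) with a single -200 entry; image_features: (n_img, dim)."""
    pos = int((input_ids == IMAGE_TOKEN_INDEX).nonzero()[0])
    text_ids = input_ids[input_ids != IMAGE_TOKEN_INDEX]   # drop the placeholder
    text_embeds = embed_tokens(text_ids)                   # (seq - 1, dim)
    # replace the placeholder position with the (projected) image features
    return torch.cat([text_embeds[:pos], image_features, text_embeds[pos:]], dim=0)

# toy usage: a fake embedding table and a 3-token "image"
embed = nn.Embedding(100, 8)
ids = torch.tensor([5, 7, IMAGE_TOKEN_INDEX, 9])
out = splice_image_features(ids, torch.randn(3, 8), embed)  # shape (6, 8)
```

In this scheme the visual tokens never go through the LLM's embedding table; they arrive as continuous features.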
@MonolithFoundation Check out Chameleon from Meta; the code is open-sourced on GitHub. I have tried early-fusion training, mixing image patches (without a CLIP encoder) with text tokens, like Fuyu-8B. It is very hard to train and becomes unstable as the model parameter count increases. Meta's Chameleon, on the other hand, uses a vision tokenizer and some other tricks, and has shown that this method has higher potential.
@MonolithFoundation Yes, you can refer to SEED. The
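Not SEED's actual implementation, but a minimal sketch of the index-concatenation idea discussed above, with made-up vocabulary sizes: the visual codebook indices are shifted past the text vocabulary, so the LLM's (extended) embedding table handles them like ordinary tokens and no separate embedding step is needed.

```python
import torch
import torch.nn as nn

text_vocab_size, visual_codebook_size, dim = 32000, 8192, 4096   # assumed sizes
embed = nn.Embedding(text_vocab_size + visual_codebook_size, dim)

def to_llm_ids(text_ids: torch.Tensor, visual_indices: torch.Tensor) -> torch.Tensor:
    """Shift visual codebook indices past the text vocabulary and concatenate."""
    return torch.cat([text_ids, visual_indices + text_vocab_size], dim=-1)

ids = to_llm_ids(torch.tensor([1, 2, 3]), torch.tensor([17, 4051]))
embeddings = embed(ids)  # (5, dim), ready for the LLM
```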
@yxgeee From what I can tell from the SEED inference code, it just encodes the raw
Is the same approach applicable to MAGVIT2?
Also, I just realized that MAGVIT2 currently produces too many tokens: about 18x17x26 when my input resolution is around 512.
That is too many for understanding.
Hi @MonolithFoundation , may I ask what size of image you input to the tokenizer? We have also released the code for transformer training; you can see it here: https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/models/cond_transformer.py#L99. During transformer training, each input image is tokenized into H' x W' x 1, where the 1 is the index dimension. The 18-bit code is converted into a single integer, as you can check here: https://github.com/TencentARC/Open-MAGVIT2/blob/3eaaa45d86976d27d57dbcf33465c137308ef74c/taming/modules/vqvae/lookup_free_quantize.py#L260.
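For concreteness, here is a rough sketch of the 18-bit-to-integer mapping described above (not the repository's exact code; see the linked lookup_free_quantize.py for the real implementation, including the exact bit ordering):

```python
import torch

def lfq_code_to_index(code: torch.Tensor) -> torch.Tensor:
    """code: (..., 18) tensor of -1/+1 values -> (...,) integer indices."""
    bits = (code > 0).long()                       # map -1/+1 to 0/1
    powers = 2 ** torch.arange(code.shape[-1])     # 2^0 ... 2^17
    return (bits * powers).sum(dim=-1)             # integer in [0, 2^18)

codes = torch.randint(0, 2, (32, 32, 18)) * 2 - 1  # a 32x32 grid of ±1 codes
indices = lfq_code_to_index(codes)                 # shape (32, 32), i.e. H' x W' x 1
```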
Hi, my input size is 672 x 672.
Is that too big for MAGVIT2?
From the original MAGVIT2 paper, they got consistently good results with larger web-image training data.