AILab-CVC / SEED-X

Multimodal Models in Real World

Questions about number of tokens in tokenizer~ #18

Open thaoshibe opened 2 months ago

thaoshibe commented 2 months ago

Hi, congrats on your interesting work! I have a question about the mismatch in the number of tokens between the paper and the implementation.

As far as I understand from the paper: LLaMA has 32,000 tokens, and SEED-X adds 290 new tokens on top of LLaMA, as detailed below:

But when I run the code, I found that there are actually 330 added tokens!

My questions are:

  1. What are the patch tokens used for?
  2. How many visual tokens (in other words, image tokens) are actually added to the final models? (N = 100 or N = 64?)
  3. Do you have any comment on the number of image tokens? (e.g., do you feel that more image tokens would be better, i.e., that N = 100 is better than N = 64?)

Thank you so much!

AsteriaCao commented 2 months ago

Hi, I notice in configs/clm_models/llm_seed_x_lora.yaml that special patch tokens are used to tag the beginning and end of patches, which represent the split original image in raster-scan order. Image patches are added to the multimodal inputs to enable any-resolution image generation according to the paper, which is also a novelty of SEED-X.
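To make the layout concrete, here is a minimal sketch of how such an input sequence could be assembled: each patch (in raster-scan order) is wrapped in begin/end patch-delimiter tokens. The token names (`<patch_start>`, `<patch_end>`, `<img_i>`) and the helper function are illustrative assumptions, not the actual SEED-X vocabulary or code.

```python
# Hypothetical sketch: wrap each image patch's visual tokens in
# patch-delimiter tokens, in raster-scan order. Token names are
# placeholders, not the real SEED-X special tokens.

N_VISUAL = 64  # visual tokens per patch, per this thread

def wrap_patches(patch_token_lists):
    """patch_token_lists: list of per-patch visual-token lists,
    already ordered in raster-scan order."""
    seq = []
    for patch in patch_token_lists:
        assert len(patch) == N_VISUAL
        seq.append("<patch_start>")  # hypothetical begin-of-patch token
        seq.extend(patch)
        seq.append("<patch_end>")    # hypothetical end-of-patch token
    return seq

# Example: a 2x2 grid of patches (dummy visual-token names).
patches = [[f"<img_{i}>" for i in range(N_VISUAL)] for _ in range(4)]
seq = wrap_patches(patches)
print(len(seq))  # 4 patches * (64 visual tokens + 2 delimiters) = 264
```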

However, I have the same question about the 100 visual tokens. I found that 64 tokens are used to represent an image, but vocab_size is actually 32330. Perhaps the extra tokens were reserved to explore whether adding more visual tokens improves performance. Looking forward to your reply!

geyuying commented 1 month ago
  1. Yes, patch tokens are used to tag the beginning and end of image patches. @AsteriaCao's understanding is correct.

  2. We actually use N=64 visual tokens to represent an image, and the remaining 36 visual tokens are not used during training.

  3. For the ablation study on the number of visual tokens used to represent an image, you can refer to https://github.com/geyuying/SEED-X. We find that more visual tokens lead to better image reconstruction for the image de-tokenizer, but to worse regression for the MLLM.
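Putting the numbers from this thread together, the vocabulary accounting works out as follows. This is only a sketch of the arithmetic stated above; the exact breakdown of the 330 added tokens beyond the 100 visual slots is not specified in the thread, so only the figures explicitly mentioned are used.

```python
# Vocabulary accounting using the numbers stated in this thread.
LLAMA_VOCAB = 32_000   # base LLaMA vocabulary
ADDED_TOKENS = 330     # tokens actually added in the released code
assert LLAMA_VOCAB + ADDED_TOKENS == 32_330  # matches observed vocab_size

RESERVED_VISUAL = 100  # visual-token slots present in the vocabulary
USED_VISUAL = 64       # N=64 actually used during training
unused = RESERVED_VISUAL - USED_VISUAL
print(unused)  # 36 visual tokens left unused, as the maintainer states
```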