AILab-CVC / SEED-X

Multimodal Models in Real World

Questions about the visual tokenizer and de-tokenizer training stage #2

Closed friedrichor closed 5 months ago

friedrichor commented 5 months ago

Hello. Thanks for your excellent work!

I have a question about the training steps in section 3.1 Visual Tokenization and De-tokenization.

What is the role of the second stage of visual tokenizer and de-tokenizer training? Why do you need to use conditional images as input for fine-tuning? Does this improve model performance?

My understanding is that for SEED-X-Edit, following InstructPix2Pix, the input image is taken as the conditional image, so this training stage is clearly useful for the image editing task. But is this stage also helpful for SEED-X-PPT, SEED-X-Story, and SEED-X-Try-on? The tasks for these variants of the model don't seem to require an input image as a conditional image, so I suspect there is a gap between the second stage and these downstream tasks, which may not help model performance on them.

I would appreciate it if you could clear up my confusion.

geyuying commented 5 months ago

Your understanding is correct.

For SEED-X-Edit and SEED-X-Try-on, we use the visual de-tokenizer after the second stage, which takes the conditional image as input to preserve the fine-grained details of the input image (Virtual Try-on also needs to preserve the details of the model image).

For SEED-X-I, SEED-X-PPT, SEED-X-Story, we use the visual de-tokenizer after the first stage without a conditional image as input.
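The two usage modes above can be sketched as follows. This is not the authors' code, only a minimal illustration of InstructPix2Pix-style conditioning: in the stage-2 de-tokenizer, the conditional image's latent is concatenated channel-wise with the noisy latent before the diffusion U-Net, while the stage-1 de-tokenizer denoises without it. Function names and shapes here are hypothetical.

```python
import numpy as np

def detokenizer_input(noisy_latent, cond_image_latent=None):
    """Build the de-tokenizer U-Net input for one denoising step (sketch).

    Stage 1 (SEED-X-I, -PPT, -Story): no conditional image, so the input
    is the noisy latent alone.
    Stage 2 (SEED-X-Edit, -Try-on): the conditional image's latent is
    concatenated along the channel axis (as in InstructPix2Pix), so the
    U-Net's first conv must accept twice as many input channels.
    """
    if cond_image_latent is None:
        return noisy_latent  # stage-1 path: generate from visual tokens only
    # stage-2 path: fine-grained details of the input image flow in here
    return np.concatenate([noisy_latent, cond_image_latent], axis=1)

# Toy latents with shape (batch, channels, height, width)
noisy = np.zeros((1, 4, 64, 64))
cond = np.ones((1, 4, 64, 64))

print(detokenizer_input(noisy).shape)        # (1, 4, 64, 64)
print(detokenizer_input(noisy, cond).shape)  # (1, 8, 64, 64)
```

The channel concatenation is what lets the stage-2 de-tokenizer preserve pixel-level detail from the conditional image, which is why it suits editing and try-on but is unnecessary when no input image exists.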

friedrichor commented 5 months ago

Thanks for your answer.