Open Epiphqny opened 3 months ago
It seems the training is built upon the official Chameleon checkpoint.
I think the training doc is quite clear. They build a dataset of (text, image_tokens) pairs and train only the output layer that produces the special image tokens (4, 8196).
Dear authors, thank you for your excellent work. I have a question regarding your training methodology, specifically how the training data is used. Examining the code in your GitHub repository (https://github.com/GAIR-NLP/anole/blob/219a9a3c8b2d2b67a9bcf92d341faaa16335b1fe/facilitating_image_generation/train_image_head.py#L19), I noticed that only image tokens appear to be fed into the network. Could you confirm whether my understanding is correct? If so, how does the model learn to generate images that correspond to different text inputs?
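To make the head-only training discussed above concrete, here is a minimal PyTorch sketch of the general idea: freeze the backbone and update only the image-token rows of the output head via a gradient mask. All sizes, the toy backbone, and the `mask_grad` hook are hypothetical stand-ins for illustration, not the actual Anole/Chameleon configuration or code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes for illustration (not the real Chameleon config).
VOCAB_SIZE = 100        # full vocabulary
IMAGE_TOKEN_START = 90  # assume image tokens occupy ids [90, 100)
HIDDEN = 32

# Toy "backbone" standing in for the pretrained transformer.
backbone = nn.Sequential(nn.Embedding(VOCAB_SIZE, HIDDEN), nn.Linear(HIDDEN, HIDDEN))
lm_head = nn.Linear(HIDDEN, VOCAB_SIZE, bias=False)

# Freeze the backbone: only the output head receives gradients.
for p in backbone.parameters():
    p.requires_grad = False

# Mask gradients so only the image-token rows of the head get updated.
def mask_grad(grad):
    mask = torch.zeros_like(grad)
    mask[IMAGE_TOKEN_START:] = 1.0
    return grad * mask

lm_head.weight.register_hook(mask_grad)

opt = torch.optim.SGD(lm_head.parameters(), lr=0.1)

# One next-token-prediction step on a random (text, image_tokens) sequence.
tokens = torch.randint(0, VOCAB_SIZE, (2, 8))
logits = lm_head(backbone(tokens))
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB_SIZE),
    tokens[:, 1:].reshape(-1),
)

before = lm_head.weight.detach().clone()
loss.backward()
opt.step()

# Per-row change in the head: text-token rows stay fixed,
# image-token rows move.
delta = (lm_head.weight.detach() - before).abs().sum(dim=1)
```

Note that even though only image-token rows are trainable, the loss is still computed over full sequences, so the head learns to emit image tokens conditioned on the text prefix that precedes them.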