GAIR-NLP / anole

Anole: An Open, Autoregressive, Native Multimodal Model for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

Question about the training data #37

Open Epiphqny opened 3 months ago

Epiphqny commented 3 months ago

Dear authors, thank you for your excellent work. I have a question regarding your training methodology, specifically concerning the utilization of training data. Upon examining the code in your GitHub repository (https://github.com/GAIR-NLP/anole/blob/219a9a3c8b2d2b67a9bcf92d341faaa16335b1fe/facilitating_image_generation/train_image_head.py#L19), I noticed that only image tokens appear to be fed into the network. Could you please confirm whether my understanding is correct? If so, I'm curious how the model learns to generate images corresponding to different text inputs.

irexyc commented 2 months ago

It seems the training is built upon the official Chameleon checkpoint.

I think the training doc is very clear. They build the dataset with (text, image_tokens) pairs and only train the output layer rows that produce the special image tokens (4, 8196).
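The idea above can be sketched in a few lines. This is a minimal NumPy illustration (not the authors' code) of training only the image-token rows of a frozen model's output head: compute the cross-entropy gradient with respect to the head weights, then mask the update so that only rows in the assumed image-token id range (4, 8196) change. Vocabulary and hidden sizes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 10000, 64
IMG_LO, IMG_HI = 4, 8196  # assumed image-token id range from the comment above

W = rng.normal(scale=0.02, size=(VOCAB, HIDDEN))   # output head (the only trained part)
h = rng.normal(size=(16, HIDDEN))                  # frozen backbone hidden states
targets = rng.integers(IMG_LO, IMG_HI, size=16)    # image-token targets from a (text, image_tokens) pair

# Forward: logits and softmax probabilities over the full vocabulary.
logits = h @ W.T
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Backward: gradient of mean cross-entropy w.r.t. logits, then w.r.t. W.
dlogits = probs.copy()
dlogits[np.arange(len(targets)), targets] -= 1.0
dlogits /= len(targets)
dW = dlogits.T @ h

# Mask the update: only image-token rows of the head are allowed to move.
mask = np.zeros((VOCAB, 1))
mask[IMG_LO:IMG_HI] = 1.0
W_new = W - 0.1 * (dW * mask)

changed = (W_new != W).any(axis=1)
print(changed[:IMG_LO].any(), changed[IMG_LO:IMG_HI].any(), changed[IMG_HI:].any())
```

In a real PyTorch setup the same effect can be had by freezing all parameters, enabling gradients only on the output projection, and registering a gradient hook that zeroes the non-image-token rows.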