OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

hi, when will the training code be released? && do you train image and text tokens all autoregressively? #8

Open Jiushanhuadao opened 3 months ago

JunZhan2000 commented 3 months ago

The first question: We have not considered releasing the training code for now, but in fact you can use any large language model training framework to train AnyGPT, since the model only expands the vocabulary and the prediction head. If the community deems it necessary, open-sourcing the training code may be planned.
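
To make the "expand the vocabulary and the prediction head" point concrete, here is a minimal sketch of that step with Hugging Face transformers. It is not the official AnyGPT training code; the base checkpoint name, the codebook size, and the token naming scheme are assumptions for illustration.

```python
# Minimal sketch (not the released AnyGPT code): extend an LLM's vocabulary
# and prediction head with discrete image-codebook tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM checkpoint
CODEBOOK_SIZE = 16384                    # assumption: e.g. a VQGAN codebook

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Register one new token per image codebook entry, plus modality delimiters
# (token names here are placeholders).
image_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)]
tokenizer.add_tokens(image_tokens + ["<soi>", "<eoi>"])

# Resize the input embedding matrix and the (tied) LM head so the new tokens
# can be embedded and predicted.
model.resize_token_embeddings(len(tokenizer))
```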

The second question: Yes, it is trained only with the Next Token Prediction task, and the loss is calculated on the tokens of all modalities.
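
Continuing the sketch above, "loss on the tokens of all modalities" just means ordinary causal-LM training on the interleaved sequence, with labels equal to the inputs. The token ids below are placeholders.

```python
# Sketch of next-token prediction with loss on every position, text and image
# tokens alike. Reuses `model` from the previous snippet.
import torch

# <caption tokens> <soi> <image code tokens> <eoi>, already tokenized.
input_ids = torch.tensor([[101, 2023, 2003, 32002, 40001, 40517, 39990, 32003]])

# labels == input_ids: HF causal LMs shift labels internally, so the loss is
# computed at every position of the sequence.
outputs = model(input_ids=input_ids, labels=input_ids.clone())
print(outputs.loss)
```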

Jiushanhuadao commented 3 months ago

Thank you for your response. Regarding the second question, I have tried this method, using a VQGAN for image tokenization; the VQGAN's codebook size is 16384. I add the codes to the LLM's vocabulary with the `add_tokens` function and resize the embedding. But the loss converges to a relatively high value, around 4.2. I concatenate the text tokens and image tokens of each image-text pair and predict the next token as you describe. So I guess it would be better to align the image and text tokens with LLaVA-style image-text alignment, i.e., mask the input tokens and compute loss only on the remaining tokens? What loss value do you converge to? Thank you for your sincere response.
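
For reference, the LLaVA-style alternative mentioned above (mask the input tokens, predict the remaining ones) would look roughly like this; positions labeled -100 are excluded from the loss. This is a hedged sketch with placeholder ids, reusing `model` from the earlier snippet.

```python
# Sketch: compute the loss only on the target tokens by masking the
# conditioning (prompt) tokens with the ignore index.
import torch

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores

input_ids = torch.tensor([[101, 2023, 2003, 32002, 40001, 40517, 39990, 32003]])
prompt_len = 3  # assumption: number of conditioning (caption) tokens in front

labels = input_ids.clone()
labels[:, :prompt_len] = IGNORE_INDEX  # no loss on the conditioning tokens

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)
```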

JunZhan2000 commented 3 months ago

I don't think where the loss is calculated makes much of a difference, as the task already includes both text-to-image and image-to-text directions. CM3Leon also used VQGAN for tokenization and it worked well (https://arxiv.org/abs/2309.02591). Perhaps you should check if something went wrong somewhere. You can start by testing in one generation direction.
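
To illustrate what "one generation direction" versus "both directions" means at the data level (this is an assumption about how the sequences could be built, not the paper's exact recipe):

```python
# Placeholder token ids for a single image-text pair.
text_ids = [2023, 2003, 1037]      # caption tokens
image_ids = [40001, 40517, 39990]  # VQGAN code tokens
SOI, EOI = 32002, 32003            # modality delimiters (placeholders)

# text-to-image only: the caption conditions the image codes.
t2i = text_ids + [SOI] + image_ids + [EOI]

# both directions: also train on the image-to-text ordering, so the same
# pairs teach captioning as well as generation.
i2t = [SOI] + image_ids + [EOI] + text_ids
```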

unyqhz commented 2 weeks ago

Hello, I'm running into the same problem! Can you tell me what difference I should expect between training in one generation direction versus both text-to-image and image-to-text directions? Will their losses differ much?