RyanHuangNLP opened 9 months ago
Yes, the text embedding is obtained the same way as in GPT-based LLMs.
The architecture of our MLLM is similar to MiniGPT-4, consisting of an image encoder for encoding images and an LLM for contextual encoding. The image encoder processes images with a ViT and a Q-Former, and a linear layer then converts the image embeddings into input embeddings compatible with the LLM.
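For illustration only (not the actual UNIMO-G code), here is a minimal PyTorch sketch of that projection step, assuming the Q-Former outputs query embeddings of a hypothetical dimension `q_dim` and the LLM expects input embeddings of dimension `llm_dim`:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps Q-Former query outputs into the LLM's input embedding space.
    Dimensions here are placeholders; the real model's sizes may differ."""
    def __init__(self, q_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(q_dim, llm_dim)

    def forward(self, qformer_out: torch.Tensor) -> torch.Tensor:
        # qformer_out: [batch, num_query_tokens, q_dim]
        return self.proj(qformer_out)  # [batch, num_query_tokens, llm_dim]

# Usage sketch: splice the projected image tokens into the text-token
# embedding sequence before feeding the LLM (interleaved multimodal prompt).
projector = VisualProjector()
image_tokens = projector(torch.randn(1, 32, 768))   # projected image embeddings
text_embeds = torch.randn(1, 20, 4096)              # LLM embeddings of the text tokens
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
```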
@Weili-NLP thanks for the response.
I still have some questions: does this mean that while training the UNet, the input to the MLLM is not only text but also images?
Is the input format something like `<img>image_path</img>text description`?
And do you take the token embeddings from the last hidden layer, excluding the padding tokens?
Yes. Our model input can be multimodal prompts with interleaved images and texts, just as shown in the paper.
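For reference, a minimal sketch of extracting such conditioning embeddings, assuming a Hugging Face-style decoder-only backbone with left padding; the checkpoint name and variable names below are placeholders, not the actual UNIMO-G code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; substitute the actual MLLM backbone.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Decoder-only LMs are typically padded on the left so that the final
# positions of every sequence hold real tokens rather than padding.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["a photo of a corgi on the beach", "an oil painting of a castle"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state                 # [batch, seq_len, hidden]
mask = inputs["attention_mask"].unsqueeze(-1)  # 1 for real tokens, 0 for padding
cond_embeds = hidden * mask                    # zero out padding positions
```

The attention mask is what lets you drop (or zero out) the padding-token states before passing the sequence to the diffusion UNet as the condition.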
@Weili-NLP Got it, so the MLLM supports both text-only conditioned generation and text-plus-image conditioned generation. Thanks for the answer. Anyway, really great job exploring whether an MLLM can serve as a good text encoder for the T2I task.
@Weili-NLP Great work! Any chance we can get the training code for UNIMO-G?
@Weili-NLP That is a great job! But I could not find how the MLLM text embedding is used: does it use left padding and take the token embeddings as the text condition embedding?