Weili-NLP / UNIMO-G


Would you provide details on how to use the MLLM text embedding? #2

Open RyanHuangNLP opened 9 months ago

RyanHuangNLP commented 9 months ago

@Weili-NLP This is great work! But I could not find how to use the MLLM text embedding. Does it use left padding and take the token embeddings as the text condition embedding?

Weili-NLP commented 9 months ago

Yes, the text embedding works the same as in GPT-based LLMs.

The architecture of our MLLM is similar to MiniGPT-4, consisting of an image encoder for encoding images and an LLM for contextual encoding. The image encoder processes images with a ViT and a Q-Former, then uses a linear layer to project the image embeddings into input embeddings compatible with the LLM.
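A minimal sketch of this MiniGPT-4-style bridge, for readers who want to see the data flow. All module and parameter names here (`vit`, `qformer`, `llm_proj`, the dimensions) are illustrative assumptions, not taken from the UNIMO-G codebase:

```python
import torch
import torch.nn as nn

class ImageToLLMBridge(nn.Module):
    """Hypothetical image-to-LLM bridge: ViT -> Q-Former -> linear projection."""

    def __init__(self, vit, qformer, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vit = vit            # ViT backbone producing patch features
        self.qformer = qformer    # Q-Former compressing patches into query tokens
        # Linear layer mapping Q-Former outputs into the LLM's embedding space
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, pixel_values):
        patch_feats = self.vit(pixel_values)      # (B, num_patches, vit_dim)
        query_tokens = self.qformer(patch_feats)  # (B, num_queries, qformer_dim)
        return self.llm_proj(query_tokens)        # (B, num_queries, llm_dim)
```

The projected query tokens can then be spliced into the LLM's input embedding sequence alongside ordinary text token embeddings, which is what makes interleaved image/text prompts possible.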

trouble-maker007 commented 9 months ago

@Weili-NLP thanks for the response.

I still have some questions. Does this mean that while training the UNet, the input to the MLLM is not only text but also images? Is the input format like `<img>image_path</img>text description`? And do you take the token embeddings from the last hidden state, excluding the padding tokens?

Weili-NLP commented 9 months ago

Yes. Our model input can be multimodal prompts with interleaved images and texts, just as shown in the paper.
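To make the flow concrete, here is a minimal sketch (assuming a Hugging Face-style causal LM and the `ImageToLLMBridge` above) of building an interleaved image/text prompt and taking the last hidden states as the UNet's condition embedding. Splitting the prompt into text segments with images between them is an assumption for illustration, not UNIMO-G's actual preprocessing:

```python
import torch

def build_condition_embedding(llm, tokenizer, bridge, segments, images):
    """Hypothetical helper: segments is a list of text chunks, images a list of
    image tensors placed between consecutive chunks, e.g.
    text0 <img0> text1 <img1> text2."""
    embed_layer = llm.get_input_embeddings()
    pieces = []
    for i, text in enumerate(segments):
        ids = tokenizer(text, return_tensors="pt",
                        add_special_tokens=False).input_ids
        pieces.append(embed_layer(ids))                    # (1, T_i, llm_dim)
        if i < len(images):
            pieces.append(bridge(images[i].unsqueeze(0)))  # (1, num_queries, llm_dim)

    inputs_embeds = torch.cat(pieces, dim=1)               # (1, L, llm_dim)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    out = llm(inputs_embeds=inputs_embeds,
              attention_mask=attention_mask,
              output_hidden_states=True)
    hidden = out.hidden_states[-1]                         # (1, L, llm_dim)
    # For a batch of (left-)padded sequences, padding positions would be
    # dropped via the attention mask, e.g. hidden[attention_mask.bool()].
    return hidden
```

With left padding in a batched setting, the real tokens sit at the right end of each row, so masking with the attention mask keeps exactly the non-padding token embeddings the question asks about.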

RyanHuangNLP commented 9 months ago

@Weili-NLP Got it, so the MLLM supports both text-only conditioning and text-plus-image conditioning. Thanks for the answer. It is really great work exploring whether an MLLM can serve as a good text encoder for the T2I task.

rdcoder33 commented 9 months ago

@Weili-NLP Great work! Any chance we can get the training code for UNIMO-G?