Coobiw / MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24GB.

special token #7

Closed · PangziZhang523 closed this issue 4 months ago

PangziZhang523 commented 9 months ago

Why does using '' as a single special token to mark the image position not require retraining the word_embedding layer and the final lm_head output layer?

Coobiw commented 9 months ago

Because this special token is only a placeholder: it is eventually replaced by the visual tokens output by the Q-Former, so its embedding value is never actually used in any computation.
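
For intuition, here is a minimal, self-contained sketch of that idea. The tensor names mirror the repo's code (`llm_tokens`, `inputs_llm`, `replace_image_token_id`), but all sizes and the placeholder id are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy sizes for illustration; the real model uses Qwen's vocab/hidden sizes
# and 32 Q-Former query tokens per image.
vocab_size, hidden_dim, num_image_tokens = 1000, 64, 4
replace_image_token_id = 999  # assumed id of the image-placeholder token

embed = nn.Embedding(vocab_size, hidden_dim)  # stands in for the frozen word_embedding

# (B, L) token ids, with a run of placeholder ids where each image goes
llm_tokens = torch.randint(0, 998, (2, 16))
llm_tokens[:, 4:4 + num_image_tokens] = replace_image_token_id

inputs_embeds = embed(llm_tokens)  # (B, L, C); the placeholder rows are dummies

# (B, num_image_tokens, C) visual tokens from the Q-Former (random here)
inputs_llm = torch.randn(2, num_image_tokens, hidden_dim)

# Overwrite the placeholder embeddings in place: their original values never
# reach the LLM, which is why word_embedding and lm_head need no retraining.
idxs = torch.where(llm_tokens == replace_image_token_id)
inputs_embeds[idxs[0], idxs[1]] = inputs_llm.reshape(-1, hidden_dim)
```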

Coobiw commented 9 months ago

You can refer to these two places in the code:
https://github.com/Coobiw/MiniGPT4Qwen/blob/89829398f435963ad21fc583e3706d0ba65ed535/lavis/models/minigpt4qwen_models/minigpt4qwen.py#L228-L267
https://github.com/Coobiw/MiniGPT4Qwen/blob/89829398f435963ad21fc583e3706d0ba65ed535/lavis/models/minigpt4qwen_models/minigpt4qwen.py#L294-L303

PangziZhang523 commented 9 months ago

I don't fully understand the line `inputs_embeds[replace_image_idxs[0], replace_image_idxs[1]] = inputs_llm.view(-1, channels).to(inputs_embeds.dtype)`. `replace_image_idxs` holds the indices of the placeholders, with shape (B, L)? How does the replacement actually work?

Coobiw commented 9 months ago

`replace_image_idxs = torch.where(llm_tokens == self.replace_image_token_id)`. Both `replace_image_idxs[0]` and `replace_image_idxs[1]` have length bs × num_image_tokens (with bs = 2 and 32 image tokens, that's 32 × 2 = 64), so each has shape (64,). They hold the positions of the placeholders along the batch dimension (dim 0) and the length dimension (dim 1), respectively. That's my understanding; you can confirm by printing the shapes with pdb. Here is a simple example:

```python
>>> import torch
>>> a = torch.arange(32).reshape(4, 8)
>>> a
tensor([[ 0,  1,  2,  3,  4,  5,  6,  7],
        [ 8,  9, 10, 11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20, 21, 22, 23],
        [24, 25, 26, 27, 28, 29, 30, 31]])
>>> b = torch.where(a >= 5)
>>> b
(tensor([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
        3, 3, 3]), tensor([5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4,
        5, 6, 7]))
>>> b[0].shape  # row (batch-dimension) indices of matching positions
torch.Size([27])
>>> b[1].shape  # column (length-dimension) indices of matching positions
torch.Size([27])
```
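
Continuing that toy example: the two index tensors returned by `torch.where` can be fed straight into an advanced-indexing assignment, which is the same pattern the `inputs_embeds[replace_image_idxs[0], replace_image_idxs[1]] = ...` line uses (the replacement values here are arbitrary):

```python
>>> new_vals = torch.arange(100, 127)  # 27 replacement values, one per (row, col) pair
>>> a[b[0], b[1]] = new_vals           # writes in row-major order of the matches
>>> a
tensor([[  0,   1,   2,   3,   4, 100, 101, 102],
        [103, 104, 105, 106, 107, 108, 109, 110],
        [111, 112, 113, 114, 115, 116, 117, 118],
        [119, 120, 121, 122, 123, 124, 125, 126]])
```
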
Coobiw commented 4 months ago

solved