OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

[BUG] patch_attn_mask is computed incorrectly in get_vllm_embedding #274

Open lihua8848 opened 2 weeks ago

lihua8848 commented 2 weeks ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

    def get_vllm_embedding(self, data):
        if 'vision_hidden_states' not in data:
            dtype = self.vpm.embeddings.position_embedding.weight.dtype
            device = self.vpm.embeddings.position_embedding.weight.device
            tgt_sizes = data['tgt_sizes']
            pixel_values_list = data['pixel_values']
            best_grid = data["best_grid"]
            vision_hidden_states = []
            all_pixel_values = []
            img_cnt = []
            for pixel_values in pixel_values_list:
                img_cnt.append(len(pixel_values))
                all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])

            # exist image
            if all_pixel_values:
                tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)

                if self.config.batch_vision_input:
                    max_patches = torch.max(tgt_sizes[:, 0] * tgt_sizes[:, 1])

                    all_pixel_values = torch.nn.utils.rnn.pad_sequence(all_pixel_values, batch_first=True,
                                                                       padding_value=0.0)
                    B, L, _ = all_pixel_values.shape
                    all_pixel_values = all_pixel_values.permute(0, 2, 1).reshape(B, 3, -1, L)

                    patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool, device=device)
                    for i in range(B):
                        patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

The patch_attn_mask computation is wrong: the index is applied to the wrong dimension, so patch_attn_mask ends up all True. [image]

[image] In the image above, at i=4 there are 17 padding positions, so the last 17 entries should be False, but the resulting patch_attn_mask is all True.
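The behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming made-up `tgt_sizes` values (they are not the ones from the screenshots) and running on CPU. It shows that with a mask of shape `(B, 1, max_patches)`, the slice in `patch_attn_mask[i, :n]` lands on the middle dimension of size 1, so the whole last dimension is set to True regardless of `n`.

```python
import torch

# Hypothetical (height, width) patch grids for three image slices;
# these values are illustrative and not taken from the issue.
tgt_sizes = torch.tensor([[2, 3], [1, 4], [2, 2]], dtype=torch.int32)
B = tgt_sizes.shape[0]
max_patches = int(torch.max(tgt_sizes[:, 0] * tgt_sizes[:, 1]))  # 6

buggy_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool)
for i in range(B):
    n = int(tgt_sizes[i][0] * tgt_sizes[i][1])
    # buggy_mask[i] has shape (1, max_patches). The slice `:n` is applied to
    # the dimension of size 1, so it selects the single row, and the last
    # dimension is left unindexed and gets filled completely.
    buggy_mask[i, :n] = True

# Slices 1 and 2 have only 4 real patches, yet nothing is masked out.
print(buggy_mask.all().item())  # True
```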

https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/main/modeling_minicpmv.py#L97

Expected Behavior

patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

should be changed to

patch_attn_mask[i, 0, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True
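As a rough check of the proposed indexing, the sketch below builds the mask with the explicit `0` index on the middle dimension and compares it with a loop-free construction. The `tgt_sizes` values and the names `num_patches`, `fixed_mask`, and `vectorized_mask` are assumptions made for illustration; the vectorized form is an alternative, not code from the repository.

```python
import torch

# Hypothetical (height, width) patch grids; illustrative values only.
tgt_sizes = torch.tensor([[2, 3], [1, 4], [2, 2]], dtype=torch.int32)
num_patches = (tgt_sizes[:, 0] * tgt_sizes[:, 1]).long()  # tensor([6, 4, 4])
B = tgt_sizes.shape[0]
max_patches = int(num_patches.max())

# Proposed fix: index the size-1 middle dimension explicitly,
# so the slice applies to the patch dimension.
fixed_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool)
for i in range(B):
    fixed_mask[i, 0, :int(num_patches[i])] = True

# Loop-free equivalent: position j is valid when j < num_patches[i].
positions = torch.arange(max_patches)
vectorized_mask = (positions[None, :] < num_patches[:, None]).unsqueeze(1)

assert torch.equal(fixed_mask, vectorized_mask)
print(fixed_mask[1, 0].tolist())  # [True, True, True, True, False, False]
```

With this indexing, the padded tail of each shorter sequence stays False, which is the behavior the issue expects.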

Steps To Reproduce

No response

Environment

- OS: Ubuntu 20.04
- Python: 3.10.14
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

iceflame89 commented 2 weeks ago

Thanks for the feedback. We are assessing the impact.

YuzaChongyi commented 2 weeks ago

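Hello, this is indeed a mistake. Thanks for the feedback. To keep training and inference consistent, we will not modify the code on HF directly. We will fix this systematically in a subsequent model release.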

whyiug commented 2 weeks ago

@YuzaChongyi Can you fully assess the impact? We are already fine-tuning the model and using it in production. Alternatively, when will the next model be released?

YuzaChongyi commented 2 weeks ago

As long as patch_attn_mask behaves the same way during training and inference, this causes no problem. We also tried applying the fix directly, and the inference results were essentially unchanged. To keep the evaluation results reproducible, this version will not be updated.

The release date of the next model has not been decided yet; we are working on it.