OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone
Apache License 2.0

int4 version crashes immediately with no error reported #187

Closed luguoyixiazi closed 1 month ago

luguoyixiazi commented 1 month ago

Is there an existing issue / discussion for this?

  • [x] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [x] I have searched FAQ

Current Behavior

Inference works fine with the unquantized version; with the int4 version, Python crashes outright with no Python error, just a Segmentation fault.

Steps To Reproduce

from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('/path/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('/path/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()

image = Image.open('img.jpg').convert('RGB')
# Prompt: "Extract the identity information from the image and return it as JSON"
question = '从图片中提取出有关的身份信息,以json的格式返回'
msgs = [{'role': 'user', 'content': question}]
default_params = {"num_beams": 3, "repetition_penalty": 1.2, "max_new_tokens": 1024, "temperature": 0.1}
try:
    res = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        **default_params
    )
    print(res)
except Exception as e:
    # A segmentation fault kills the interpreter, so this handler never runs.
    print(e)
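
Because the crash is a segmentation fault rather than a Python exception, the built-in faulthandler module is one way to see where it happens; a minimal sketch, to be run before loading the model:

import faulthandler

# Dump a Python-level traceback to stderr when the process receives a
# fatal signal such as SIGSEGV.
faulthandler.enable()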

Environment

- OS: WSL2
- Python: 3.10
- Transformers: 4.40.0
- PyTorch: 2.3.0+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): cu118

Anything else?

Could this be caused by this transformers warning? Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>. low_cpu_mem_usage was None, now set to True since model is quantized.

tc-mb commented 1 month ago

Try adding low_cpu_mem_usage=True to AutoModel.from_pretrained and see if that helps.
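
For reference, the suggested load call would look roughly like this (same placeholder path as in the report):

from transformers import AutoModel

# Identical to the original load call, with the suggested flag added.
# low_cpu_mem_usage=True builds the model with empty weights first and
# then loads the checkpoint, lowering peak host-memory usage.
model = AutoModel.from_pretrained(
    '/path/MiniCPM-Llama3-V-2_5-int4',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)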

luguoyixiazi commented 1 month ago

Tracked it down: the problem comes from vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask).last_hidden_state in get_vllm_embedding. Hmm.
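
One way to narrow this down further (assuming the loaded model exposes the vision tower as model.vpm, as in the quoted line) is to check which dtype and device the quantized vision tower actually ended up on, for comparison against the dtype passed in for all_pixel_values; a hypothetical probe:

import torch

# Hypothetical probe: report the vision tower's parameter dtype/device
# and the basic CUDA setup, to compare against the dtype used for
# all_pixel_values in get_vllm_embedding.
vpm_param = next(model.vpm.parameters())
print(vpm_param.dtype, vpm_param.device)
print(torch.cuda.is_available(), torch.version.cuda)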

luguoyixiazi commented 1 month ago

Hmm, I basically solved it by downgrading the NVIDIA driver to 537.58 and torch to 2.1.2 (with its matching companion packages). From what I've read, this error stems from problems with both the NVIDIA driver and torch...
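
For anyone hitting the same symptom, a quick smoke test after changing driver/torch versions (standard PyTorch calls, nothing specific to MiniCPM) can confirm the CUDA stack is healthy before reloading the model:

import torch

# Verify that this torch build sees the driver and that a basic CUDA
# kernel executes without crashing.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
x = torch.randn(1024, 1024, device='cuda')
print((x @ x).sum().item())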