[BUG] <RuntimeError: Sizes of tensors must match except in dimension 1 when fine-tuning with the latest code>

Single430 commented 2 months ago

Is there an existing issue / discussion for this?

[X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

[X] I have searched FAQ

Current Behavior

Script run: /MiniCPM-V/finetune# ./finetune_lora.sh. The data follows the standard official format; only the conversation is shown:

[{'content': '<image>\nSquid are a part of what group in the diagram?Top Predators\nFilterers\nPredators\nZooplankton Please answer the question based on the options mentioned before.', 'role': 'user'}, {'content': 'Predators', 'role': 'assistant'}]

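First working case: if both the width and the height of the image referenced by <image> are below 448, no error is raised. Around line 145 of conversation_to_ids in finetune/dataset.py, image_start_tokens = [] and image_end_tokens = [id], so the hstack branch below is never entered: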

    if len(image_start_tokens) > 0:
        image_bound = torch.hstack(
            [image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)]
        )
    else:
        image_bound = []

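The case above raises no error. There is a second working case, in which the image may be larger than 448: when <image> comes at the end of the content, as in Squid are a part of what group in the diagram?Top Predators\nFilterers\nPredators\nZooplankton Please answer the question based on the options mentioned before.\n<image>, no error is raised either. The first case guarantees that hstack is never called; the second guarantees that len(image_start_tokens) == len(image_end_tokens).

The third case is the one that fails. The root cause is that the image width or height exceeds 448, so the image has to be sliced. When the content has the form <image>\nSquid are a pa..., the "remove bos" step around line 181 of conversation_to_ids_minicpm strips the leading <image> from message_ids. As a result, image_start_tokens = [id1, id2] and image_end_tokens = [id11, id22, id33] have different lengths, and hstack fails with the following error: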

File "/home/MiniCPM-V/finetune/dataset.py", line 148, in conversation_to_ids
    image_bound = torch.hstack(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 3 for tensor number 1 in the list.
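
The length mismatch can be reproduced without the model. The following is a minimal sketch, not the repository's code: the token ids IM_START and IM_END and the helper find_bounds are hypothetical, and it assumes, as described above, that the tokenizer does not prepend a BOS token, so an unconditional [1:] drops the first image start marker.

```python
# Hypothetical token ids, for illustration only (not MiniCPM-V's real vocabulary).
IM_START, IM_END = 101, 102


def find_bounds(ids):
    """Return the positions of image start markers and image end markers."""
    starts = [i for i, t in enumerate(ids) if t == IM_START]
    ends = [i for i, t in enumerate(ids) if t == IM_END]
    return starts, ends


# Assumed layout: a large image expanded into three marked spans, followed by text ids.
message_ids = [IM_START, 7, IM_END, IM_START, 7, IM_END, IM_START, 7, IM_END, 55, 56]

starts, ends = find_bounds(message_ids)
print(len(starts), len(ends))  # 3 3

# Unconditional "remove bos" when no BOS was prepended: the first start marker is lost.
starts, ends = find_bounds(message_ids[1:])
print(len(starts), len(ends))  # 2 3
```

Two start positions against three end positions is the same shape as the "Expected size 2 but got size 3" message in the traceback.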

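Finally, I changed the code around line 181 of conversation_to_ids_minicpm as shown below. Since the model is autoregressive, it makes sense to drop the first token of the target, but the user's question presumably does not need that. Whether this change is correct still needs to be verified: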

    prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
    if role == "user":
        message_ids = tokenizer.encode(message)
    else:
        message_ids = tokenizer.encode(message)[1:]
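
An alternative that does not depend on the role is to strip the first id only when it really is the BOS token. This is a sketch under assumptions, not a fix confirmed by the maintainers: strip_leading_bos is a hypothetical helper, and the ids in the demo are made up.

```python
def strip_leading_bos(ids, bos_token_id):
    """Drop the first id only if it equals the BOS id; otherwise return a copy unchanged."""
    if bos_token_id is not None and len(ids) > 0 and ids[0] == bos_token_id:
        return list(ids[1:])
    return list(ids)


BOS = 1          # hypothetical BOS id
IM_START = 101   # hypothetical image start id

# Tokenizer that prepends BOS: the BOS is removed, the image marker survives.
print(strip_leading_bos([BOS, IM_START, 7], BOS))  # [101, 7]

# Tokenizer that does not prepend BOS: nothing is removed.
print(strip_leading_bos([IM_START, 7], BOS))       # [101, 7]
```

In the dataset code this would presumably be called as strip_leading_bos(tokenizer.encode(message), tokenizer.bos_token_id), assuming the tokenizer exposes bos_token_id.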

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

qyc-98 commented 1 month ago

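Does LLM_TYPE match the model when you run the script? Also, this change looks problematic: some tokenizers emit a leading token, that is, an extra token in front of the text. Our original code strips that token by default, so we recommend running the original code.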

Single430 commented 1 month ago

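> Does LLM_TYPE match the model when you run the script?

> Also, this change looks problematic: some tokenizers emit a leading token, that is, an extra token in front of the text. Our original code strips that token by default, so we recommend running the original code.

llm_type=minicpmv. As for some tokenizers emitting a leading token, I have not seen that at all on my side. Could it be a transformers version issue?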
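
Whether a given tokenizer prepends a leading token can be checked directly. The sketch below is an illustration only: prepends_bos and StubTokenizer are hypothetical, and the check assumes the tokenizer exposes encode and bos_token_id, as Hugging Face tokenizers do.

```python
def prepends_bos(tokenizer, probe="hello"):
    """Return True if encode() puts the tokenizer's BOS id at position 0."""
    ids = tokenizer.encode(probe)
    bos = getattr(tokenizer, "bos_token_id", None)
    return bos is not None and len(ids) > 0 and ids[0] == bos


class StubTokenizer:
    """Stand-in for a real tokenizer, so the check can run without downloading a model."""

    bos_token_id = 1

    def __init__(self, add_bos):
        self.add_bos = add_bos

    def encode(self, text):
        body = [ord(c) + 1000 for c in text]  # fake ids that cannot collide with BOS
        return [self.bos_token_id] + body if self.add_bos else body


print(prepends_bos(StubTokenizer(add_bos=True)))   # True
print(prepends_bos(StubTokenizer(add_bos=False)))  # False
```

Running the same check on the tokenizer actually loaded by the fine-tuning script would show which of the two behaviours applies to this setup.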

qyc-98 commented 1 month ago

Are you using the llama3 version?

zschanghai commented 1 day ago

@Single430 Hello, sorry to bother you. I see that you have closed this issue. I ran into the same problem and would like to ask how you resolved it.
