OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

[BUG] Why does dataset.py in the finetune code remove the BOS token when reading a conversation? #316

Open Single430 opened 2 days ago

Single430 commented 2 days ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

File "/home/zbl/seres/MiniCPM-V/finetune/dataset.py", line 297, in preprocess
    input_dict = conversation_to_ids(conversation, tokenizer, llm_type)
  File "/home/zbl/seres/MiniCPM-V/finetune/dataset.py", line 126, in conversation_to_ids
    ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
  File "/opt/conda/lib/python3.10/site-packages/numpy/core/shape_base.py", line 357, in hstack
    return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
TypeError: Cannot cast array data from dtype('float64') to dtype('int32') according to the rule 'same_kind'
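
For reference, this failure can be reproduced whenever one of the per-turn id lists is empty: np.hstack promotes an empty Python list to a float64 array, and casting float64 to int32 is forbidden under the default casting='same_kind' rule. A minimal sketch (the values are illustrative, not the actual token ids):

import numpy as np

input_ids = [[20600, 29], [], [13]]         # one turn produced no token ids
ids = np.hstack(input_ids, dtype=np.int32)  # raises the same TypeError as above

The empty message_ids entry in the log below appears to be exactly this case.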

Modified code in dataset.py (debug prints added):

prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
message_ids = tokenizer.encode(message)[1:]
print(prefix, message)
print("prefix_ids", prefix_ids, tokenizer.decode(prefix_ids))
print("message_ids", message_ids, tokenizer.decode(message_ids))

The printed log output is as follows:

<用户> What is the genre of this book?
prefix_ids [20600, 29] 用户>
message_ids [374, 279, 17779, 315, 420, 2363, 30]  is the genre of this book?
<AI> History
prefix_ids [15836, 29] AI>
message_ids []
<用户> Is this book related to History?
prefix_ids [20600, 29] 用户>
message_ids [420, 2363, 5552, 311, 11346, 30]  this book related to History?
<AI> Yes.
prefix_ids [15836, 29] AI>
message_ids [13] .

From this output, prefix_ids is incomplete (<用户> became 用户>), and message_ids is incomplete as well (Yes. became .).
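
A quick sanity check (a hypothetical snippet, assuming a Hugging Face tokenizer that exposes bos_token_id): only strip the first id when it really is the BOS token, otherwise the [1:] slice drops a real content token.

ids = tokenizer.encode(prefix)
if ids and ids[0] == tokenizer.bos_token_id:
    prefix_ids = ids[1:]   # a real BOS was prepended; safe to drop
else:
    prefix_ids = ids       # encode() added no BOS; keep every token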

Expected Behavior

Training runs normally.

Steps To Reproduce

/MiniCPM-V/finetune# ./finetune_lora.sh

The dataset follows the official standard format:

[
  {
    "id": "0",
    "image": "path/to/image_0.jpg",
    "conversations": [
      { "role": "user", "content": "<image>\nHow many desserts are on the white plate?" },
      { "role": "assistant", "content": "There are three desserts on the white plate." },
      { "role": "user", "content": "What type of desserts are they?" },
      { "role": "assistant", "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them." },
      { "role": "user", "content": "What is the setting of the image?" },
      { "role": "assistant", "content": "The image is set on a table top with a plate containing the three desserts." }
    ]
  }
]

Environment

- OS: should be unrelated to the OS version
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

LDLINGLINGLING commented 17 hours ago

The target-building code:

target = torch.full_like(ids, -100, dtype=torch.int32)  # -100 is ignored by the loss
for i in range(1, len(ids)):
    if context[i] == 0:                          # assistant token: supervised
        target[i - 1] = ids[i]                   # position i-1 predicts token i
    if context[i] == 1 and context[i - 1] == 0:  # assistant turn just ended
        if hasattr(tokenizer, "eot_id"):
            target[i - 1] = tokenizer.eot_id     # Llama-3-style end-of-turn id
        else:
            target[i - 1] = tokenizer.eos_id     # otherwise the end-of-sequence id

Based on your description, I located line 126. I am not sure whether this is the place you mean. If so, from my experience this code builds the autoregressive target: since training is autoregressive, the first token is used to predict the second token, which is why the leading start symbol is stripped at the beginning.
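
To illustrate the shift with a made-up example (not the repo's data), where context marks prompt tokens with 1 and assistant tokens with 0:

import torch

ids = torch.tensor([128000, 11, 22, 33, 44], dtype=torch.int32)  # illustrative ids; ids[0] is BOS
context = [1, 1, 1, 0, 0]                                        # 1 = prompt, 0 = assistant reply

target = torch.full_like(ids, -100)   # -100 is ignored by the loss
for i in range(1, len(ids)):
    if context[i] == 0:
        target[i - 1] = ids[i]        # position i-1 learns to predict token i

print(target)  # tensor([-100, -100, 33, 44, -100], dtype=torch.int32)

Presumably the per-segment [1:] slice exists so that only one BOS survives at the head of the concatenated sequence, rather than one per turn; the failure reported above suggests it misfires when encode() adds no BOS in the first place.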