OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

[BUG] Why does dataset.py in the finetune code remove the BOS token when reading a conversation? #316

Open Single430 opened 2 days ago

Single430 commented 2 days ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

File "/home/zbl/seres/MiniCPM-V/finetune/dataset.py", line 297, in preprocess
    input_dict = conversation_to_ids(conversation, tokenizer, llm_type)
  File "/home/zbl/seres/MiniCPM-V/finetune/dataset.py", line 126, in conversation_to_ids
    ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
  File "/opt/conda/lib/python3.10/site-packages/numpy/core/shape_base.py", line 357, in hstack
    return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
TypeError: Cannot cast array data from dtype('float64') to dtype('int32') according to the rule 'same_kind'
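
For reference, this failure can be reproduced whenever one of the per-turn id lists is empty: np.hstack promotes an empty Python list to a float64 array, and casting float64 to int32 is forbidden under the default casting='same_kind' rule. A minimal sketch (the values are illustrative, not the actual token ids):

import numpy as np

input_ids = [[20600, 29], [], [13]]         # one turn produced no token ids
ids = np.hstack(input_ids, dtype=np.int32)  # raises the same TypeError as above

The empty message_ids entry in the log below appears to be exactly this case.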

Modified code in dataset.py (debug prints added):

prefix_ids = tokenizer.encode(prefix)[1:]  # remove bos
message_ids = tokenizer.encode(message)[1:]
print(prefix, message)
print("prefix_ids", prefix_ids, tokenizer.decode(prefix_ids))
print("message_ids", message_ids, tokenizer.decode(message_ids))

The printed log output is as follows:

<用户> What is the genre of this book?
prefix_ids [20600, 29] 用户>
message_ids [374, 279, 17779, 315, 420, 2363, 30]  is the genre of this book?
<AI> History
prefix_ids [15836, 29] AI>
message_ids []
<用户> Is this book related to History?
prefix_ids [20600, 29] 用户>
message_ids [420, 2363, 5552, 311, 11346, 30]  this book related to History?
<AI> Yes.
prefix_ids [15836, 29] AI>
message_ids [13] .

From this output, prefix_ids is incomplete (<用户> became 用户>), and message_ids is incomplete as well (Yes. became .).
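
A quick sanity check (a hypothetical snippet, assuming a Hugging Face tokenizer that exposes bos_token_id): only strip the first id when it really is the BOS token, otherwise the [1:] slice drops a real content token.

ids = tokenizer.encode(prefix)
if ids and ids[0] == tokenizer.bos_token_id:
    prefix_ids = ids[1:]   # a real BOS was prepended; safe to drop
else:
    prefix_ids = ids       # encode() added no BOS; keep every token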

Expected Behavior

Training runs normally.

Steps To Reproduce

/MiniCPM-V/finetune# ./finetune_lora.sh

The dataset follows the official standard format:

[
  {
    "id": "0",
    "image": "path/to/image_0.jpg",
    "conversations": [
      { "role": "user", "content": "<image>\nHow many desserts are on the white plate?" },
      { "role": "assistant", "content": "There are three desserts on the white plate." },
      { "role": "user", "content": "What type of desserts are they?" },
      { "role": "assistant", "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them." },
      { "role": "user", "content": "What is the setting of the image?" },
      { "role": "assistant", "content": "The image is set on a table top with a plate containing the three desserts." }
    ]
  }
]

Environment

- OS: should be unrelated to the OS version
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

LDLINGLINGLING commented 17 hours ago

The target-building code:

target = torch.full_like(ids, -100, dtype=torch.int32)  # -100 is ignored by the loss
for i in range(1, len(ids)):
    if context[i] == 0:                          # assistant token: supervised
        target[i - 1] = ids[i]                   # position i-1 predicts token i
    if context[i] == 1 and context[i - 1] == 0:  # assistant turn just ended
        if hasattr(tokenizer, "eot_id"):
            target[i - 1] = tokenizer.eot_id     # Llama-3-style end-of-turn id
        else:
            target[i - 1] = tokenizer.eos_id     # otherwise the end-of-sequence id

Based on your description, I located line 126. I am not sure whether this is the place you mean. If so, from my experience this code builds the autoregressive target: since training is autoregressive, the first token is used to predict the second token, which is why the leading start symbol is stripped at the beginning.
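
To illustrate the shift with a made-up example (not the repo's data), where context marks prompt tokens with 1 and assistant tokens with 0:

import torch

ids = torch.tensor([128000, 11, 22, 33, 44], dtype=torch.int32)  # illustrative ids; ids[0] is BOS
context = [1, 1, 1, 0, 0]                                        # 1 = prompt, 0 = assistant reply

target = torch.full_like(ids, -100)   # -100 is ignored by the loss
for i in range(1, len(ids)):
    if context[i] == 0:
        target[i - 1] = ids[i]        # position i-1 learns to predict token i

print(target)  # tensor([-100, -100, 33, 44, -100], dtype=torch.int32)

Presumably the per-segment [1:] slice exists so that only one BOS survives at the head of the concatenated sequence, rather than one per turn; the failure reported above suggests it misfires when encode() adds no BOS in the first place.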