Open Single430 opened 2 days ago
target = torch.full_like(ids, -100, dtype=torch.int32)
for i in range(1, len(ids)):
    if context[i] == 0:
        target[i - 1] = ids[i]
    if context[i] == 1 and context[i - 1] == 0:
        if hasattr(tokenizer, "eot_id"):
            target[i - 1] = tokenizer.eot_id
        else:
            target[i - 1] = tokenizer.eos_id
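Based on your description, I located line 126. I'm not sure whether this is the part you mean. If it is, then in my experience this code builds the autoregressive target: because training is autoregressive, each token is used to predict the next one, so the start token at the very beginning is dropped.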
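To make the shift concrete, here is a minimal sketch of the same loop using plain Python lists instead of tensors. The token ids, the `context` mask, and the `end_id` value are made up for illustration (they are not taken from the real tokenizer), and `build_targets` is a hypothetical helper name, not a function in `dataset.py`.

```python
IGNORE_INDEX = -100  # label value that the loss skips, as in the snippet above


def build_targets(ids, context, end_id):
    """Shift-by-one labels: position i - 1 is trained to predict token i."""
    target = [IGNORE_INDEX] * len(ids)
    for i in range(1, len(ids)):
        # Token i belongs to an assistant span, so position i - 1 should predict it.
        if context[i] == 0:
            target[i - 1] = ids[i]
        # An assistant span just ended, so its last position should predict the end token.
        if context[i] == 1 and context[i - 1] == 0:
            target[i - 1] = end_id
    return target


# Hypothetical toy sequence: 1 = start token, 10-11 = user turn,
# 20-22 = assistant turn, 12 = first token of the next user turn.
ids = [1, 10, 11, 20, 21, 22, 12]
context = [1, 1, 1, 0, 0, 0, 1]  # 0 marks tokens the model must learn to produce
targets = build_targets(ids, context, end_id=2)
print(targets)  # [-100, -100, 20, 21, 22, 2, -100]
```

Note that the targets sit one position to the left of the tokens they correspond to, and position 0 never receives a label for the start token. If the printed values in the log were taken from the shifted sequence, a missing first token would be expected rather than a sign of data loss, though that depends on exactly what was printed.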
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
Modified code in dataset.py:
The printed log output is as follows:
From this output, the prefix_ids result is incomplete:
<用户> -> 用户>
The message_ids result is also incomplete:
Yes. -> .
Expected Behavior
Training runs normally.
Steps To Reproduce
/MiniCPM-V/finetune# ./finetune_lora.sh
The dataset follows the official standard format:
[
  {
    "id": "0",
    "image": 'path/to/image_0.jpg',
    "conversations": [
      { 'role': 'user', 'content': '<image>\nHow many desserts are on the white plate?' },
      { 'role': 'assistant', 'content': 'There are three desserts on the white plate.' },
      { 'role': 'user', 'content': 'What type of desserts are they?' },
      { 'role': 'assistant', 'content': 'The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them.' },
      { 'role': 'user', 'content': 'What is the setting of the image?'},
      { 'role': 'assistant', 'content': 'The image is set on a table top with a plate containing the three desserts.' },
    ]
  },
]
Environment
Anything else?
No response