[BUG] <在funetune/dataset.py中报告image start token != image end tokens的错误>

KeepFaithMe commented 2 months ago

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

[X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

[X] 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

在执行funetune_lora.py时，出现image start token != image end tokens的错误，我将与报错相关的变量打印出来，其结果如下：image_start_tokens: tensor([], dtype=torch.int64) image_end_tokens: tensor([66]) image_start_tokens是空，它的结果来源于这一段代码：“image_start_tokens = torch.where(ids == tokenizer.im_start_id)[0]”，也就是说ids中不存在与tokenizer.im_start_id相等的值。 ids，image_start_tokens ，image_end_tokens三个变量的结果如下图： 1726658289123 数据的组织方式如下： 1726658383833

期望行为 | Expected Behavior

解决上述bug

复现方法 | Steps To Reproduce

配置好基本环境，按照图2的数据组织方式即可复现，注意，测试数据确实只包含了一张图片，但我也测试过多张图片，也不对。

运行环境 | Environment

- OS:Ubuntu 22.04
- Python:3.10
- Transformers: 4.40.0
- PyTorch:2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):11.8

备注 | Anything else?

No response

KeepFaithMe commented 2 months ago

补充一下，微调使用的模型是MiniCPM-V-int4

XHB-ZMM commented 2 months ago

如果你的对话中，除了第一个user有\n，同一个对话的某些user也有\n占位符，就会出现这个错误

natsoe7 commented 1 month ago

lapyae

On Wed, Oct 9, 2024, 9:23 AM qianyu chen @.***> wrote:

Closed #587 https://github.com/OpenBMB/MiniCPM-V/issues/587 as completed.

— Reply to this email directly, view it on GitHub https://github.com/OpenBMB/MiniCPM-V/issues/587#event-14565044161, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMIIOEH7TKVJCIDP56WNTATZ2SLBHAVCNFSM6AAAAABONOURHWVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJUGU3DKMBUGQYTMMI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

OpenBMB / MiniCPM-V