Training GLM-4v-9B on two A800 GPUs: ValueError: Expected input batch_size (3430) to match target batch_size (232).

FanWan commented 3 months ago

System Info

CUDA 11.8, transformers 4.42.3, peft 0.11.1, torch 2.3.1

Who can help?

No response

Information

[X] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

1. Prepare the data.

2. Run the official script: CUDA_VISIBLE_DEVICES=3,6 python finetune_vision.py data/ /home/ssd1/luojinchang/glm-4v-9b configs/lora.yaml

3. The following error is raised:

    ... in _call_impl
      1538  if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
      1539          or _global_backward_pre_hooks or _global_backward_hooks
      1540          or _global_forward_hooks or _global_forward_pre_hooks):
    ❱ 1541      return forward_call(*args, **kwargs)
      1542
      1543  try:
      1544      result = None

    /root/anaconda3/envs/glm4/lib/python3.10/site-packages/torch/nn/modules/loss.py:1185 in forward
      1182      self.label_smoothing = label_smoothing
      1183
      1184  def forward(self, input: Tensor, target: Tensor) -> Tensor:
    ❱ 1185      return F.cross_entropy(input, target, weight=self.weight,
      1186                             ignore_index=self.ignore_index, reduction=self.reduction,
      1187                             label_smoothing=self.label_smoothing)
      1188

    /root/anaconda3/envs/glm4/lib/python3.10/site-packages/torch/nn/functional.py:3086 in cross_entropy
      3083      )
      3084  if size_average is not None or reduce is not None:
      3085      reduction = _Reduction.legacy_get_string(size_average, reduce)
    ❱ 3086  return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(re
      3087
      3088
      3089  def binary_cross_entropy(

    ValueError: Expected input batch_size (3430) to match target batch_size (232).
      0%|          | 0/3000 [05:04<?, ?it/s]
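
For context, the two numbers in the message are the flattened row counts that reach `F.cross_entropy`. The sketch below assumes the usual causal-LM pattern (drop one position per sequence for the shift, then flatten the batch and sequence dimensions together). The function names and the example lengths are hypothetical, chosen only so that the totals match the message; the real batch size and sequence lengths are not visible in the traceback.

```python
def flattened_rows(batch_size: int, seq_len: int) -> int:
    """Rows left after the causal shift (one position dropped per sequence)
    and after flattening the batch and sequence dimensions together."""
    return batch_size * (seq_len - 1)


def check_alignment(batch_size: int, logits_len: int, labels_len: int) -> None:
    """Raise the same kind of error that cross_entropy reports when logits
    and labels cover a different number of positions."""
    n_input = flattened_rows(batch_size, logits_len)
    n_target = flattened_rows(batch_size, labels_len)
    if n_input != n_target:
        raise ValueError(
            f"Expected input batch_size ({n_input}) to match "
            f"target batch_size ({n_target})."
        )


# Hypothetical lengths: batch of 2, logits over 1716 positions, labels over 117.
message = ""
try:
    check_alignment(batch_size=2, logits_len=1716, labels_len=117)
except ValueError as err:
    message = str(err)
    print(message)
```

If the logits cover many more positions than the labels, one plausible cause is that image tokens were expanded in the model input while the labels were not extended to the same length. That is an assumption, not something the traceback confirms.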

Expected behavior

The script should run successfully.

XuRui314 commented 3 months ago

Could this be because the glm4v files have not been updated to the latest version? Try re-downloading modeling_chatglm.py from Hugging Face.

FanWan commented 3 months ago

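> Could this be because the glm4v files have not been updated to the latest version? Try re-downloading modeling_chatglm.py from Hugging Face.

I updated to the latest modeling_chatglm.py, but I still get the same error.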

zRzRzRzRzRzRzR commented 3 months ago

The dataset is probably built incorrectly. Can you see which part of the code leads into this error?

EthanLeo-LYX commented 3 months ago

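> > Could this be because the glm4v files have not been updated to the latest version? Try re-downloading modeling_chatglm.py from Hugging Face.
>
> I updated to the latest modeling_chatglm.py, but I still get the same error.

I ran into this problem yesterday as well, and it went away after I updated modeling_chatglm.py.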
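
One way to confirm that the local copy really is the updated file is to compare its checksum against a freshly downloaded copy. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the demo uses temporary stand-in files rather than real model files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_content(local: Path, reference: Path) -> bool:
    """True if both files have byte-identical content."""
    return sha256_of(local) == sha256_of(reference)


# Demo with temporary stand-in files (not the real modeling_chatglm.py).
with tempfile.TemporaryDirectory() as tmp:
    old_copy = Path(tmp) / "old_modeling_chatglm.py"
    new_copy = Path(tmp) / "new_modeling_chatglm.py"
    old_copy.write_text("# older revision\n")
    new_copy.write_text("# newer revision\n")
    files_match = same_file_content(old_copy, new_copy)
    print("identical" if files_match else "local copy differs from reference")
```

In practice, `local` would point at the modeling_chatglm.py inside the model directory passed to the fine-tuning script, and `reference` at a copy downloaded again from Hugging Face.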

FanWan commented 3 months ago

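> The dataset is probably built incorrectly. Can you see which part of the code leads into this error?

This is the data format: {"messages": [{"role": "user", "content": "What might be the reason this bird with an orange beak chose this broken branch as its perch?", "image": "/home/ssd1/luojin/GLM-4/finetune_demo/data/sample_images/000034000.jpg"}, {"role": "assistant", "content": "One possible reason is that this colorful bird chose the broken branch as a perch because it offers a good vantage point. The branch's open, elevated position gives the bird a clear view of its surroundings, which may help it watch for potential predators, spot food sources, or observe nearby nests. In addition, the broken branch may have a texture or shape that the bird finds comfortable, making it a suitable place to rest or observe the environment."}]}

The error is raised during training.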
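
The record format shown above can be sanity-checked before training. The sketch below is a hypothetical validator, not part of the official scripts: it assumes one JSON object per line, a `messages` list whose entries carry `role` and `content`, an optional `image` path, and a final assistant turn. The set of allowed roles is also an assumption.

```python
import json
import os
import tempfile

# Assumption: roles accepted by the fine-tuning data format.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str, check_images: bool = True) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        image = msg.get("image")
        if image is not None and check_images and not os.path.isfile(image):
            problems.append(f"message {i}: image not found: {image}")

    last = messages[-1]
    if not (isinstance(last, dict) and last.get("role") == "assistant"):
        problems.append("last message should come from the assistant")
    return problems


# Demo: one well-formed record (with a temporary stand-in image) and one bad one.
with tempfile.NamedTemporaryFile(suffix=".jpg") as img:
    good = json.dumps({"messages": [
        {"role": "user", "content": "What is in the picture?", "image": img.name},
        {"role": "assistant", "content": "A bird on a branch."},
    ]})
    good_problems = validate_record(good)

bad = json.dumps({"messages": [
    {"role": "user", "content": "hi", "image": "/no/such/file.jpg"},
]})
bad_problems = validate_record(bad)
print(good_problems)
print(bad_problems)
```

A check like this only catches structural problems such as missing images or malformed turns. It does not inspect tokenization, so it cannot rule out a length mismatch introduced later in the collate or model code.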

zRzRzRzRzRzRzR commented 2 months ago

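I have not run into this problem. Which version of the glm-4v-9b model implementation are you using now? The latest code should definitely not produce this mismatch.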

THUDM / GLM-4