OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0

[BUG] LoRA fine-tuning issue #342

Closed SHIMURA0 closed 4 months ago

SHIMURA0 commented 4 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

During LoRA fine-tuning I hit `image start token != image end tokens`, as well as `UserWarning: None of the inputs have requires_grad=True` and `RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation`. I suspect this is related to how my dataset is prepared. How can I fix it? Thanks in advance!
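For context, this particular RuntimeError is PyTorch's generic complaint about mutating a view of a leaf tensor in place. A minimal sketch, unrelated to the training code itself, that reproduces the same message:

```python
import torch

# A leaf tensor that requires grad, and a view into it.
w = torch.zeros(4, requires_grad=True)
v = w[:2]

# Mutating the view in place triggers:
# RuntimeError: a view of a leaf Variable that requires grad
# is being used in an in-place operation.
v.add_(1.0)
```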

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Ubuntu
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

qyc-98 commented 4 months ago

What does your dataset currently look like?

SHIMURA0 commented 4 months ago

The JSON file contains a single list of dicts. Each dict has an "ID", an "image" field holding the image path, and a "conversations" field holding the list of dialogue turns.

qyc-98 commented 4 months ago

Can you show what conversations looks like, specifically?

SHIMURA0 commented 4 months ago

{ "role": "user", "content": "Classify the image as label 0 or 1." }, { "role": "assistant", "content": "This image is classified as label 0." },

SHIMURA0 commented 4 months ago

Strangely, once I append .to("cuda") after AutoModel.from_pretrained(), that bug disappears, but then I hit a CUDA out of memory error.
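If the OOM happens at load time, one common mitigation is to load the weights in bf16 rather than the fp32 default before moving them to the GPU. A minimal sketch, assuming the public Hugging Face checkpoint id (substitute a local path when loading from a cache):

```python
import torch
from transformers import AutoModel

# Loading in bf16 roughly halves weight memory versus the fp32 default.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda")
```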

SHIMURA0 commented 4 months ago

Another question while I'm at it: I have 8 NVIDIA GPUs on one server but can only use 7 of them (all except index 0). Do I need to modify the distributed-training code in finetune.py and in finetune_lora.sh? 🤔

SHIMURA0 commented 4 months ago

Still need to solve the original problem first, though 😂

qyc-98 commented 4 months ago

Your training file is missing the `<image>` token. Please organize your data as described here: https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#data-preparation

You don't need to modify finetune.py. Instead, set which GPUs are visible before running the script; see https://discuss.pytorch.org/t/what-does-export-cuda-visible-devices-1-really-do/90340
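A minimal sketch of what the linked thread describes: CUDA_VISIBLE_DEVICES must be set before the process initializes CUDA, either exported in the shell before invoking finetune_lora.sh or set at the very top of the Python entry point:

```python
import os

# Hide GPU 0; the remaining seven cards are renumbered cuda:0 ... cuda:6
# inside the process. This must happen before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import torch
print(torch.cuda.device_count())  # 7 on an 8-GPU machine
```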

qyc-98 commented 4 months ago

What GPUs are you using?

SHIMURA0 commented 4 months ago

Thanks, I'll take a look. The GPUs are NVIDIA V100 × 8. You mean the `<image>` token, right? I did prepare the data following that guide, but then I read the sentence "If you don't provide `<image>`, the image will be placed at the front of the conversation" and removed the token. So you're saying the token is required after all? I'll try it tomorrow. Thanks again!
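For reference, a sketch of a single training record with the token included, in the shape the data-preparation guide linked above describes (the id, path, and text here are placeholders):

```python
import json

# One record: the user turn carries an explicit <image> placeholder
# marking where the image belongs in the conversation.
record = {
    "id": "0",
    "image": "path/to/example.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nClassify the image as label 0 or 1."},
        {"role": "assistant", "content": "This image is classified as label 0."},
    ],
}

# The training file is a list of such records.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```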

Mihaiii commented 4 months ago

I also get "RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation" when running the LoRA script. I made sure my data is in the required format.

Attached is the output I get when running the command: output_lora.txt

FWIW, I also tried with this version of finetune.py and got the same error as on the current main branch of the official repo.

Regarding dependencies, I installed the ones in requirements.txt plus the pinned versions mentioned in this PR.

I use only one machine (no parallelism).

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:46:00.0 Off |                    0 |
| N/A   32C    P0             41W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

qyc-98 commented 4 months ago

Hi, we've updated the code. Please try again.

SHIMURA0 commented 4 months ago

Adding the token didn't help; I still get the original error.

SHIMURA0 commented 4 months ago

> Hi, we've updated the code. Please try again.

Using the latest code, I now get a new error: AttributeError: "ModulesToSaveWrapper" object has no attribute "embeddings". It occurs at lines 164 and 72 of modeling_minicpmv.py in the model files.

SHIMURA0 commented 4 months ago

> Hi, we've updated the code. Please try again.

> Using the latest code, I now get a new error: AttributeError: "ModulesToSaveWrapper" object has no attribute "embeddings", at lines 164 and 72 of modeling_minicpmv.py.

I'm using a local cache of the model downloaded from ModelScope, and I changed the model loading in finetune.py to use AutoModel and AutoTokenizer from modelscope to load that local cache. Could that be the cause?

qyc-98 commented 4 months ago

I suggest re-downloading the model directly from Hugging Face.

Mihaiii commented 4 months ago

> Hi, we've updated the code. Please try again.

I can confirm it's working now with "--bf16 true --bf16_full_eval true --fp16 false --fp16_full_eval false". Initially I tried FP16 and got an error saying "Attempting to unscale FP16 gradients", so I switched to BF16.
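One quick check that may save others the same round trip: BF16 requires Ampere-class hardware (compute capability 8.0+), so it's worth verifying support before flipping the precision flags:

```python
import torch

# True on A100/Ampere and newer; False on e.g. V100, where only FP16
# mixed precision is available in hardware.
print(torch.cuda.is_bf16_supported())
```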

Thank you for the fix!

qyc-98 commented 4 months ago

You are welcome!

SHIMURA0 commented 4 months ago

The embeddings problem is solved. But do the Hugging Face and ModelScope models really differ subtly? The current problem is CUDA out of memory. I've already tried the common fixes, so I suspect something in the distributed-training setup. If I only want to train on GPU 2 (a single card), how should I modify finetune.py?

SHIMURA0 commented 4 months ago

Wait, no: I'm currently not using any GPU at all, yet it still reports CUDA out of memory.

SHIMURA0 commented 4 months ago

OK, I've solved it.