SHIMURA0 closed this issue 4 months ago.
What does your dataset currently look like?
The JSON file contains one big list; each element is a dict with an "id", an "image" field holding the image path, and a "conversations" field holding a list of dialogue turns.
Can you show what's inside conversations, specifically?
{ "role": "user", "content": "Classify the image as label 0 or 1." }, { "role": "assistant", "content": "This image is classified as label 0." },
Strangely enough, that bug goes away once I append .to("cuda") to AutoModel.from_pretrained(), but then I get a CUDA out of memory error instead.
I'd also like to ask another question: my server has 8 NVIDIA GPUs, but I can only use 7 of them (all except index 0). Do I need to modify the distributed-training code in finetune.py and finetune_lora.sh? 🤔
Although I still need to solve the original problem first 😂
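As a hedged sketch (not taken from the repo): rather than editing the distributed-training logic, one common approach is to hide GPU 0 via CUDA_VISIBLE_DEVICES, either exported in the shell that runs finetune_lora.sh or set at the very top of finetune.py before CUDA is initialized:

```python
import os

# Hypothetical sketch: expose only GPUs 1-7 to this process.
# This must happen before torch initializes CUDA; for a distributed
# launch the variable is usually exported in the shell before running
# the launcher rather than being set inside the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import torch

print(torch.cuda.device_count())  # should report 7 visible devices
```

With the variable in place, the per-node process count in the launch command (e.g. torchrun's --nproc_per_node) would also need to be 7; the exact flag depends on how finetune_lora.sh launches training.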
You haven't added it in your training file.
What GPUs are you using?
OK, thanks, I'll take a look first.
The GPUs are 8× NVIDIA V100.
Which one do you mean?
I also get "RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation" when running the LoRA script. I made sure I have the data in the desired format.
Attached is the output I get when running the command: output_lora.txt
FWIW, I also tried with this version of finetune.py and I get the same error as on the current main branch of the official repo.
Regarding deps, I installed the ones in requirements.txt plus the pinned versions mentioned in this PR.
I use only one machine (no parallelism).
Here is the nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:46:00.0 Off | 0 |
| N/A 32C P0 41W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Hi, we updated the code. Please try again.
Adding the token doesn't help either; I still get the original error.
Hi, we updated the code. Please try again.
With the latest code I get a new error: AttributeError: 'ModulesToSaveWrapper' object has no attribute 'embeddings'. It occurs in the model's modeling_minicpmv.py script at line 164, column 72.
For the model I used a local cache downloaded from ModelScope, and in finetune.py I changed the model loading to use ModelScope's AutoModel and AutoTokenizer to load that local cache. Could that have an effect?
I'd suggest re-downloading the model directly from Hugging Face.
Hi, we updated the code. Please try again.
I can confirm it's working now with "--bf16 true --bf16_full_eval true --fp16 false --fp16_full_eval false". Initially I tried fp16 and I got an error saying "Attempting to unscale FP16 gradients" so I switched to BF16.
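For anyone wiring those flags into code rather than the shell script, a hedged sketch of the equivalent HuggingFace TrainingArguments (assuming the script uses the standard Trainer; only the precision-related fields are shown, and the output path is hypothetical):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the CLI flags reported to work above.
# bf16 training does not use a gradient scaler, which is what
# typically raises "Attempting to unscale FP16 gradients" when
# fp16 training is combined with weights already stored in fp16.
training_args = TrainingArguments(
    output_dir="output/minicpmv_lora",  # hypothetical path
    bf16=True,
    bf16_full_eval=True,
    fp16=False,
    fp16_full_eval=False,
)
```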
Thank you for the fix!
You are welcome!
The embeddings issue is solved. But are the Hugging Face and ModelScope models really subtly different? Now the problem is CUDA out of memory. I've tried the usual fixes, so I suspect something in the distributed-training setup. If I just want to train on a single card (GPU 2 only), how should I modify finetune.py? Any pointers appreciated.
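On the single-GPU question, a minimal, hedged sketch of pinning everything to GPU 2 (assuming finetune.py is then run as a plain python process rather than through a distributed launcher):

```python
import os

# Hypothetical sketch: make only physical GPU 2 visible.
# Inside the process it is then addressed as "cuda:0".
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch

assert torch.cuda.device_count() == 1
device = torch.device("cuda:0")  # this is physical GPU 2
```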
Actually no, I'm not using any GPU at the moment, yet it still reports CUDA out of memory.
OK, I've solved it.
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
During LoRA fine-tuning I ran into "image start token != image end tokens", along with "UserWarning: None of the inputs have requires_grad=True" and "RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation". I suspect this is related to how I prepared the dataset. How can I fix it? Thanks in advance!
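For context on the RuntimeError itself (independent of this repo), it is PyTorch's guard against writing in-place into a view of a leaf tensor that requires grad; a minimal repro looks like this:

```python
import torch

x = torch.ones(4, requires_grad=True)  # leaf tensor that requires grad
v = x[:2]                              # a view of that leaf
v += 1.0                               # in-place write into the view
# RuntimeError: a view of a leaf Variable that requires grad
# is being used in an in-place operation.
```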
Expected Behavior
No response
Steps To Reproduce
No response
Environment
Anything else?
No response