LiuHC0428 / LAW-GPT

Chinese legal dialogue language model
1.06k stars, 119 forks

Errors when running train_lora_lion.py #14

Closed: jojodan514 closed this issue 1 year ago

jojodan514 commented 1 year ago

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.70 GiB total capacity; 5.63 GiB already allocated; 80.38 MiB free; 5.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I changed max_split_size_mb, but it still fails.
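The numbers in the error message can be read directly to see whether fragmentation is a plausible cause. The following is a minimal sketch, not part of the repository: `parse_oom` and `diagnose` are hypothetical helper names, and the regular expressions assume the message wording shown above.

```python
import re

# Convert the units PyTorch prints in its OOM message to GiB.
_UNITS = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message: str) -> dict:
    """Extract the sizes (in GiB) from a PyTorch CUDA out-of-memory message."""
    patterns = {
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved in total",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        sizes[name] = float(match.group(1)) * _UNITS[match.group(2)]
    return sizes


def diagnose(sizes: dict) -> dict:
    """Split the card's memory into the parts relevant to an OOM."""
    # Reserved by PyTorch but not in use; a large value hints at fragmentation.
    cached_unused = sizes["reserved"] - sizes["allocated"]
    # Neither free nor reserved by this process's PyTorch allocator.
    outside = sizes["total"] - sizes["reserved"] - sizes["free"]
    return {"cached_unused_gib": cached_unused, "outside_process_gib": outside}


message = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.70 GiB total "
    "capacity; 5.63 GiB already allocated; 80.38 MiB free; 5.63 GiB reserved "
    "in total by PyTorch)"
)
report = diagnose(parse_oom(message))
print(report)
```

For this message, reserved and allocated are equal, so there is no cached-but-unused memory for max_split_size_mb to reclaim. About 18 GiB of GPU 0 is neither free nor reserved by this process's allocator, which suggests (but does not prove) that something else was occupying the card; `nvidia-smi` would show what.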

LiuHC0428 commented 1 year ago

What is your batch_size, and what exactly is your GPU setup?

jojodan514 commented 1 year ago

I have solved that problem. The issue now is that setting local_rank to 0 raises an error: ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2), device(type='cuda', index=3), device(type='cuda', index=4)}.

LiuHC0428 commented 1 year ago

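Leave local_rank at its default; there is no need to set it. The script uses model parallelism, and local_rank is only an interface reserved for data parallelism.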
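The ValueError follows from a rule in PyTorch's DistributedDataParallel: when a module's parameters are spread over several GPUs, as in model parallelism, `device_ids` and `output_device` must be left as None. The sketch below only illustrates that rule in plain Python; `ddp_device_kwargs` is a hypothetical helper, not something from train_lora_lion.py, and the device strings stand in for what `str(p.device)` would give for each parameter.

```python
def ddp_device_kwargs(param_devices, local_rank):
    """Choose device arguments for DistributedDataParallel.

    param_devices: iterable of device strings such as "cuda:0"
    (hypothetical input; collect them from the model's parameters).
    """
    devices = set(param_devices)
    if len(devices) > 1 or devices == {"cpu"}:
        # Multi-device (model-parallel) and CPU modules: both must stay unset.
        return {"device_ids": None, "output_device": None}
    # Single-GPU module: one process per GPU, pinned by local_rank.
    return {"device_ids": [local_rank], "output_device": local_rank}


# The model in the error message is sharded over five GPUs.
sharded = ["cuda:0", "cuda:1", "cuda:2", "cuda:3", "cuda:4"]
print(ddp_device_kwargs(sharded, local_rank=0))

# A model that fits on one GPU can be pinned to that rank's device.
print(ddp_device_kwargs(["cuda:0"], local_rank=0))
```

With the model already sharded across cards 0 to 4, passing `device_ids=[0]` is exactly the combination the check rejects, which is consistent with the advice to leave local_rank unset.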

jojodan514 commented 1 year ago

OK, thanks for the explanation. The evaluator on line 250 of train_lora_lion is not defined.

LiuHC0428 commented 1 year ago

OK, thanks. The validation step has been removed from the training process.

jojodan514 commented 1 year ago

There is still a problem when running it. It raises: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

LiuHC0428 commented 1 year ago

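The batch size may be too large, or the CUDA version may not match torch. An embedding dimension mismatch is also possible. You will need to print the specific parameters.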

jojodan514 commented 1 year ago

I tried batch sizes of 2 and 1, and neither works. torch is 1.3.1, cuda is 11.7, and the dataset is CrimeKgAssitant清洗后_52k. As for the embedding dimension, I am not sure how to check it.
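One way an embedding mismatch can show up as a cuBLAS error is a token id that is larger than the embedding table: the invalid lookup fails asynchronously on the GPU and the error surfaces at a later CUDA call. The check below is a minimal sketch in plain Python so it can run on CPU. `out_of_range_ids` is a hypothetical helper and the numbers are made up for illustration; with a Hugging Face model the table shape can usually be read from `model.get_input_embeddings().weight.shape`, assuming the model class implements that method.

```python
def out_of_range_ids(batches, num_embeddings):
    """Return (batch_index, position, token_id) for ids the table cannot hold."""
    bad = []
    for batch_index, ids in enumerate(batches):
        for position, token_id in enumerate(ids):
            # Valid embedding indices are 0 .. num_embeddings - 1.
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((batch_index, position, token_id))
    return bad


# Made-up values for illustration; not taken from the log in this issue.
num_embeddings = 100
batches = [[1, 5, 99], [3, 100, 7]]
print(out_of_range_ids(batches, num_embeddings))  # [(1, 1, 100)]
```

Running the same batch through the model on CPU, or setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before launching, usually makes the failing operation appear in the traceback at the point where it actually happens.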

LiuHC0428 commented 1 year ago

Please provide the full error message and your configuration details.

jojodan514 commented 1 year ago

log.txt. The configuration is the same as in the requirements file.

LiuHC0428 commented 1 year ago

I suggest setting a breakpoint just before File "/home/biiteam/Storage/YLF/LAW-GPT-main/src/peft/src/peft/tuners/lora.py", line 480, in forward result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias) to see what is going wrong. You can also check whether the GPU's CUDA version matches the CUDA version of the installed torch.
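The version comparison can be done from Python. This is a minimal sketch: `cuda_report` is a hypothetical helper name, and it returns placeholder values if torch is not installed in the current environment.

```python
import importlib.util


def cuda_report():
    """Collect the torch build and CUDA runtime facts worth comparing."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment.
        return {"torch": None, "built_with_cuda": None,
                "cuda_available": False, "devices": []}
    import torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

The `built_with_cuda` value is the CUDA version the torch wheel was compiled against; the CUDA version printed by `nvidia-smi` is the highest one the installed driver supports and should be at least as new. At the breakpoint itself, printing `x.shape`, `x.device`, and `self.weight.shape` is a reasonable first look.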

hycao commented 1 year ago

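@jojodan514 A quick question, if you don't mind.

I have already downloaded the dataset, but I am still unsure about one thing.

Does training continue from the existing laws-and-regulations model, or can it be started from scratch?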