Errors when running train_lora_lion.py

LiuHC0428 / LAW-GPT

Chinese legal dialogue language model

Errors when running train_lora_lion.py #14

Closed jojodan514 closed 1 year ago

jojodan514 commented 1 year ago

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.70 GiB total capacity; 5.63 GiB already allocated; 80.38 MiB free; 5.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I changed max_split_size_mb, but it still fails.
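
The error text itself says max_split_size_mb is worth trying when reserved memory is much larger than allocated memory. A minimal sketch, using a hypothetical helper named summarize_oom, that pulls the figures out of the message above to check whether that condition holds. The last value is only a rough estimate of memory not held by this process's PyTorch allocator (for example other processes or CUDA context overhead), and that interpretation is an assumption.

```python
import re

# Convert the units that appear in the message to GiB.
_TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}

# One pattern per figure reported in the out-of-memory message.
_PATTERNS = {
    "total": r"([\d.]+) (MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (MiB|GiB) already allocated",
    "free": r"([\d.]+) (MiB|GiB) free",
    "reserved": r"([\d.]+) (MiB|GiB) reserved in total",
}


def summarize_oom(message):
    """Hypothetical helper: extract the memory figures (in GiB) from the message."""
    stats = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        value, unit = match.groups()
        stats[name] = float(value) * _TO_GIB[unit]
    # A large gap here is the fragmentation case that max_split_size_mb targets.
    stats["reserved_minus_allocated"] = stats["reserved"] - stats["allocated"]
    # Rough estimate of memory on the GPU that this process's allocator does not hold.
    stats["not_held_by_this_process"] = (
        stats["total"] - stats["reserved"] - stats["free"]
    )
    return stats


MESSAGE = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.70 GiB total "
    "capacity; 5.63 GiB already allocated; 80.38 MiB free; 5.63 GiB reserved "
    "in total by PyTorch)"
)

for name, gib in summarize_oom(MESSAGE).items():
    print(f"{name}: {gib:.2f} GiB")
```

In this message reserved and allocated are both 5.63 GiB, so the fragmentation case does not seem to apply, which may be why changing max_split_size_mb made no difference. Checking what else is using GPU 0 (for example with nvidia-smi) would be a reasonable next step.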

LiuHC0428 commented 1 year ago

What is your batch_size, and what exactly is your GPU setup?

jojodan514 commented 1 year ago

I have already solved that problem. The issue now is that setting local_rank to 0 raises: ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2), device(type='cuda', index=3), device(type='cuda', index=4)}.

LiuHC0428 commented 1 year ago

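Leave local_rank at its default; you do not need to set it. The script uses model parallelism, and local_rank is an interface reserved for data parallelism.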
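
The ValueError above lists parameters on five different CUDA devices, which is what a model-parallel (sharded) model looks like. A minimal sketch of how to check for that before choosing a data-parallel wrapper. The names parameter_devices, spans_multiple_devices and FakeModel are hypothetical. FakeModel is only a stand-in so the snippet runs without a GPU; with PyTorch installed, a real torch.nn.Module can be passed instead.

```python
from types import SimpleNamespace


def parameter_devices(model):
    """Return the set of device names that hold the model's parameters."""
    return {str(p.device) for p in model.parameters()}


def spans_multiple_devices(model):
    """True when the parameters live on more than one device (model parallelism)."""
    return len(parameter_devices(model)) > 1


class FakeModel:
    """Stand-in (assumption) that mimics the .parameters() API of a torch module."""

    def __init__(self, devices):
        self._params = [SimpleNamespace(device=d) for d in devices]

    def parameters(self):
        return iter(self._params)


# Mirrors the device set reported in the ValueError above.
sharded = FakeModel(["cuda:0", "cuda:1", "cuda:2", "cuda:3", "cuda:4"])

if spans_multiple_devices(sharded):
    print("model is sharded across", sorted(parameter_devices(sharded)))
    print("leave local_rank at its default and skip the device_ids-based wrapper")
```

When the check returns True, passing device_ids=[0] to DistributedDataParallel is the combination the error message rejects, which is consistent with the advice to leave local_rank unset.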

jojodan514 commented 1 year ago

OK, thanks for the explanation. The evaluator on line 250 of train_lora_lion is not defined.

LiuHC0428 commented 1 year ago

OK, thanks. The validation step was removed from the training process~

jojodan514 commented 1 year ago

There is still a problem when running. It raises: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

LiuHC0428 commented 1 year ago

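The batch size may be too large, or the CUDA version may not match torch. An embedding dimension mismatch can also cause this. You need to print the specific parameters.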
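
Two of these checks can be sketched in a few lines. The first prints the installed torch version and the CUDA version it was built with, to compare against the driver. The second looks for token ids that fall outside the embedding table, which is one common form of embedding mismatch and an assumption about what is meant here. The helper names report_versions and out_of_range_ids, and the sample ids and table size, are hypothetical.

```python
def report_versions():
    """Return torch/CUDA version info, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


def out_of_range_ids(input_ids, num_embeddings):
    """Return the sorted, de-duplicated ids that the embedding table cannot index."""
    return sorted({i for i in input_ids if i < 0 or i >= num_embeddings})


print("versions:", report_versions())

# Illustrative values only: a table with 100 rows accepts ids 0..99.
sample_ids = [0, 5, 100, -1, 5, 99]
bad = out_of_range_ids(sample_ids, num_embeddings=100)
print("ids outside the embedding table:", bad)
```

With a real model, num_embeddings would come from the embedding layer (for example model.get_input_embeddings().num_embeddings in Hugging Face models) and input_ids from a tokenized batch.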

jojodan514 commented 1 year ago

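It fails with batch size set to either 2 or 1. torch is 1.3.1 and cuda is 11.7. The dataset is CrimeKgAssitant清洗后_52k (the cleaned 52k version). As for the embedding dimension, I am not sure how to check it.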

LiuHC0428 commented 1 year ago

Please provide the full error message and your configuration details.

jojodan514 commented 1 year ago

log.txt. The configuration is the same as in requirement.

LiuHC0428 commented 1 year ago

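I suggest setting a breakpoint just before the call at File "/home/biiteam/Storage/YLF/LAW-GPT-main/src/peft/src/peft/tuners/lora.py", line 480, in forward: result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias), to see where things go wrong. You can also check whether the GPU's CUDA version matches the CUDA version of the installed torch.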
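
At such a breakpoint, one thing worth comparing is the shape of x against the shape of the weight. A minimal sketch, assuming the usual F.linear convention that the weight has shape (out_features, in_features) and that fan_in_fan_out means the stored weight is transposed before use. The helper linear_shapes_compatible and the example shapes are hypothetical.

```python
def linear_shapes_compatible(x_shape, weight_shape, fan_in_fan_out=False):
    """Check whether F.linear(x, weight) would see matching feature sizes.

    x_shape: shape of the input, last dimension is the feature size.
    weight_shape: shape of the stored weight (2-D).
    fan_in_fan_out: if True, the stored weight is transposed before the call.
    """
    if len(weight_shape) != 2:
        raise ValueError("weight must be 2-D")
    if fan_in_fan_out:
        # Stored as (in_features, out_features); flip it to (out, in).
        weight_shape = tuple(reversed(weight_shape))
    _out_features, in_features = weight_shape
    return x_shape[-1] == in_features


# Illustrative shapes only: (batch, sequence, hidden) input.
x_shape = (2, 16, 4096)
print(linear_shapes_compatible(x_shape, (12288, 4096)))
print(linear_shapes_compatible(x_shape, (4096, 12288)))
print(linear_shapes_compatible(x_shape, (4096, 12288), fan_in_fan_out=True))
```

In the debugger, x.shape, self.weight.shape and self.fan_in_fan_out can be passed in directly. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 may also help make the reported line match the failing call, since CUDA errors are otherwise reported asynchronously.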

hycao commented 1 year ago

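@jojodan514 May I ask a question?

I have already downloaded the dataset, but I am still a bit confused.

When training, do we continue training on top of the existing laws-and-regulations model, or can we train from scratch?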