Open YingLaiLin opened 1 year ago
When training with the CPM fine-tuning script without enabling --use-delta, and with half set to false in the config file, the following error occurs:
Traceback (most recent call last):
File "finetune_cpm_bee.py", line 503, in <module>
In addition, the file cannot be found under the /tmp directory.
Is this caused by CPU offload being enabled, and does bmtrain have a switch to control it?
Hello, this happened because the loss_func operator previously supported only half precision. It has now been fixed.
When training CPM with the fine-tuning script without enabling the --use-delta option, the following error occurs:
Traceback (most recent call last):
File "finetune_cpm_bee.py", line 503, in <module>
main()
File "finetune_cpm_bee.py", line 499, in main
finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
File "finetune_cpm_bee.py", line 364, in finetune
optim_manager.step()
File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/optim_manager.py", line 131, in step
optimizer.step(scale=self.loss_scale)
File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
return func(*args, **kwargs)
File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/adam_offload.py", line 77, in step
state["_grad_fp16"] = torch.empty(p.size(), dtype=torch.float16, pin_memory=True) # on host
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
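
The failing line in the traceback allocates a pinned (page-locked) fp16 tensor on the host. A minimal sketch for checking whether that kind of allocation works in a given environment, independent of bmtrain, might look like the following. `can_pin_host_memory` is a hypothetical helper name, not part of bmtrain or PyTorch, and a failure here would only suggest an environment limitation; it does not confirm the cause discussed in this issue.

```python
def can_pin_host_memory(shape=(1024,)):
    """Return True if a pinned (page-locked) fp16 host tensor can be allocated.

    Hypothetical diagnostic helper; it mirrors the torch.empty(...) call
    shown in the traceback but is not part of bmtrain.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing can be pinned from here.
        return False
    try:
        # Same kind of call as the failing line in adam_offload.py.
        t = torch.empty(shape, dtype=torch.float16, pin_memory=True)
        return bool(t.is_pinned())
    except Exception as err:  # broad on purpose: this is only a probe
        print(f"pinned allocation failed: {err}")
        return False


if __name__ == "__main__":
    print("pinned host memory available:", can_pin_host_memory())
```

If this probe fails with the same RuntimeError, the problem is likely reproducible without the fine-tuning script, which can help narrow down a bug report.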