OpenBMB / CPM-Bee

A Chinese-English bilingual foundation model with ten billion parameters

Fine-tuning CPM-Bee with half enabled raises a CUDA error; with half disabled it raises an assert error #79

Open YingLaiLin opened 1 year ago

YingLaiLin commented 1 year ago

When CPM-Bee is trained with the fine-tuning script and the --use-delta option is not enabled, the following error occurs:

Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 364, in finetune
    optim_manager.step()
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/optim_manager.py", line 131, in step
    optimizer.step(scale=self.loss_scale)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/optim/adam_offload.py", line 77, in step
    state["_grad_fp16"] = torch.empty(p.size(), dtype=torch.float16, pin_memory=True) # on host
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
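
The failing line in the traceback above is a pinned host allocation (`torch.empty(..., pin_memory=True)`). A minimal sketch for checking whether that kind of allocation works at all in the current environment, independent of the fine-tuning script, might look like the following. The helper name `can_pin_host_memory` and its `allocate` parameter are hypothetical and not part of bmtrain or PyTorch; only `torch.empty` with `pin_memory=True` comes from the traceback.

```python
def can_pin_host_memory(allocate=None):
    """Try one small pinned host allocation and report whether it worked.

    `allocate` is a hypothetical hook (not a bmtrain or PyTorch API): any
    zero-argument callable that performs the allocation. When it is omitted,
    the helper uses torch.empty(..., pin_memory=True), the same kind of call
    that fails in adam_offload.py, but with a tiny tensor.
    """
    if allocate is None:
        import torch  # imported lazily so the helper can be exercised without torch

        def allocate():
            return torch.empty(16, dtype=torch.float16, pin_memory=True)

    try:
        allocate()
    except RuntimeError as exc:
        # PyTorch surfaces CUDA failures as RuntimeError; keep the message for the report.
        return False, str(exc)
    return True, "pinned host allocation succeeded"
```

Calling `can_pin_host_memory()` with no arguments inside the same conda environment should show whether the error is specific to the optimizer step or reproduces with any pinned allocation. If it reproduces, the cause is more likely the driver, container, or OS configuration than the fine-tuning script itself.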

YingLaiLin commented 1 year ago

When CPM-Bee is trained with the fine-tuning script, without --use-delta and with half set to false in the config file, the following error occurs:

Traceback (most recent call last):
  File "finetune_cpm_bee.py", line 503, in <module>
    main()
  File "finetune_cpm_bee.py", line 499, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler, optim_manager)
  File "finetune_cpm_bee.py", line 352, in finetune
    loss = loss_func(logits.view(-1, logits.size(-1)), targets.view(-1))
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 192, in forward
    ret = OpFusedCrossEntropy.apply(input, target.int(), self.ignore_index) # return float tensor
  File "/ms_test2/miniconda3/envs/ms1.11/lib/python3.7/site-packages/bmtrain/loss/cross_entropy.py", line 18, in forward
    ignore_index,
RuntimeError: input.dtype() == torch::kHalf INTERNAL ASSERT FAILED at "/tmp/pip-install-clhfk_l1/bmtrain_fe0a61bb02844d4b85067c24e12d4e87/csrc/cross_entropy_loss.cpp":25, please report a bug to PyTorch. input must be a half tensor

In addition, that file cannot be found under the /tmp directory.

YingLaiLin commented 1 year ago

Is this caused by CPU offload being enabled? Does bmtrain have a switch that controls it?

gongbaitao commented 1 year ago

Hello, this happened because the loss_func operator previously supported only half precision. It has now been fixed.
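
For anyone pinned to an older bmtrain build where the fused loss still asserts on non-half input, one possible stopgap is to dispatch on the dtype of the logits and fall back to a loss that accepts float32. The sketch below is an assumption, not the fix that was applied upstream: `make_dtype_safe_loss` and its parameters are hypothetical names, and the two loss callables are supplied by the caller.

```python
def make_dtype_safe_loss(fused_loss, fallback_loss, half_dtype):
    """Return a loss callable that picks an implementation from the logits dtype.

    All three parameters are hypothetical names for illustration:
      fused_loss    - callable that only accepts half-precision logits
      fallback_loss - callable that accepts other dtypes (e.g. float32)
      half_dtype    - the dtype object the fused loss requires (e.g. torch.float16)
    """
    def loss(logits, targets):
        # Route half-precision logits to the fused kernel, everything else to the fallback.
        if logits.dtype == half_dtype:
            return fused_loss(logits, targets)
        return fallback_loss(logits, targets)

    return loss


# Possible wiring inside a training script (assumed, untested against CPM-Bee):
#
#   import torch
#   import torch.nn.functional as F
#   import bmtrain as bmt
#
#   loss_func = make_dtype_safe_loss(
#       bmt.loss.FusedCrossEntropy(ignore_index=-100),
#       lambda x, y: F.cross_entropy(x, y.long(), ignore_index=-100),
#       torch.float16,
#   )
```

Upgrading bmtrain to a version that contains the fix remains the simpler route; the wrapper only avoids the assertion and does not change how the model itself is cast.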