OpenBMB / CPM-Bee

A Chinese-English bilingual foundation model with ten billion parameters
2.68k stars 211 forks

Does BMTrain currently support CUDA 12? #55

Open MKChens opened 1 year ago

MKChens commented 1 year ago

Does BMTrain currently work with CUDA 12?

MKChens commented 1 year ago

Fine-tuning the model with BMTrain on CUDA 12 fails with the following error:

Traceback (most recent call last):
  File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 427, in <module>
    main()
  File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 422, in main
    tokenizer, model, optimizer, lr_scheduler, optim_manager = setup_model_and_optimizer(args)
  File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 73, in setup_model_and_optimizer
    model = get_model(args)
  File "/home/worker/chenmingkun/github/CPM_Bee/src/finetune_cpm_bee.py", line 39, in get_model
    bmt.load(model, args.load)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/bmtrain/store.py", line 227, in load
    ret = model.load_state_dict(
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
    load(self, state_dict)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2009, in load
    module._load_from_state_dict(
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/bmtrain/block_layer.py", line 532, in _load_from_state_dict
    for name, param in self.named_parameters():
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2112, in named_parameters
    gen = self._named_members(
TypeError: CheckpointBlock._named_members() got an unexpected keyword argument 'remove_duplicate'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 108098) of binary: /apps/home/worker/anaconda3/envs/CPM10/bin/python
Traceback (most recent call last):
  File "/apps/home/worker/anaconda3/envs/CPM10/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/apps/home/worker/anaconda3/envs/CPM10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

MKChens commented 1 year ago

Workaround:
1. In the installed PyTorch package, edit site-packages/torch/version.py and change the version to 12.1.
2. In the installed PyTorch package, edit site-packages/torch/nn/modules/module.py and delete ", remove_duplicate=remove_duplicate" from line 2112.

xgsong commented 1 year ago

> Workaround:
> 1. In the installed PyTorch package, edit site-packages/torch/version.py and change the version to 12.1.
> 2. In the installed PyTorch package, edit site-packages/torch/nn/modules/module.py and delete ", remove_duplicate=remove_duplicate" from line 2112.

Hard-coding the fix like this is indeed problematic.
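
The traceback shows `named_parameters` passing a `remove_duplicate` keyword to a `_named_members` override that does not accept it. As a rough illustration only, not a fix confirmed in this thread, step 2 of the workaround could be done from the training script rather than by editing files under site-packages. The sketch below wraps a method so that keywords it does not declare are dropped. `tolerate_extra_kwargs`, `patch_named_members`, `NewStyleModule` and `OldStyleBlock` are hypothetical names, and the two classes are simplified stand-ins so the example runs without PyTorch or BMTrain installed.

```python
import functools
import inspect


def tolerate_extra_kwargs(func):
    """Return a wrapper around func that drops keyword arguments func does not declare."""
    params = inspect.signature(func).parameters
    # If func already accepts **kwargs there is nothing to filter.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        accepted = {k: v for k, v in kwargs.items() if k in params}
        return func(*args, **accepted)

    return wrapper


class NewStyleModule:
    """Hypothetical stand-in for a base class whose caller passes a newer keyword."""

    def named_parameters(self, remove_duplicate=True):
        return list(self._named_members(prefix="", remove_duplicate=remove_duplicate))

    def _named_members(self, prefix="", remove_duplicate=True):
        return iter(())


class OldStyleBlock(NewStyleModule):
    """Hypothetical stand-in for a subclass written against the older signature."""

    def _named_members(self, prefix=""):
        yield prefix + "weight", 1.0


def patch_named_members(cls):
    """Replace cls._named_members with a version that ignores unknown keywords."""
    cls._named_members = tolerate_extra_kwargs(cls._named_members)
    return cls


if __name__ == "__main__":
    patch_named_members(OldStyleBlock)
    print(OldStyleBlock().named_parameters())
```

In a real script the same helper would presumably be applied to the `CheckpointBlock` class named in the traceback before `bmt.load` is called; that usage is an assumption and has not been tested against BMTrain. Like deleting the argument by hand, this silently ignores `remove_duplicate`, so it is a stopgap until the library's own signature matches the installed PyTorch version.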