TsinghuaAI / CPM-2-Finetune

Finetune CPM-2
MIT License

Error loading FusedAdam on A100 #44

Open giter000 opened 1 year ago

giter000 commented 1 year ago

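Hello, I am trying to run the code on two NVIDIA A100-PCIE-40GB cards, using the provided image environment as-is. Loading FusedAdam keeps failing with the error below. Reinstalling apex did not help, and I have not found a fix yet: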

Total train epochs 10 | Total train iters 286497 | building Enc-Dec model ...

number of parameters on model parallel rank 1: 5543798784
number of parameters on model parallel rank 0: 5543798784
Traceback (most recent call last):
  File "/mnt/finetune_cpm2.py", line 808, in <module>
    main()
  File "/mnt/finetune_cpm2.py", line 791, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
  File "/mnt/utils.py", line 213, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args, prompt_config)
  File "/mnt/utils.py", line 163, in get_optimizer
    optimizer = Adam(param_groups,
  File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 79, in __init__
    raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions

Can the code run on two NVIDIA A100-PCIE-40GB cards? Does the apex setup in the image need any adjustment? Thanks.

t1101675 commented 1 year ago

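This looks like a CUDA configuration problem. Check whether CUDA is usable in the current environment and whether an ordinary torch training run works.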
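
One way to run the suggested checks is a short diagnostic script like the sketch below. It is not part of the repository, and the helper names `try_import` and `diagnose` are made up for illustration. It assumes that apex's fused optimizers depend on a compiled extension module named `amp_C`, and that the "requires cuda extensions" error appears when that module cannot be imported. That matches the apex versions I know of but may differ in other builds.

```python
import importlib


def try_import(name):
    """Return (module, None) on success, or (None, error text) on failure."""
    try:
        return importlib.import_module(name), None
    except Exception as exc:  # ImportError, or a loader error from a mismatched binary
        return None, f"{type(exc).__name__}: {exc}"


def diagnose():
    """Collect basic facts about the torch / CUDA / apex setup (hypothetical helper)."""
    report = {}

    torch, err = try_import("torch")
    report["torch_importable"] = torch is not None
    if torch is None:
        report["torch_error"] = err
        return report

    report["torch_version"] = torch.__version__
    # CUDA toolkit version torch was built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        count = torch.cuda.device_count()
        report["devices"] = [torch.cuda.get_device_name(i) for i in range(count)]
        # An A100 should report compute capability (8, 0).
        report["capabilities"] = [torch.cuda.get_device_capability(i) for i in range(count)]

    # Assumption: apex's fused optimizers need this compiled module.
    # torch is imported first because the extension links against it.
    amp_c, err = try_import("amp_C")
    report["apex_amp_C_importable"] = amp_c is not None
    if amp_c is None:
        report["apex_amp_C_error"] = err

    return report


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

If `cuda_available` is true but `apex_amp_C_importable` is false, the likely cause is that apex was installed without its compiled extensions, or built against a different torch or CUDA version than the one in the image. If `cuda_available` is false, the problem sits below apex, in the driver, container runtime, or torch build.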