OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

sh run_finetune_with_lora.sh fails with RuntimeError: CUDA error: device-side assert triggered #71

Closed MingJiaAn closed 1 year ago

MingJiaAn commented 1 year ago

Traceback (most recent call last):
  File "/mnt/amj/LMFlow/examples/finetune.py", line 69, in <module>
    main()
  File "/mnt/amj/LMFlow/examples/finetune.py", line 65, in main
    tuned_model = finetuner.tune(model=model, lm_dataset=lm_dataset)
  File "/mnt/amj/LMFlow/src/lmflow/pipeline/finetuner.py", line 232, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1906, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 2652, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 2684, in compute_loss
    outputs = model(**inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 936, in forward
    outputs = self.model.decoder(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 644, in forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 538, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 75, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

What could be causing this?

deepspeed ${deepspeed_args} examples/finetune.py \
    --model_name_or_path facebook/galactica-1.3b \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 1e-4 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 8 \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --report_to none \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

I am using the default parameters. The machine has a single GPU, an A100 with 40 GB of memory.
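The error message notes that the stack trace may be misleading because CUDA errors are reported asynchronously, and it suggests rerunning with CUDA_LAUNCH_BLOCKING=1. One possible cause worth ruling out (an assumption, not something confirmed in this thread) is an input token ID that falls outside the model's embedding table, which commonly surfaces as a device-side assert. The sketch below is a minimal, dependency-free check; the function name `find_out_of_range_ids` and the sample data are hypothetical, and in a real run the IDs would come from the tokenized dataset and the vocabulary size from the model's input embedding.

```python
def find_out_of_range_ids(batches, vocab_size):
    """Return (batch_index, position, token_id) for every ID outside [0, vocab_size)."""
    bad = []
    for batch_index, ids in enumerate(batches):
        for position, token_id in enumerate(ids):
            if token_id < 0 or token_id >= vocab_size:
                bad.append((batch_index, position, token_id))
    return bad


# Hypothetical data: two tokenized sequences and a vocabulary of 10 entries.
example_batches = [[0, 5, 9], [3, 10, 2]]
print(find_out_of_range_ids(example_batches, vocab_size=10))
# → [(1, 1, 10)]
```

An empty result would suggest the IDs are not the problem, in which case the blocking rerun recommended by the error message is the next step to locate the failing kernel.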

shizhediao commented 1 year ago

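The system may have corrupted cached data. You can try deleting the cache so that it is regenerated:

rm -rf ~/.cache/huggingface/datasets

If that does not work, we can look at other options.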
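For anyone who prefers to do the same cleanup from Python, the following is a minimal sketch of the equivalent step. The helper name `clear_datasets_cache` is hypothetical; it assumes the cache lives at the default path from the command above unless the `HF_DATASETS_CACHE` environment variable points elsewhere.

```python
import os
import shutil
from pathlib import Path


def clear_datasets_cache(cache_dir=None):
    """Delete the datasets cache directory. Return True if something was removed."""
    if cache_dir is None:
        # Default location used by the command above, unless overridden
        # through the HF_DATASETS_CACHE environment variable.
        default = Path.home() / ".cache" / "huggingface" / "datasets"
        cache_dir = os.environ.get("HF_DATASETS_CACHE", default)
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False  # nothing to delete
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_datasets_cache()` with no argument removes the default cache, so the dataset is re-tokenized on the next run; passing an explicit path limits the deletion to that directory.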

hudengjunai commented 1 year ago

I have encountered the same problem.

zzcgithub commented 1 year ago

I have encountered the same problem.

sz128 commented 1 year ago

I have encountered the same problem.

shizhediao commented 1 year ago

Could you try this command?

rm -rf ~/.cache/huggingface/datasets

Thanks!

LeeChongKeat commented 1 year ago

rm -rf ~/.cache/huggingface/datasets

This did not work.