The system may have corrupted cache data. You can try deleting the cache so that it gets regenerated:

rm -rf ~/.cache/huggingface/datasets

If that doesn't work, we can look at other solutions.
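If you would rather not wipe the whole cache directory, a lighter-weight option is to force the datasets library to rebuild the cached copy of just the affected dataset. A minimal sketch, assuming your data is loaded through load_dataset; the data file path here is a placeholder, not LMFlow's actual loading code:

from datasets import load_dataset

# Re-download and re-process one dataset instead of deleting
# everything under ~/.cache/huggingface/datasets.
ds = load_dataset(
    "json",
    data_files="path/to/train.json",  # placeholder path
    download_mode="force_redownload",
)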
I have encountered the same problem.
Could you try this command?
rm -rf ~/.cache/huggingface/datasets
Thanks!
rm -rf ~/.cache/huggingface/datasets
Not working. The same error still occurs:
Traceback (most recent call last):
  File "/mnt/amj/LMFlow/examples/finetune.py", line 69, in <module>
    main()
  File "/mnt/amj/LMFlow/examples/finetune.py", line 65, in main
    tuned_model = finetuner.tune(model=model, lm_dataset=lm_dataset)
  File "/mnt/amj/LMFlow/src/lmflow/pipeline/finetuner.py", line 232, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1906, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 2652, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 2684, in compute_loss
    outputs = model(**inputs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 936, in forward
    outputs = self.model.decoder(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 644, in forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 538, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/mnt/amj/conda/envs/lmflow/lib/python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 75, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

What is causing this problem? The command I ran:

deepspeed ${deepspeed_args} examples/finetune.py \
  --model_name_or_path facebook/galactica-1.3b \
  --dataset_path ${dataset_path} \
  --output_dir ${output_dir} --overwrite_output_dir \
  --num_train_epochs 0.01 \
  --learning_rate 1e-4 \
  --block_size 512 \
  --per_device_train_batch_size 1 \
  --use_lora 1 \
  --lora_r 8 \
  --deepspeed configs/ds_config_zero3.json \
  --bf16 \
  --run_name finetune_with_lora \
  --validation_split_percentage 0 \
  --logging_steps 20 \
  --do_train \
  --ddp_timeout 72000 \
  --save_steps 5000 \
  --report_to none \
  --dataloader_num_workers 1 \
  | tee ${log_dir}/train.log 2> ${log_dir}/train.err

The parameters are the defaults. I only have a single GPU, an A100 with 40 GB of memory.
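A device-side assert reported at this point is often a failure from an earlier, asynchronous kernel launch, and a common culprit with OPT-family models like Galactica is an embedding lookup that received a token id outside the model's vocabulary (for example, when the tokenizer and the embedding matrix are out of sync). A minimal check you can run on CPU first, before involving DeepSpeed; this is a generic transformers sketch, not LMFlow's API, and the sample text is a placeholder:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/galactica-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Every token id must be strictly less than the embedding size;
# an out-of-range id triggers exactly this device-side assert on GPU.
batch = tokenizer("a few lines from your training data", return_tensors="pt")
max_id = int(batch["input_ids"].max())
vocab_size = model.get_input_embeddings().num_embeddings
print(f"max token id: {max_id}, embedding size: {vocab_size}")
assert max_id < vocab_size, "out-of-range token id: resize the embeddings or fix the tokenizer"

Rerunning the failing command with CUDA_LAUNCH_BLOCKING=1 prepended, as the error message suggests, also makes the traceback point at the kernel that actually failed rather than at a later launch.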