OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

`RuntimeError: CUDA error: device-side assert triggered` when running `bash run_finetune_with_lora.sh` with `LLaMA-7b` #377

Closed Ancientshi closed 1 year ago

Ancientshi commented 1 year ago
Hi, I also ran into the same issue when running `bash run_finetune_with_lora.sh` with `LLaMA-7b`. Here are my script and log:

```bash
#!/bin/bash
# Please run this script under the project directory.

deepspeed_args="--master_port=11000"      # Default argument
if [ $# -ge 1 ]; then
  deepspeed_args="$1"
fi

exp_id=AIducation_finetune_with_lora
project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/AIducation/train
model_path=/home/yunxshi/Data/workspace/LMflow/LMFlow-main/output_models/llama-7b-hf
lora_model=/home/yunxshi/Data/workspace/LMflow/LMFlow-main/output_models/llama7b-lora-380k
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  ../examples/finetune.py \
    --model_name_or_path $model_path \
    --lora_model_path ${lora_model} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 1e-4 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 8 \
    --save_aggregated_lora 0 \
    --deepspeed ../configs/ds_config_zero2.json \
    --fp16 \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
```

```text
Rank: 0 partition count [2] and sizes[(2097152, False)]
Rank: 1 partition count [2] and sizes[(2097152, False)] 
Using /home/yunxshi/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.007969379425048828 seconds
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [536,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same `srcIndex < srcSelectDimSize` assertion repeated for threads [33,0,0] through [63,0,0] ...]
Traceback (most recent call last):
  File "/data/yunxshi/workspace/LMflow/LMFlow-main/scripts/../examples/finetune.py", line 61, in <module>
    main()
  File "/data/yunxshi/workspace/LMflow/LMFlow-main/scripts/../examples/finetune.py", line 57, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/data/yunxshi/workspace/LMflow/LMFlow-main/src/lmflow/pipeline/finetuner.py", line 274, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/trainer.py", line 1906, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/trainer.py", line 2652, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/trainer.py", line 2684, in compute_loss
    outputs = model(**inputs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/peft/peft_model.py", line 575, in forward
    return self.base_model(
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 536, in forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 464, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/home/yunxshi/anaconda3/envs/lmflow2/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Using /home/yunxshi/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0011723041534423828 seconds
[2023-05-08 09:38:56,058] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 218826
[2023-05-08 09:38:56,440] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 218827
[2023-05-08 09:38:56,440] [ERROR] [launch.py:324:sigkill_handler] ['/home/yunxshi/anaconda3/envs/lmflow2/bin/python', '-u', '../examples/finetune.py', '--local_rank=1', '--model_name_or_path', '/home/yunxshi/Data/workspace/LMflow/LMFlow-main/output_models/llama-7b-hf', '--lora_model_path', '/home/yunxshi/Data/workspace/LMflow/LMFlow-main/output_models/llama7b-lora-380k', '--dataset_path', '/home/yunxshi/Data/workspace/LMflow/LMFlow-main/data/AIducation/train', '--output_dir', '/home/yunxshi/Data/workspace/LMflow/LMFlow-main/output_models/AIducation_finetune_with_lora', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '512', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', '../configs/ds_config_zero2.json', '--fp16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -6
```

Originally posted by @Ancientshi in https://github.com/OptimalScale/LMFlow/issues/114#issuecomment-1537455013
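
As the error message itself notes, CUDA device-side asserts are reported asynchronously, so the Python stack trace above may not point at the failing op. A minimal sketch of re-running the same script with synchronous kernel launches (much slower, for debugging only) so the assertion surfaces where it is actually triggered:

```bash
# Re-run the failing script with synchronous CUDA kernel launches so the
# device-side assert is raised at the op that actually triggered it.
CUDA_LAUNCH_BLOCKING=1 bash run_finetune_with_lora.sh
```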

research4pan commented 1 year ago

Thanks for your interest in LMFlow! This could be caused by RAM-optimized loading. You may try adding `--use_ram_optimized_load 0` in the `run_finetune_with_lora.sh` script and see if it works. Also, could you please provide your GPU type and memory size, so we can check whether the error is caused by a lack of GPU memory? Thanks 😄
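
For reference, a minimal sketch of the suggested change: it assumes the only edit needed is appending `--use_ram_optimized_load 0` to the existing `finetune.py` invocation in the script above, with the remaining flags left exactly as they are.

```bash
# Sketch only: the deepspeed call from run_finetune_with_lora.sh, trimmed to a
# few flags, with the suggested --use_ram_optimized_load 0 appended. Variables
# (${deepspeed_args}, ${model_path}, ...) are defined earlier in the script.
deepspeed ${deepspeed_args} \
  ../examples/finetune.py \
    --model_name_or_path ${model_path} \
    --lora_model_path ${lora_model} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --use_lora 1 \
    --use_ram_optimized_load 0 \
    --fp16 \
    --do_train
```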

Ancientshi commented 1 year ago

Hi, thanks for your reply. I think I have already solved this issue by clearing the dataset cache. I have another question: when we train the model with Reward Modeling or RAFT, can we use the LoRA method? Or should we (1) train the base model to align it and then (2) do LoRA training on another domain?

hendrydong commented 1 year ago

> Hi, thanks for your reply. I think I have already solved this issue by clearing the dataset cache. I have another question: when we train the model with Reward Modeling or RAFT, can we use the LoRA method? Or should we (1) train the base model to align it and then (2) do LoRA training on another domain?

Yes. You can use LoRA in RAFT.
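
A hypothetical sketch of what that could look like. The example script name `raft_align.py` and its flag support are assumptions; only `--use_lora`, `--lora_r`, and the other flags are taken from the finetuning script above, so check your local copy of LMFlow before relying on this:

```bash
# Hypothetical sketch: run a RAFT alignment example with LoRA enabled by
# forwarding the same LoRA flags used by finetune.py above. The script name
# and flag support are assumptions; verify against your local examples/.
deepspeed --master_port=11000 \
  ../examples/raft_align.py \
    --model_name_or_path ${model_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --use_lora 1 \
    --lora_r 8 \
    --deepspeed ../configs/ds_config_zero2.json \
    --fp16 \
    --do_train
```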

Shelton1013 commented 1 year ago

> Hi, thanks for your reply. I think I have already solved this issue by clearing the dataset cache. I have another question: when we train the model with Reward Modeling or RAFT, can we use the LoRA method? Or should we (1) train the base model to align it and then (2) do LoRA training on another domain?

Hello, I ran into the same problem. I used `dataset.cleanup_cache_files()` to clear the dataset cache, but I still have the issue above. May I ask how you cleared your dataset cache, or how you solved the problem? Many thanks.

research4pan commented 1 year ago

> Hi, thanks for your reply. I think I have already solved this issue by clearing the dataset cache. I have another question: when we train the model with Reward Modeling or RAFT, can we use the LoRA method? Or should we (1) train the base model to align it and then (2) do LoRA training on another domain?

> Hello, I ran into the same problem. I used `dataset.cleanup_cache_files()` to clear the dataset cache, but I still have the issue above. May I ask how you cleared your dataset cache, or how you solved the problem? Many thanks.

You may clear the Hugging Face datasets cache via `rm -rf ~/.cache/huggingface/datasets`. For more details, please refer to this doc. Hope that answers your question. Thanks 😄
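
A minimal sketch of the cache-clearing step suggested above, before re-running the finetuning script. Note that this removes all cached Hugging Face datasets, so they will be re-downloaded or re-processed on the next run:

```bash
# Remove the Hugging Face datasets cache (all cached datasets will be rebuilt
# on the next run), then retry the finetuning script.
rm -rf ~/.cache/huggingface/datasets
bash run_finetune_with_lora.sh
```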

shizhediao commented 1 year ago

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen this issue. Thanks.