InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

finetuning Qwen2-7B-INSTRUCT got RuntimeError: CUDA error: device-side assert triggered #814

Open dingy007 opened 4 months ago

dingy007 commented 4 months ago

config file:

```python
# Model
pretrained_model_name_or_path = '/data/llm/cache/Qwen2-7B-Instruct/'
use_varlen_attn = True

# Data
data_files = ['/workspace/xtuner/sft_openai.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 32768
pack_to_max_length = True

sequence_parallel_size = 4

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 128  # bs = 1 GPU * 1 batch_size_per_device * 16 acc
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 32
max_epochs = 3
optim_type = AdamW
lr = 3e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.01

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right',
    eos_token='<|im_end|>')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=data_files),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=openai_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=SequenceParallelSampler, seed=1024, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]
```

Error info:

```text
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
... (the same assertion is repeated for threads [122,0,0] through [127,0,0])
rank3: Traceback (most recent call last):
rank3:   File "/workspace/xtuner/xtuner/tools/train.py", line 360, in <module>
rank3:   File "/workspace/xtuner/xtuner/tools/train.py", line 356, in main
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
rank3:     model = self.train_loop.run()  # type: ignore
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 287, in run
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 311, in run_iter
rank3:     outputs = self.runner.model.train_step(
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 133, in train_step
rank3:     losses = self._run_forward(data, mode='loss')
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in _run_forward
rank3:     results = self.model(**data, mode=mode)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
rank3:     ret_val = func(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
rank3:     loss = self.module(*inputs, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 245, in forward
rank3:     return self.compute_loss(data, data_samples)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 289, in compute_loss
rank3:     return self._compute_sequence_parallel_loss(data)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 279, in _compute_sequence_parallel_loss
rank3:     outputs = self.llm(**data)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
rank3:     return self.base_model(
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
rank3:     return self.model.forward(*args, **kwargs)
```

HIT-cwh commented 4 months ago

Hi @dingy007 ! I apologize for the inconvenience this may have caused.

I’ve just tested your config and found that it runs smoothly in my environment (after replacing your sft data with Alpaca data and adjusting accumulative_counts from 128 to 4).

I would like to confirm a few things with you:

  1. Is flash_attn properly installed in your environment?
  2. Is the qwen2 model you are using identical to the one on the HF Hub? (Any modifications to the tokenizer?)
  3. The command I used for the experiment is `torchrun --nnodes=1 --nproc_per_node=8 xtuner/tools/train.py xtuner/configs/qwen/qwen2/debug.py --deepspeed deepspeed_zero1 --work-dir work_dirs/debug --launcher pytorch`. Please check whether our launch commands are consistent.

If the issue persists, feel free to reach out for further assistance!

dingy007 commented 4 months ago

1. The flash_attn version is 2.5.9.post1.
2. The tokenizer is not modified.
3. I launch training with `NPROC_PER_NODE=4 xtuner train qwen2_1_5b_chat_lora_sft.py --deepspeed deepspeed_zero2`.

bo-jpg commented 4 months ago

Same problem here. Has it been resolved? It appears to be caused by setting use_varlen_attn to True; setting it to False does not raise the error.
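
For anyone hitting the same assert, a minimal sketch of that workaround; the flag names mirror the config posted above, and every consumer of the flag in that config must receive the same value:

```python
# Workaround sketch (not the final fix): turn off variable-length attention.
# The names below mirror the config from the original report.
use_varlen_attn = False   # was True when the device-side assert was triggered

# The same value must reach every place the config passes the flag:
#   model = dict(type=SupervisedFinetune, use_varlen_attn=use_varlen_attn, ...)
#   train_dataset = dict(..., use_varlen_attn=use_varlen_attn)
#   train_dataloader = dict(
#       ..., collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
```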

HIT-cwh commented 4 months ago

I have not been able to reproduce the error you encountered thus far. Therefore, I would kindly ask you to try using my config to see if the program runs smoothly:

  1. Check if flash_attn has been installed successfully:
    from transformers.utils.import_utils import is_flash_attn_2_available
    print(is_flash_attn_2_available())
  2. Use the latest code from the main branch of xtuner:
    git clone https://github.com/InternLM/xtuner.git
    cd xtuner
    pip install -e '.[all]'
  3. Train with the attached config using: `NPROC_PER_NODE=4 xtuner train ${CONFIG} --deepspeed deepspeed_zero2`
  4. Check the training log for the presence of the message `mmengine - INFO - Dispatch Qwen2FlashAttention2 varlen forward`.
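
For step 4, a small sketch that scans the training log for that message; the log path is an assumption and should point at the timestamped log under your work directory:

```python
# Sketch: look for the varlen dispatch message in an xtuner training log.
from pathlib import Path

log_file = Path('work_dirs/qwen2_7b/20240712_211628.log')  # hypothetical path
needle = 'Dispatch Qwen2FlashAttention2 varlen forward'
found = any(needle in line for line in log_file.read_text(errors='ignore').splitlines())
print('varlen forward dispatched:', found)
```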

Here is my config. qwen2_7b.txt

Here is my training log. qwen2_7b_log.txt

bo-jpg commented 4 months ago

> I have not been able to reproduce the error you encountered thus far. Therefore, I would kindly ask you to try using my config to see if the program runs smoothly: [...]

Hello, I followed your steps and used your configuration, but still got the error: `RuntimeError: CUDA error: device-side assert triggered`.

HIT-cwh commented 4 months ago

Hi @bo-jpg @dingy007 !

This commit may address the bug you encountered, and we will merge the pull request as soon as possible. We apologize for any inconvenience this may have caused.

If any issues persist, please let us know.

bo-jpg commented 4 months ago

> This commit may address the bug you encountered, and we will merge the pull request as soon as possible. [...]

I modified this line of code and now the model trains normally! Thank you so much; this problem had been bothering me for several days. It is amazing that a single line of code can make such a difference. Thank you again for your patient reply!

bo-jpg commented 4 months ago

> This commit may address the bug you encountered, and we will merge the pull request as soon as possible. [...]

Hello, I tested the model's performance with use_varlen_attn set to True. Compared with the model trained with use_varlen_attn set to False, performance on the test set was much worse (all other parameters were the same). I wonder whether there is still a problem with the attention calculation when use_varlen_attn is set to True for Qwen2?

HIT-cwh commented 4 months ago

Please provide your training log so we can check if the loss decreased normally.

bo-jpg commented 4 months ago

> Please provide your training log so we can check if the loss decreased normally.

Hello, the loss decreased normally. Here is the training log: 20240712_211628.log

dingy007 commented 4 months ago

> Please provide your training log so we can check if the loss decreased normally.
>
> Hello, the loss decreased normally. Here is the training log: 20240712_211628.log

I trained the model using the above xtuner settings, but found that the model's performance was very poor. The same data trains normally on llama-factory.

  1. I used the following training parameters:

    pack_to_max_length = False
    sequence_parallel_size = 4
    # Scheduler & Optimizer
    batch_size = 1 # per_device
    accumulative_counts = 1  # bs = 1 GPU * 1 batch_size_per_device * 16 acc
    accumulative_counts *= sequence_parallel_size

    The loss did not decrease significantly

  2. My dataset has 198k samples, and the sample lengths range from 1k to 16k.

  3. Can you please explain how the number of training iterations is calculated in xtuner? The training log shows tens of thousands of iterations (a rough estimate is sketched after this list).

  4. Given the above, do you have any suggestions for the training parameters?
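
A rough sketch for point 3, under the assumption that xtuner's SequenceParallelSampler shards the dataset across world_size // sequence_parallel_size data-parallel ranks; the numbers are the ones quoted in this thread, so treat the result as an estimate rather than xtuner's exact formula:

```python
# Back-of-the-envelope sketch: with sequence parallelism, each group of
# sequence_parallel_size GPUs consumes the same batch, so the data-parallel
# world size (and hence the dataloader length) shrinks accordingly.
import math

num_samples = 198_000               # dataset size (pack_to_max_length = False)
world_size = 4                      # NPROC_PER_NODE=4
sequence_parallel_size = 4
batch_size = 1                      # per device
accumulative_counts = 1 * sequence_parallel_size   # as set above

dp_world_size = world_size // sequence_parallel_size          # -> 1
iters_per_epoch = math.ceil(num_samples / (dp_world_size * batch_size))
optim_steps_per_epoch = math.ceil(iters_per_epoch / accumulative_counts)
print(iters_per_epoch, optim_steps_per_epoch)   # 198000 iterations, 49500 optimizer updates
```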