Open · dingy007 opened 4 months ago
Hi @dingy007! I apologize for the inconvenience this may have caused.
I’ve just tested your config and found that it runs smoothly in my environment (after replacing your sft data with Alpaca data and adjusting accumulative_counts from 128 to 4).
I would like to confirm a few things with you:
torchrun --nnodes=1 --nproc_per_node=8 xtuner/tools/train.py xtuner/configs/qwen/qwen2/debug.py --deepspeed deepspeed_zero1 --work-dir work_dirs/debug --launcher pytorch
Please check if our scripts are consistent. If the issue persists, feel free to reach out for further assistance!
1. The flash_attn version is 2.5.9.post1.
2. Do not modify the tokenizer.
3. I use the command like this:
NPROC_PER_NODE=4 xtuner train qwen2_1_5b_chat_lora_sft.py --deepspeed deepspeed_zero2
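For reference, here is a quick way to print the versions in question. This is just a convenience snippet, not part of the original reports; it assumes torch, transformers, and flash_attn are importable in the training environment.
# Print the package versions relevant to this issue.
import torch
import transformers
import flash_attn

print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('flash_attn:', flash_attn.__version__)
print('CUDA available:', torch.cuda.is_available())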
Same problem here. Has it been resolved? It seems to be caused by setting use_varlen_attn to True; setting it to False does not raise the error.
I have not been able to reproduce the error you encountered thus far. Therefore, I would kindly ask you to try using my config to see if the program runs smoothly:
- Check if flash_attn has been installed successfully:
from transformers.utils.import_utils import is_flash_attn_2_available
print(is_flash_attn_2_available())
- Use the latest code from the main branch of xtuner:
git clone https://github.com/InternLM/xtuner.git
cd xtuner
pip install -e '.[all]'
- Train using the following config:
NPROC_PER_NODE=4 xtuner train ${CONFIG} --deepspeed deepspeed_zero2
- Check the training log for the presence of the message
mmengine - INFO - Dispatch Qwen2FlashAttention2 varlen forward.
Here is my config: qwen2_7b.txt
Here is my training log: qwen2_7b_log.txt
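If it helps, the log check in step 4 above can be automated with a few lines of Python. This is only a convenience sketch; the log path below is a placeholder and should point at the log file under your own work_dirs.
# Look for the varlen dispatch message in an xtuner/mmengine training log.
needle = 'Dispatch Qwen2FlashAttention2 varlen forward'
log_path = 'work_dirs/your_experiment/your_timestamp.log'  # placeholder path

with open(log_path, encoding='utf-8') as f:
    dispatched = any(needle in line for line in f)
print('varlen forward dispatched:', dispatched)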
Hello, I followed your steps and used your configuration, but still got the error: RuntimeError: CUDA error: device-side assert triggered
Hi @bo-jpg @dingy007 !
This commit may address the bug you encountered, and we will merge the pull request as soon as possible. We apologize for any inconvenience this has caused.
If any issues persist, please let us know.
I modified this line of code and now the model trains normally! Thank you so much; this problem had been bothering me for several days. It is amazing that a single line of code can make such a difference. Thank you again for your patient help!
Hello, I tested model performance with use_varlen_attn set to True. Compared with the model trained with use_varlen_attn set to False (all other parameters the same), performance on the test set became very poor. I wonder whether there is still a problem with the attention calculation for Qwen2 when use_varlen_attn is True?
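For anyone comparing the two settings, here is a minimal PyTorch sketch of what use_varlen_attn is meant to enforce conceptually: with packed samples, attention should stay inside each sub-sequence (a block-diagonal causal pattern derived from cumulative lengths) rather than flowing across sample boundaries. This is only an illustration of the idea, not xtuner's or flash-attn's implementation, and the names (cu_seqlens, block_diagonal_causal_mask) are invented for the example.
# Illustration only: the block-diagonal causal pattern that varlen attention
# is supposed to enforce for packed samples. Not xtuner's actual code.
import torch

def block_diagonal_causal_mask(cu_seqlens, total_len):
    # cu_seqlens like [0, 5, 9, 16] describes three packed samples.
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        size = end - start
        mask[start:end, start:end] = torch.ones(size, size).tril().bool()
    return mask  # True = attention allowed

mask = block_diagonal_causal_mask([0, 5, 9, 16], 16)
assert not mask[5, 0]  # tokens of sample 2 cannot attend back into sample 1
# If the cumulative lengths are computed incorrectly, attention can leak
# across samples or be restricted too aggressively, which would hurt quality.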
Please provide your training log so we can check if the loss decreased normally.
Hello, the loss decreased normally. Here is the training log: 20240712_211628.log
I trained the model using the xtuner settings above, but found that the model's performance was very poor. The same data trains normally with llama-factory.
I used the following training parameters:
pack_to_max_length = False
sequence_parallel_size = 4
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1  # bs = 1 GPU * 1 batch_size_per_device * 16 acc
accumulative_counts *= sequence_parallel_size
The loss did not decrease significantly.
My dataset has 198k samples, with lengths ranging from 1k to 16k tokens.
Can you please explain how the number of training iterations is calculated in xtuner? The training log shows tens of thousands of iterations.
Given the situation above, do you have any suggestions for the training parameters?
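Regarding the iteration count, here is a back-of-the-envelope estimate. It rests on assumptions that are not confirmed here: that the effective data-parallel world size is num_gpus // sequence_parallel_size, and that one logged iter corresponds to one micro-batch per data-parallel rank.
# Rough iteration estimate under the stated assumptions; the GPU count is a
# placeholder, the other numbers come from the parameters quoted above.
num_samples = 198_000            # dataset size mentioned above
num_gpus = 8                     # placeholder; use your actual GPU count
sequence_parallel_size = 4
batch_size_per_device = 1
max_epochs = 3

dp_world_size = num_gpus // sequence_parallel_size               # e.g. 2
iters_per_epoch = num_samples // (dp_world_size * batch_size_per_device)
print(iters_per_epoch, iters_per_epoch * max_epochs)             # e.g. 99000 297000
# With pack_to_max_length=True, samples are packed to max_length first,
# which shrinks the effective dataset size and therefore the iteration count.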
config file:
# Model
pretrained_model_name_or_path = '/data/llm/cache/Qwen2-7B-Instruct/'
use_varlen_attn = True

# Data
data_files = ['/workspace/xtuner/sft_openai.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 32768
pack_to_max_length = True
sequence_parallel_size = 4

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 128  # bs = 1 GPU * 1 batch_size_per_device * 16 acc
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 32
max_epochs = 3
optim_type = AdamW
lr = 3e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.01

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right',
    eos_token='<|im_end|>')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=data_files),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=openai_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=SequenceParallelSampler, seed=1024, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]
Error info:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [121,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [122,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [124,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [125,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [126,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [97,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
rank3: Traceback (most recent call last):
rank3:   File "/workspace/xtuner/xtuner/tools/train.py", line 360, in <module>
rank3:   File "/workspace/xtuner/xtuner/tools/train.py", line 356, in main
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
rank3:     model = self.train_loop.run()  # type: ignore
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 287, in run
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/loops.py", line 311, in run_iter
rank3:     outputs = self.runner.model.train_step(
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 133, in train_step
rank3:     losses = self._run_forward(data, mode='loss')
rank3:   File "/opt/conda/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in _run_forward
rank3:     results = self.model(**data, mode=mode)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
rank3:     ret_val = func(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
rank3:     loss = self.module(*inputs, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 245, in forward
rank3:     return self.compute_loss(data, data_samples)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 289, in compute_loss
rank3:     return self._compute_sequence_parallel_loss(data)
rank3:   File "/workspace/xtuner/xtuner/model/sft.py", line 279, in _compute_sequence_parallel_loss
rank3:     outputs = self.llm(**data)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
rank3:     return self.base_model(
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank3:     return self._call_impl(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank3:     return forward_call(*args, **kwargs)
rank3:   File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
rank3:     return self.model.forward(*args, **kwargs)
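Separate from the fix referenced earlier in the thread, a generic way to localize a device-side assert like this one is to run with CUDA_LAUNCH_BLOCKING=1 so the Python traceback points at the failing kernel, and to sanity-check that no token id falls outside the embedding table. This is an illustrative sketch only, not the actual root cause here; the model path is copied from the config above.
# Generic debugging aids for 'CUDA error: device-side assert triggered'.
# Illustrative sketch; the root cause in this thread was fixed in xtuner itself.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # must be set before CUDA is initialized

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = '/data/llm/cache/Qwen2-7B-Instruct/'   # model path from the config above
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)

ids = tokenizer('sanity check', return_tensors='pt').input_ids
vocab_size = model.get_input_embeddings().num_embeddings
assert int(ids.max()) < vocab_size, 'token id outside the embedding table'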