InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

When seq_parallel_world_size is set to a value greater than 1, should use_varlen_attn not be set to true? #938


Fovercon commented 1 month ago

I'm working on 32k long-text SFT for Qwen2-72B. When I set seq_parallel_world_size to a value greater than 1 and use_varlen_attn to True, an error occurs. After checking, it is an assertion error indicating that the length of my input_ids sequence must be divisible by seq_parallel_world_size. Once I padded the sequence to an appropriate length, this error was resolved. However, after several training iterations, the loss becomes NaN.
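
A minimal sketch of that padding step, assuming each sample is a dict of token-id lists; names such as `pad_token_id` and `IGNORE_INDEX` are illustrative and this is not xtuner's actual collate code:

```python
from typing import Dict, List

IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss


def pad_to_multiple(sample: Dict[str, List[int]],
                    seq_parallel_world_size: int,
                    pad_token_id: int) -> Dict[str, List[int]]:
    """Right-pad input_ids/labels so len % seq_parallel_world_size == 0."""
    length = len(sample["input_ids"])
    remainder = length % seq_parallel_world_size
    if remainder == 0:
        return sample
    pad_len = seq_parallel_world_size - remainder
    sample["input_ids"] = sample["input_ids"] + [pad_token_id] * pad_len
    # Padded label positions use the ignore index so they do not contribute to the loss.
    sample["labels"] = sample["labels"] + [IGNORE_INDEX] * pad_len
    return sample
```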

Here is my specific config:

```python
use_varlen_attn = True
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 32768
pack_to_max_length = True

# parallel
sequence_parallel_size = 4

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 32
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 2
optim_type = AdamW
lr = 2e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.1
```

FlyCarrot commented 2 weeks ago

I've run into this bug as well. Sequence parallelism requires the input token length to be evenly divisible, but in practice the alignment does not work out, which eventually corrupts the training labels. In xtuner/xtuner/dataset/utils.py you can set a drop_last parameter so that the trailing content is simply dropped.
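
A minimal sketch of that drop-last idea (an illustration only, not the actual code in xtuner/xtuner/dataset/utils.py): truncate the packed sequence so its length is divisible by the sequence-parallel world size, keeping input_ids and labels aligned.

```python
from typing import Dict, List


def drop_last_to_multiple(sample: Dict[str, List[int]],
                          seq_parallel_world_size: int) -> Dict[str, List[int]]:
    """Truncate input_ids/labels so len % seq_parallel_world_size == 0."""
    length = len(sample["input_ids"])
    keep = length - (length % seq_parallel_world_size)
    # Truncate both fields together so labels stay aligned with input_ids.
    sample["input_ids"] = sample["input_ids"][:keep]
    sample["labels"] = sample["labels"][:keep]
    return sample
```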