epfLLM / Megatron-LLM

distributed trainer for LLMs

RuntimeError: seq_len <= 2048 INTERNAL ASSERT FAILED #80

Closed. 13416157913 closed this issue 9 months ago.

13416157913 commented 9 months ago

Hello, this is my finetune script (it errors when I set --seq_length=4096, but runs fine with --seq_length=2048). Hardware: 8×80GB A800.

export CUDA_DEVICE_MAX_CONNECTIONS=1
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 100 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008 --hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion --no_bias_dropout_fusion --use_checkpoint_args --attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5 --layernorm_epsilon 1e-6 --weight_decay 0.1 --sequence_parallel --recompute_granularity selective --log_timers_to_tensorboard --rope_scaling_factor 1.0"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --load /Megatron-LLM-sharded-weights \
    --save /Megatron-LLM-sharded-weights \
    --tensorboard_dir /Megatron-LLM-sharded-weights/tensorboard/ \
    --data_path /Megatron-LLM/corpus_indexed/china_text_document \
    --split 100,0,0 \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=/megatron-llama-2-7b-checkpoint_TP2_PP1_DP4/tokenizer.model \
    --make_vocab_size_divisible_by 1 \
    --bf16 \
    --global_batch_size 1000 \
    --micro_batch_size 2 \
    --use_checkpoint_args \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS

====================================================================== Error:

Traceback (most recent call last):
  File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in <module>
    pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
  File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 139, in pretrain
    iteration = _train(args,
  File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 685, in _train
    train_step(forward_step_func,
  File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 412, in train_step
    losses_reduced = forward_backward_func(
  File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/home/dengkaibiao/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 227, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/module.py", line 186, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
    lm_output = self.language_model(
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
    encoder_output = self.encoder(
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 757, in forward
    attention_output, attention_bias = self.self_attention(layernorm_output,
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 510, in forward
    context_layer = self._checkpointed_attention_forward(
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 397, in _checkpointed_attention_forward
    hidden_states = megatron.core.tensor_parallel.checkpoint(
  File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 251, in checkpoint
    return CheckpointFunction.apply(function,
  File "/home/dengkaibiao/Megatron-LLM/megatron/core/tensor_parallel/random.py", line 194, in forward
    outputs = run_function(*args)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 393, in custom_forward
    output = self.core_attention(query_layer, key_layer,
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/transformer.py", line 231, in forward
    attention_probs = self.scale_mask_softmax(attention_scores,
  File "/home/dengkaibiao/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 148, in forward
    return self.forward_fused_softmax(input, mask)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 183, in forward_fused_softmax
    probs = ScaledUpperTriangMaskedSoftmax.apply(input, scale)
  File "/home/dengkaibiao/Megatron-LLM/megatron/model/fused_softmax.py", line 22, in forward
    softmax_results = scaled_upper_triang_masked_softmax_cuda.forward(
RuntimeError: seq_len <= 2048 INTERNAL ASSERT FAILED at "/home/llm-deploy/apex/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.cu":38, please report a bug to PyTorch.

martinjaggi commented 9 months ago

Strange, it seems to be hardcoded already in the original NVIDIA Megatron, see https://github.com/epfLLM/Megatron-LLM/blob/main/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu#L24

Though we had no problem training with seq_len = 4096.

@AleHD any idea?
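
One plausible reason 4096-token runs can still work despite the hardcoded cap is that the fused kernel is only one of two code paths: the scale-mask-softmax wrapper checks whether the kernel supports the input shape and otherwise falls back to a plain PyTorch softmax. Below is a simplified sketch of that dispatch (an illustration under assumptions, not the repo's exact code; the real availability check in megatron/model/fused_softmax.py presumably also looks at mask type, fusion flags, and batch divisibility):

```python
import torch

# Assumed cap, mirroring the assert in the CUDA source linked above.
FUSED_KERNEL_MAX_SEQ_LEN = 2048


def fused_kernel_available(scores: torch.Tensor) -> bool:
    # The fused causal kernel only handles fp16/bf16 scores up to a fixed length.
    seq_len = scores.size(-1)
    return (scores.is_cuda
            and scores.dtype in (torch.float16, torch.bfloat16)
            and 16 < seq_len <= FUSED_KERNEL_MAX_SEQ_LEN)


def scale_mask_softmax(scores: torch.Tensor, scale: float) -> torch.Tensor:
    """scores: [batch * heads, seq_len, seq_len] causal attention scores."""
    if not fused_kernel_available(scores):
        # Unfused fallback: the same math in plain PyTorch, works for any seq_len.
        seq_len = scores.size(-1)
        causal_mask = torch.ones(seq_len, seq_len, dtype=torch.bool,
                                 device=scores.device).triu(diagonal=1)
        scores = (scores * scale).masked_fill(causal_mask, float("-inf"))
        return torch.softmax(scores, dim=-1)
    # Fused path (omitted in this sketch): ScaledUpperTriangMaskedSoftmax.apply(scores, scale),
    # the call where "seq_len <= 2048" is asserted by a capped kernel build.
    raise NotImplementedError("fused kernel call omitted in this sketch")
```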

13416157913 commented 9 months ago

Strange, it seems to be hardcoded already in the original NVIDIA Megatron, in ...

Earlier, I had replaced ../Megatron-LLM/megatron/fused_kernels with the copy from ../Megatron-LM/megatron/fused_kernels, because I hit a ninja error while sharding the model, and that swap worked around it.

martinjaggi commented 9 months ago

This is not compatible, as they have changed their kernels since then.

Please use the kernels from our repo.

If there is still a problem with our kernels, can you open a separate issue describing what happens, and close this one?

13416157913 commented 9 months ago

This is not compatible, as they have changed their kernels since then.

Please use the kernels from our repo.

If there is still a problem with our kernels, can you open a separate issue describing what happens, and close this one?

I tried switching back to the Megatron-LLM kernels, but I hit the same issue.
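
One thing that may be worth ruling out (an assumption on my part, based on the traceback pointing into /home/llm-deploy/apex/csrc/...): even after restoring the repo's fused_kernels sources, Python may still be importing the extension that apex installed. A quick check:

```python
# Print which compiled extension actually backs the fused softmax at runtime.
# If this resolves into the apex install rather than a Megatron-LLM build
# directory, the old 2048-capped kernel is still the one being picked up,
# and the repo's kernels need to be rebuilt / put ahead of it.
import importlib

mod = importlib.import_module("scaled_upper_triang_masked_softmax_cuda")
print(mod.__file__)
```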