lakshya-4gp opened this issue 7 months ago
Marking as stale. No activity in 60 days.
Same error here.
I've got the same error, with the specified dataset and global_batch_size, and sequence_parallel on.
Same error here.
**Describe the bug**
The sequence length during training differs from the one specified in the configs. I've specified `--seq-length 50016`, which is divisible by `--tensor-model-parallel-size 4`, yet during multi-node training I'm seeing 50341 as the sequence dimension.
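For context, here is a minimal standalone sketch of the divisibility constraint the report references (the constants mirror the config below; the rule that the sequence length must divide evenly by the tensor-parallel size when `--sequence-parallel` is on follows from activations being split along the sequence dimension across TP ranks):

```python
# Hypothetical standalone check, not Megatron-LM code: with --sequence-parallel,
# activations are split along the sequence dimension across tensor-parallel
# ranks, so seq-length must be divisible by the TP size.
SEQ_LENGTH = 50016
TP_SIZE = 4

assert SEQ_LENGTH % TP_SIZE == 0, (
    f"seq-length {SEQ_LENGTH} not divisible by TP size {TP_SIZE}"
)
print(SEQ_LENGTH // TP_SIZE)  # 12504 tokens per TP rank

# Note: the observed 50341 fails this check (50341 % 4 == 1), which suggests
# the mismatch arises somewhere upstream of the model config, e.g. in the
# data pipeline rather than in the parallelism settings themselves.
```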
**To Reproduce**
Run `pretrain_gpt_distributed_with_mp.sh` with the following args:
```bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --sequence-parallel \
    --num-layers 44 \
    --hidden-size 1344 \
    --num-attention-heads 24 \
    --seq-length 50016 \
    --max-position-embeddings 50016 \
    --micro-batch-size 1 \
    --global-batch-size 12 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --use-flash-attn
"
```
**Expected behavior**
During training, the input sequence length should be 50016.