NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] #743

Open lakshya-4gp opened 7 months ago

lakshya-4gp commented 7 months ago

Describe the bug The sequence length during training is different from the one specified in the configs. I've specified seq-len 50016, which is divisible by the tensor-model-parallel-size of 4; however, during multinode training I'm seeing 50341 as the sequence dimension.

```
lm_output = self.language_model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 470, in forward
    encoder_input = self.embedding(enc_input_ids, enc_position_ids,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/model/language_model.py", line 239, in forward
    embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 342, in scatter_to_sequence_parallel_region
    return _ScatterToSequenceParallelRegion.apply(input_)
  File "/opt/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 239, in forward
    return _split_along_first_dim(input_)
  File "/sensei-fs/users/lakshya/video_gen/Megatron-LM/megatron/core/tensor_parallel/mappings.py", line 59, in _split_along_first_dim
    dim_size % world_size == 0
AssertionError: First dimension of the tensor should be divisible by tensor parallel size: 50341 % 4 != 0
```
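
For context, here is a minimal sketch (mine, not the actual Megatron `_split_along_first_dim`) of what the failing check is doing: with `--sequence-parallel` on, the embedding output of shape `[s, b, h]` gets scattered along its first (sequence) dimension across the tensor-parallel ranks, so whatever sequence length actually reaches that scatter has to be divisible by the tensor-model-parallel size.

```python
import torch

# Minimal sketch (my own, not Megatron's code) of the check that fires above:
# the sequence-parallel scatter splits the [s, b, h] tensor along dim 0 across
# the tensor-parallel group, so s must be divisible by the TP world size.
def split_along_first_dim(t: torch.Tensor, tp_world_size: int, tp_rank: int) -> torch.Tensor:
    seq_len = t.size(0)
    assert seq_len % tp_world_size == 0, (
        "First dimension of the tensor should be divisible by tensor parallel "
        f"size: {seq_len} % {tp_world_size} != 0"
    )
    chunk = seq_len // tp_world_size
    return t[tp_rank * chunk : (tp_rank + 1) * chunk].contiguous()

# The configured length would pass (50016 % 4 == 0), but the tensor arrives
# with 50341 rows, so the assertion fails:
ok = split_along_first_dim(torch.zeros(50016, 1, 8), tp_world_size=4, tp_rank=0)
print(ok.shape)  # torch.Size([12504, 1, 8])
split_along_first_dim(torch.zeros(50341, 1, 8), tp_world_size=4, tp_rank=0)  # raises AssertionError
```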

To Reproduce Run pretrain_gpt_distributed_with_mp.sh with the following args:

```bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --sequence-parallel \
    --num-layers 44 \
    --hidden-size 1344 \
    --num-attention-heads 24 \
    --seq-length 50016 \
    --max-position-embeddings 50016 \
    --micro-batch-size 1 \
    --global-batch-size 12 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --use-flash-attn
"
```
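
For what it's worth, the parallel-layout arithmetic implied by these args looks consistent on paper; here is a quick sanity check of my own (not a Megatron utility):

```python
# Quick sanity check (my own sketch, not part of Megatron) of the layout
# implied by the args above.
GPUS_PER_NODE, NNODES = 8, 6
TP, PP = 4, 2
MICRO_BS, GLOBAL_BS = 1, 12
SEQ_LEN = 50016

world_size = GPUS_PER_NODE * NNODES              # 48 GPUs total
assert world_size % (TP * PP) == 0
dp = world_size // (TP * PP)                     # data-parallel size = 6
assert GLOBAL_BS % (MICRO_BS * dp) == 0          # gradient-accumulation steps = 2
assert SEQ_LEN % TP == 0                         # 50016 % 4 == 0, so the configured
                                                 # length satisfies the scatter check
print(world_size, dp, GLOBAL_BS // (MICRO_BS * dp))  # 48 6 2
```

So the configured 50016 itself is fine; the 50341 only shows up at runtime.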

Expected behavior During training, the input sequence length should be 50016.

Stack trace/logs See the stack trace above.

Environment (please complete the following information):

Proposed fix If you have a proposal for how to fix the issue state it here or link to a PR.

Additional context Add any other context about the problem here.

github-actions[bot] commented 5 months ago

Marking as stale. No activity in 60 days.

seanliu96 commented 5 months ago

Same error here.

LiuLinyun commented 5 months ago

I got the same error with my specified dataset and global_batch_size, and with sequence_parallel turned on.

github-actions[bot] commented 3 months ago

Marking as stale. No activity in 60 days.

ChenQiaoling00 commented 2 months ago

Same error here.

github-actions[bot] commented 1 week ago

Marking as stale. No activity in 60 days.