lakshya-4gp opened this issue 7 months ago
Marking as stale. No activity in 60 days.
Same error here.
I've got the same error, with the specified dataset and global_batch_size, and sequence_parallel on.
Same error here.
**Describe the bug**
The sequence length during training differs from the one specified in the configs. I've specified `--seq-length 50016`, which is divisible by `--tensor-model-parallel-size 4`, yet during multi-node training I'm seeing 50341 as the sequence dimension.
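For context, here is a minimal standalone sketch of the divisibility constraint the report references (the constants mirror the config below; the rule that the sequence length must divide evenly by the tensor-parallel size when `--sequence-parallel` is on follows from activations being split along the sequence dimension across TP ranks):

```python
# Hypothetical standalone check, not Megatron-LM code: with --sequence-parallel,
# activations are split along the sequence dimension across tensor-parallel
# ranks, so seq-length must be divisible by the TP size.
SEQ_LENGTH = 50016
TP_SIZE = 4

assert SEQ_LENGTH % TP_SIZE == 0, (
    f"seq-length {SEQ_LENGTH} not divisible by TP size {TP_SIZE}"
)
print(SEQ_LENGTH // TP_SIZE)  # 12504 tokens per TP rank

# Note: the observed 50341 fails this check (50341 % 4 == 1), which suggests
# the mismatch arises somewhere upstream of the model config, e.g. in the
# data pipeline rather than in the parallelism settings themselves.
```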
**To Reproduce**
Run `pretrain_gpt_distributed_with_mp.sh` with the following args:
```bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=10.43.176.218
MASTER_PORT=6000
NNODES=6
NODE_RANK=$1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpts/
VOCAB_FILE=datasets/gpt2/gpt2-vocab.json
MERGE_FILE=datasets/gpt2/gpt2-merges.txt
DATA_PATH=datasets/gpt2/my-gpt2_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --sequence-parallel \
    --num-layers 44 \
    --hidden-size 1344 \
    --num-attention-heads 24 \
    --seq-length 50016 \
    --max-position-embeddings 50016 \
    --micro-batch-size 1 \
    --global-batch-size 12 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --use-flash-attn
"
```
**Expected behavior**
During training, the input sequence length should be 50016.