Hi, I met the same problem. Have you solved it?
Can you provide more information on your configuration? Megatron-LM works fine for me on 8 L40Ss when I run:
export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 8 \
    /megatron/pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 12 \
    --hidden-size 1024 \
    --num-attention-heads 64 \
    --seq-length 256 \
    --max-position-embeddings 2048 \
    --micro-batch-size 2 \
    --global-batch-size 32 \
    --train-samples 512 \
    --data-path /data/gpt_sample_dataset_00_text_document \
    --vocab-file /data/gpt2-vocab.json \
    --merge-file /data/gpt2-merges.txt \
    --lr 1.0e-4 \
    --transformer-impl transformer_engine \
    --fp8-format hybrid \
    --normalization RMSNorm
I am using the latest commits in Megatron-LM (https://github.com/NVIDIA/Megatron-LM/commit/0bc3547702464501feefeb5523b7a17e591b21fa) and Transformer Engine (https://github.com/NVIDIA/TransformerEngine/commit/67b6743204e5d40da037ca935931db2ea1a24ca7).
Have you tried training with sequence parallelism and context parallelism enabled? I'm not sure whether the problem is related to those; see the sketch below for what I mean. @timmoon10
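For reference, this is roughly the variant I have in mind: your command above with the two flags added. It is only a sketch, assuming a recent Megatron-LM where both --sequence-parallel and --context-parallel-size are available; with TP=2, PP=2, CP=2 all 8 GPUs are used and the data-parallel size is 1:
# Assumptions: --sequence-parallel requires tensor parallelism (TP=2 here)
# and CUDA_DEVICE_MAX_CONNECTIONS=1; --context-parallel-size 2 splits the
# sequence across 2 more ranks, so TP*PP*CP = 8 matches --nproc_per_node.
export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 8 \
    /megatron/pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --sequence-parallel \
    --context-parallel-size 2 \
    --num-layers 12 \
    --hidden-size 1024 \
    --num-attention-heads 64 \
    --seq-length 256 \
    --max-position-embeddings 2048 \
    --micro-batch-size 2 \
    --global-batch-size 32 \
    --train-samples 512 \
    --data-path /data/gpt_sample_dataset_00_text_document \
    --vocab-file /data/gpt2-vocab.json \
    --merge-file /data/gpt2-merges.txt \
    --lr 1.0e-4 \
    --transformer-impl transformer_engine \
    --fp8-format hybrid \
    --normalization RMSNorm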
I found that this problem started occurring after https://github.com/NVIDIA/TransformerEngine/commit/905d94f487e8ee6c03203c79e94acea6396f6142. @timmoon10
I found the cause of the problem. @timmoon10
Good catch @liliying001. Does https://github.com/NVIDIA/TransformerEngine/pull/983 fix this issue for you guys?
Yes, it's solved!
Just normal training in Megatron-LM, but it reports this error: