NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Backward pass fails after updating TE to the main branch #941

Closed · 1049451037 closed this issue 2 hours ago

1049451037 commented 2 weeks ago

Just running normal training in Megatron-LM, but it reports this error:

  File "/megatron/training/training.py", line 277, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/megatron/training/training.py", line 1039, in train
    train_step(forward_step_func,
  File "/megatron/training/training.py", line 552, in train_step
    losses_reduced = forward_backward_func(
  File "/megatron/core/pipeline_parallel/schedules.py", line 1423, in forward_backward_pipelining_without_interleaving
    input_tensor_grad = backward_step(
  File "/megatron/core/pipeline_parallel/schedules.py", line 299, in backward_step
    custom_backward(output_tensor[0], output_tensor_grad[0])
  File "/megatron/core/pipeline_parallel/schedules.py", line 142, in custom_backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/rmsnorm.py", line 68, in backward
    dxmat, dgamma = tex.rmsnorm_bwd(
IndexError: _Map_base::at
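
For reference, here is a minimal sketch of the call pattern the traceback goes through: a Transformer Engine RMSNorm forward followed by a backward pass, which dispatches to tex.rmsnorm_bwd. The shapes and hidden size are illustrative rather than taken from the training run above, and running this outside Megatron-LM is not guaranteed to reproduce the crash.

# Minimal sketch (assumes a CUDA GPU and an installed Transformer Engine build).
# The dimensions below are illustrative, not the ones from the failing run.
import torch
import transformer_engine.pytorch as te

rmsnorm = te.RMSNorm(1024, eps=1e-5).cuda()  # module whose backward calls tex.rmsnorm_bwd
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

y = rmsnorm(x)      # forward
y.sum().backward()  # backward; this is the step where IndexError: _Map_base::at was raised above

print(x.grad.shape, rmsnorm.weight.grad.shape)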
liliying001 commented 1 day ago

Hi, I met the same problem. Have you solved it?

timmoon10 commented 1 day ago

Can you provide more information on your configuration? Megatron-LM works fine for me on 8 L40S GPUs when I run:

export CUDA_DEVICE_MAX_CONNECTIONS=1;

torchrun --nproc_per_node 8 \
/megatron/pretrain_gpt.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--num-layers 12 \
--hidden-size 1024 \
--num-attention-heads 64 \
--seq-length 256 \
--max-position-embeddings 2048 \
--micro-batch-size 2 \
--global-batch-size 32 \
--train-samples 512 \
--data-path /data/gpt_sample_dataset_00_text_document \
--vocab-file /data/gpt2-vocab.json \
--merge-file /data/gpt2-merges.txt \
--lr 1.0e-4 \
--transformer-impl transformer_engine \
--fp8-format hybrid \
--normalization RMSNorm

I am using the latest commits in Megatron-LM (https://github.com/NVIDIA/Megatron-LM/commit/0bc3547702464501feefeb5523b7a17e591b21fa) and Transformer Engine (https://github.com/NVIDIA/TransformerEngine/commit/67b6743204e5d40da037ca935931db2ea1a24ca7).
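
For anyone comparing against those commits, a quick way to print the installed builds is sketched below. It assumes te.__version__ is exposed by the Transformer Engine build (it reports a version string rather than a commit hash); pip show transformer_engine gives similar information otherwise.

# Report installed versions to compare setups against the commits listed above.
import torch
import transformer_engine as te

print("PyTorch:", torch.__version__)
# __version__ is assumed to be set by the TE build; it is a version string, not a commit hash.
print("Transformer Engine:", te.__version__)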

1049451037 commented 1 day ago

Have you tried training with sequence parallelism and context parallelism? I'm not sure whether the problem is due to that. @timmoon10

liliying001 commented 1 day ago

I found that this problem started occurring after https://github.com/NVIDIA/TransformerEngine/commit/905d94f487e8ee6c03203c79e94acea6396f6142 @timmoon10

liliying001 commented 1 day ago

[screenshot] I found the cause of the problem. @timmoon10

timmoon10 commented 13 hours ago

Good catch @liliying001. Does https://github.com/NVIDIA/TransformerEngine/pull/983 fix this issue for you guys?

1049451037 commented 2 hours ago

Yes, it's solved!