NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Example of pretraining BERT does not work #791

Open xju2 opened 2 months ago

xju2 commented 2 months ago

Describe the bug Running the BERT pretraining example hit two issues:

  1. The error "TransformerEngine only supports softmax compute in FP32". Adding --attention-softmax-in-fp32 to the model arguments works around it. This also applies to the GPT pretraining example pretrain_gpt.sh.
  2. The attention mask has shape [B, 1, max_seqlen, max_seqlen], but TransformerEngine's get_cu_seqlens expects [B, 1, 1, max_seqlen], so training crashes. See the log below and the sketch after this list.

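A minimal sketch of the shape mismatch in issue 2 (my own illustration, not TransformerEngine's actual code): deriving cumulative sequence lengths only works when collapsing the padding mask leaves one length per sample, which holds for a [B, 1, 1, max_seqlen] mask but not for the [B, 1, max_seqlen, max_seqlen] mask the BERT example passes.

import torch

def cu_seqlens_from_padding_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # Illustrative only. Expects a boolean [B, 1, 1, S] mask where True marks
    # padded positions (Megatron's masking convention).
    # [B, 1, 1, S] -> [B, S]; a [B, 1, S, S] mask stays [B, S, S] here.
    mask = attention_mask.squeeze(1).squeeze(1)
    seqlens = (~mask).sum(dim=-1, dtype=torch.int32)   # real tokens per sample
    cu_seqlens = torch.cumsum(seqlens, dim=0, dtype=torch.int32)
    zero = torch.zeros(1, dtype=torch.int32)
    # With a [B, 1, S, S] input, cu_seqlens is still 2-D here, so this cat
    # raises "Tensors must have same number of dimensions: got 1 and 2".
    return torch.cat((zero, cu_seqlens))

ok = torch.zeros(4, 1, 1, 512, dtype=torch.bool)     # shape get_cu_seqlens expects
bad = torch.zeros(4, 1, 512, 512, dtype=torch.bool)  # shape the BERT example passes
print(cu_seqlens_from_padding_mask(ok).shape)        # torch.Size([5])
cu_seqlens_from_padding_mask(bad)                    # RuntimeError, as in the log below
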
To Reproduce Run the example ./examples/pretrain_bert.sh in the docker image nvcr.io/nvidia/pytorch:24.02-py3 with the main branch of Megatron-LM. The issue was also reproduced on the core_r0.6.0 branch.

Expected behavior The example should run out of the box.

Stack trace/logs

[after dataloaders are built] datetime: 2024-04-23 00:29:39 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (5967.29, 5967.29)
    train/valid/test-data-iterators-setup ..........: (128.70, 128.70)
training ...
[before the start of training step] datetime: 2024-04-23 00:29:39 
torch.Size([4, 1, 512, 512])
Traceback (most recent call last):
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/pretrain_bert.py", line 194, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 270, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 990, in train
    train_step(forward_step_func,
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/training/training.py", line 541, in train_step
    losses_reduced = forward_backward_func(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 356, in forward_backward_no_pipelining
    output_tensor = forward_step(
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/pretrain_bert.py", line 139, in forward_step
    output_tensor = model(tokens, padding_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 179, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/module.py", line 190, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/bert_model.py", line 182, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/language_model.py", line 493, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/pscratch/sd/x/xju/LLMTracking/Megatron-LM/megatron/legacy/model/transformer.py", line 1777, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/transformer.py", line 625, in forward
    self_attention_outputs = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 3461, in forward
    context_layer = self.core_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2724, in forward
    return self.fused_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 417, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2055, in forward
    _cu_seqlens_q = get_cu_seqlens(attention_mask)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 166, in get_cu_seqlens
    cu_seqlens = torch.cat((zero, cu_seqlens))
RuntimeError: Tensors must have same number of dimensions: got 1 and 2

Environment (please complete the following information): Used the docker image: nvcr.io/nvidia/pytorch:24.02-py3.

Proposed fix N/A

Additional context N/A

abhivijay96 commented 1 month ago

Facing the same issue. Please let me know if you have found a fix. Thanks!

HenryLiu0 commented 3 weeks ago

In Megatron-DeepSpeed/megatron/model/bert_model.py, there is this line:

extended_attention_mask = bert_extended_attention_mask(attention_mask)

where bert_extended_attention_mask is defined as:

def bert_extended_attention_mask(attention_mask):
    # We create a 3D attention mask from a 2D tensor mask.
    # [b, 1, s]
    attention_mask_b1s = attention_mask.unsqueeze(1)
    # [b, s, 1]
    attention_mask_bs1 = attention_mask.unsqueeze(2)
    # [b, s, s]
    attention_mask_bss = attention_mask_b1s * attention_mask_bs1
    # [b, 1, s, s]
    extended_attention_mask = attention_mask_bss.unsqueeze(1)

    # Convert attention mask to binary:
    extended_attention_mask = (extended_attention_mask < 0.5)

    return extended_attention_mask

The attention_mask is extended from [b, s] to [b, 1, s, s]. Is this the cause of the problem? If so, how can I fix it?
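
For reference, a minimal sketch (my own, unverified against the rest of the legacy BERT path) of the mask shape TransformerEngine's get_cu_seqlens expects: a padding-only boolean mask of shape [b, 1, 1, s] rather than the broadcasted [b, 1, s, s] one built above. The helper name bert_padding_mask_for_te is made up here.

import torch

def bert_padding_mask_for_te(attention_mask: torch.Tensor) -> torch.Tensor:
    # attention_mask: [b, s], with 1 for real tokens and 0 for padding.
    # Returns a boolean [b, 1, 1, s] mask where True marks padded positions,
    # matching the convention of bert_extended_attention_mask above.
    return (attention_mask < 0.5).unsqueeze(1).unsqueeze(1)

Whether the non-fused attention paths still require the [b, 1, s, s] mask is not something I have checked, so this only illustrates the shape get_cu_seqlens expects.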

Used the docker image nvcr.io/nvidia/pytorch:23.12-py3; Megatron-LM commit ID: c4d12e2.

HenryLiu0 commented 3 weeks ago

Using the Megatron-LM 23.08 branch with the docker image nvcr.io/nvidia/pytorch:23.08-py3 avoids this problem.