NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

merge_mp_partitions.py fails with an exception #57

Closed gcooper-isi closed 1 month ago

gcooper-isi commented 3 years ago

When I run tools/merge_mp_partitions.py, it fails with an exception:

Traceback (most recent call last):
  File "merge_mp_partitions.py", line 286, in <module>
    main()
  File "merge_mp_partitions.py", line 212, in main
    merged_model = get_model(model_type)
  File "merge_mp_partitions.py", line 125, in get_model
    model = model_provider()
  File "/data/gcooper/nlg-evaluation/Megatron-LM/pretrain_gpt2.py", line 35, in model_provider
    model = GPT2Model(num_tokentypes=0, parallel_output=True)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/gpt2_model.py", line 51, in __init__
    args.num_layers))
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 62, in get_language_model
    add_pooler=add_pooler)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 283, in __init__
    self.num_tokentypes)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 123, in __init__
    vocab_size, self.hidden_size, init_method=self.init_method)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 145, in __init__
    partition_dim=0, stride=1)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 58, in _initialize_affine_weight_gpu
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/random.py", line 183, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added

When training, the RNG state gets set in initialize_megatron(), but that is not called in this case.
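
A minimal workaround sketch (untested; assuming the megatron.mpu API shown in the traceback above) would be to register the missing "model-parallel-rng" state by hand before the model is built, which is what initialize_megatron() would otherwise do during training:

import torch
from megatron.mpu.random import get_cuda_rng_tracker

# Hypothetical seed value; the merged weights are loaded from the checkpoints,
# so the exact seed should not matter for merging.
seed = 1234
torch.cuda.manual_seed(seed)
# Register the state that _initialize_affine_weight_gpu() later forks into.
get_cuda_rng_tracker().add('model-parallel-rng', seed)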

hejjack commented 3 years ago

Hello, has anybody solved this problem? Is there a workaround? Thanks.

Lavenderjiang commented 3 years ago

Same problem here

marscrazy commented 2 years ago

I have a similar problem.

Traceback (most recent call last):
  File "/data/liuguang/Sailing/tests/test_trainer_deepspeed.py", line 193, in <module>
    print(model(batch))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1589, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model.py", line 305, in forward
    model_out = self.model(input_ids, position_ids, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model_mpu.py", line 122, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 655, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 402, in forward
    attention_output = self.attention(layernorm_output, ltor_mask, position_embeddings, r_w_bias, r_r_bias, mem)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/layers/attentions_mpu.py", line 394, in forward
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added

Besides, what is "with get_cuda_rng_tracker().fork():" actually doing?
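
Roughly, fork() temporarily swaps the current CUDA RNG state for a separately tracked, named state (here "model-parallel-rng"), so that operations such as dropout inside tensor-parallel regions draw from a per-rank stream, and restores the default state on exit. A conceptual sketch of that behaviour (not the actual Megatron/DeepSpeed implementation):

import contextlib
import torch

class TinyRNGTracker:
    """Conceptual stand-in for Megatron's CudaRNGStatesTracker (sketch only)."""

    def __init__(self):
        self.states = {}

    def add(self, name, seed):
        # Remember the CUDA RNG state produced by this seed, then put the
        # default state back so the caller's random stream is untouched.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name='model-parallel-rng'):
        if name not in self.states:
            raise Exception('cuda rng state {} is not added'.format(name))
        # Swap in the named state for the duration of the block ...
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            # ... then record where it advanced to and restore the default state.
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)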

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days.

ZhangEnmao commented 8 months ago

I got the same problem. Have you solved it?

github-actions[bot] commented 6 months ago

Marking as stale. No activity in 60 days.

tlogn commented 5 months ago

It is caused by not initializing the RNG state. The code below should work:

import torch
import torch.distributed as dist
from megatron.core import mpu, tensor_parallel

dist.init_process_group()
torch.cuda.set_device(dist.get_rank())
mpu.initialize_model_parallel(xxxx)  # xxxx: your tensor-model-parallel size
tensor_parallel.random.model_parallel_cuda_manual_seed(xxx)  # xxx: an RNG seed
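
For context: model_parallel_cuda_manual_seed() is the call that registers the "model-parallel-rng" state that fork() later looks up, and dist.init_process_group() as written expects the usual rendezvous environment variables (for example when launched with torchrun, one process per GPU).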

felipeliliti commented 5 months ago

There is a related suggestion to expand the script to merge both tensor and pipeline parallelism, and also to provide a script that splits the checkpoint back into separate partitions. That may be worth investigating as well.

github-actions[bot] commented 3 months ago

Marking as stale. No activity in 60 days.