Closed — gcooper-isi closed this issue 1 month ago
Hello, has anybody solved this problem? Is there a workaround? Thanks.
Same problem here
I have a similar problem.
Traceback (most recent call last):
  File "/data/liuguang/Sailing/tests/test_trainer_deepspeed.py", line 193, in <module>
    print(model(batch))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1589, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model.py", line 305, in forward
    model_out = self.model(input_ids, position_ids, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model_mpu.py", line 122, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 655, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 402, in forward
    attention_output = self.attention(layernorm_output, ltor_mask, position_embeddings, r_w_bias, r_r_bias, mem)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/layers/attentions_mpu.py", line 394, in forward
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added
Besides, what is "with get_cuda_rng_tracker().fork():" actually doing?
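For context: fork() temporarily swaps the current CUDA RNG state for a separately tracked one, so dropout inside model-parallel regions draws from a seed that differs across tensor-parallel ranks, then restores the original state on exit. A minimal sketch of the idea (simplified from DeepSpeed's checkpointing.py; the class and method bodies here are illustrative, not the exact implementation):

import contextlib
import torch

class CudaRNGStatesTracker:
    # Simplified illustration of the tracker behind get_cuda_rng_tracker().
    def __init__(self):
        self.states_ = {}  # name -> saved CUDA RNG state

    def add(self, name, seed):
        # Seed the device RNG, snapshot that state under `name`,
        # then restore whatever state was active before.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states_[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name='model-parallel-rng'):
        if name not in self.states_:
            # The exception in the traceback above: fork() ran before
            # anything called add() to register this state.
            raise Exception('cuda rng state {} is not added'.format(name))
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states_[name])
        try:
            yield
        finally:
            # Keep the advanced tracked state; restore the default state.
            self.states_[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)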
Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.
I got the same problem. Have you solved it?
It is caused by not initializing the RNG state. The code below should work:
import torch
import torch.distributed as dist
from megatron.core import mpu, tensor_parallel

# Join the default process group (torchrun supplies rank/world size via env vars).
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())  # fine for a single node

# Create the model-parallel groups; use your actual tensor-parallel degree here.
mpu.initialize_model_parallel(tensor_model_parallel_size=1)

# Register the 'model-parallel-rng' CUDA RNG state that
# get_cuda_rng_tracker().fork() expects; the base seed is arbitrary (e.g. 1234).
tensor_parallel.random.model_parallel_cuda_manual_seed(1234)
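For example, saving the snippet above as init_rng.py (a hypothetical file name) and launching it on two GPUs of one node:

torchrun --nproc_per_node=2 init_rng.py

After model_parallel_cuda_manual_seed() runs, the 'model-parallel-rng' state exists in the tracker and get_cuda_rng_tracker().fork() no longer raises.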
There is a related suggestion to extend the script to merge both tensor and pipeline parallelism, and also to provide a script for splitting the checkpoint into separate partitions. That may be worth investigating too.
When I run tools/merge_mp_partitions.py, it fails with the same exception.
When training, the RNG state gets set in initialize_megatron(), but that is not called in this case.
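A minimal workaround, then, is to register the RNG state yourself before the merge script builds the model. A sketch, assuming the classic Megatron-LM layout where the tracker's seeding helper is exposed via megatron.mpu (adjust the import to match your tree):

# Hypothetical patch near the top of tools/merge_mp_partitions.py,
# after the distributed and model-parallel groups are initialized.
from megatron import mpu

# Registers the 'model-parallel-rng' CUDA RNG state that
# get_cuda_rng_tracker().fork() later expects. The seed value does not
# matter here, since merging weights does not use dropout.
mpu.model_parallel_cuda_manual_seed(1234)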