FENRlR / MB-iSTFT-VITS2

Application of MB-iSTFT-VITS components to vits2_pytorch
MIT License

rank error #15

Open zhanglina94 opened 9 months ago

zhanglina94 commented 9 months ago

Hi there,

I have a question about training the model.

I encountered the following error during training:

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 241, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc],
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 359, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[139681], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).
```

I observed that the error disappeared after 16 epochs of training. Then I tried training again, and when training reached epoch 40 it stopped with the same error. Why is this?

Best regards.
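[Editor's note] The RuntimeError above means the two DDP ranks diverged: rank 1 issued an allreduce on a `[139681]`-element float gradient bucket while rank 0 issued a different allreduce, so the per-collective fingerprint check (enabled by `TORCH_DISTRIBUTED_DEBUG=DETAIL`) aborted training. The minimal sketch below, not from the thread, shows what a *matching* collective looks like across two CPU ranks with the `gloo` backend; all names (`worker`, `run_demo`, the port number) are hypothetical and unrelated to `train.py`.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Hypothetical standalone setup; MB-iSTFT-VITS2's train.py does its own init.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29517")
    # DETAIL mode wraps each collective with the CollectiveFingerPrint
    # check that raised the RuntimeError in the traceback above.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # Every rank must reach this all_reduce with the same op, shape, and
    # dtype. If one rank skipped it, or reduced a differently shaped
    # gradient bucket, the fingerprints would mismatch as in the error.
    t = torch.full((4,), float(rank + 1))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.tolist() == [3.0, 3.0, 3.0, 3.0]  # 1.0 + 2.0 from the two ranks
    dist.destroy_process_group()

def run_demo(world_size: int = 2) -> None:
    # "fork" keeps the demo self-contained when run from a script;
    # train.py launches its workers with the default "spawn" method.
    mp.start_processes(worker, args=(world_size,), nprocs=world_size,
                       start_method="fork")

if __name__ == "__main__":
    run_demo()
```

In real training the divergence usually comes from rank-dependent control flow (e.g. one rank taking a branch that runs an extra backward, or crashing mid-step so the survivor's next bucket no longer lines up).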

FENRlR commented 9 months ago

I really have no clue about how to reproduce this error. It seems, however, that someone has run into such a situation before ([PS2]).

zhanglina94 commented 9 months ago

Thanks for your reply.

This error occurs when my GPU is occupied by other processes. It's something I hadn't encountered before, and I'm not sure why it happens.

And that blog post is mine. If I restart training, it runs again for a while, but then it hits the same problem.