TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

All training results of tacotron2 and hifiGAN are nan. #563

Closed · Ellie1013 closed this issue 3 years ago

Ellie1013 commented 3 years ago

My server is a GeForce RTX 3090 running Linux 20.04 with CUDA 10.1 and cuDNN 7.6.5.

The problems are...

  1. The loading time is too long. After printing max_char_length, loading took about an hour, unlike the logs in the screenshot below. (screenshot)

There was also another error: "Value 'sm_86' is not defined for option 'gpu-name'". I found that the cause is the ptxas binary in the CUDA 10.1 bin directory, so I replaced it with the one from the CUDA 11.2 bin. Then an error about CPU support instructions happened instead. I don't know how to make my GPU work with this CUDA version.

  2. All results of tacotron2 and hifiGAN are nan. I tried many ways, but all of them failed:

CUDA_VISIBLE_DEVICES=1 python examples/tacotron2/train_tacotron2.py --train-dir ./dump_kss/train/ --dev-dir ./dump_kss/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.kss.v1.yaml --use-norm 1 --mixed_precision 0 --resume ""

CUDA_VISIBLE_DEVICES=0,1 python examples/tacotron2/train_tacotron2.py --train-dir ./dump_kss/train/ --dev-dir ./dump_kss/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.kss.v1.yaml --use-norm 1 --mixed_precision 0 --resume ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/ckpt-10000

CUDA_VISIBLE_DEVICES=0,1 python examples/tacotron2/train_tacotron2.py --train-dir ./dump_kss/train/ --dev-dir ./dump_kss/valid/ --outdir ./examples/tacotron2/exp/train.tacotron2.v1/ --config ./examples/tacotron2/conf/tacotron2.kss.v1.yaml --use-norm 1 --resume ./examples/tacotron2/exp/train.tacotron2.v1/checkpoints/ckpt-10000

(screenshots of the training logs from these runs)

dathudeptrai commented 3 years ago

@Ellie1013 Almost all nan issues with this repo are caused by the data. You need to check whether the forward step is the source of the nan problem (note that you need to use batch_size=1):

for data in tqdm(train_dataloader):
    output = tacotron2(**data)  # 1 step forward
    # code to check nan for output here. 
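
For reference, a minimal sketch of that check (not the repo's own code; it assumes eager execution and that the model returns a tuple of float tensors when called this way) could look like:

import tensorflow as tf
from tqdm import tqdm

for i, data in enumerate(tqdm(train_dataloader)):
    outputs = tacotron2(**data)  # 1 step forward
    # flatten the returned structure and test every float tensor for nan
    tensors = outputs if isinstance(outputs, (list, tuple)) else [outputs]
    if any(t is not None and t.dtype.is_floating and bool(tf.reduce_any(tf.math.is_nan(t)))
           for t in tensors):
        print(f"nan detected in batch {i}")
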
Ellie1013 commented 3 years ago

@dathudeptrai (screenshot) This screenshot shows the outputs: every value is nan. Is my dataset wrong?

dathudeptrai commented 3 years ago

@Ellie1013 Is it still OK before 10000 steps?

Ellie1013 commented 3 years ago

@dathudeptrai No. Before I bought the RTX 3090 server, I practiced on an Azure cloud virtual machine (V100), and I ran it there before 10000 steps. The result on the Azure cloud didn't have any problems, so I ran the same process on the RTX 3090. I tried running from both step 0 and step 10000, and both had the nan problem.

dathudeptrai commented 3 years ago

@Ellie1013 Almost all nan issues with this repo are caused by the data. You need to check whether the forward step is the source of the nan problem (note that you need to use batch_size=1):

for data in tqdm(train_dataloader):
    output = tacotron2(**data)  # 1 step forward
    # code to check nan for output here. 

@Ellie1013 Did you check?

Ellie1013 commented 3 years ago

I tried to run your code, but I don't know where to insert it. In the Tacotron2 code, the output isn't the same as the result in the image below (train_stop_token_loss, ...). (screenshot) And all of the outputs are in base_trainer, I think, so please tell me where I should insert your code. Thank you.

Ellie1013 commented 3 years ago

And I tried running on the Azure cloud (V100) with the data that was made on the RTX 3090. (screenshot) That run works, so I think it is not a data problem. Maybe there is some problem in how I set up the training environment?
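
For anyone who wants to rule the data in or out directly, a quick scan of the dumped feature files for nan could look like this (a rough sketch; it assumes the preprocessing step wrote the features as .npy files under the dump_kss directories used in the training commands above):

import glob
import numpy as np

# walk the dump directory and flag any float feature file containing nan
for path in glob.glob("./dump_kss/**/*.npy", recursive=True):
    arr = np.load(path)
    if np.issubdtype(arr.dtype, np.floating) and np.isnan(arr).any():
        print("nan found in", path)
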

dathudeptrai commented 3 years ago

@Ellie1013 What is your TF version?

Ellie1013 commented 3 years ago

@dathudeptrai tf2.2. I installed tf2.3 first, but there was a problem, so I changed to tf2.2. The problem is shown in the picture at this link: https://www.notion.so/error-daf6b74d6b404829aaf833fbb90a20e2#bc890f2f54f74524a8f22dc2df232131

NoCodeAvaible commented 3 years ago

Hey @Ellie1013, I am currently running into the same problem with my 3080 :/ Did you fix the problem?

NoCodeAvaible commented 3 years ago

@Ellie1013 Is it still OK before 10000 steps?

@dathudeptrai It's because of the evaluation :) When I set the evaluation interval to 100 steps, the loss turns nan right after the evaluation. If I set the evaluation interval to 1000 steps, the loss turns to nan immediately.

Ellie1013 commented 3 years ago

Hey @Ellie1013, I am currently running into the same problem with my 3080 :/ Did you fix the problem?

Yes, I fixed it. I changed the setup to CUDA 11, cuDNN 8, tf2.4, and Python 3.8, and then every problem was fixed. I think the cause was that cuDNN was not working, because the 3080 is only supported from CUDA 11.
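
As a quick sanity check of a setup like that (a small sketch; tf.sysconfig.get_build_info() is only available from TF 2.3 onward), the following prints the CUDA/cuDNN versions the TensorFlow build expects and the GPUs it can actually see:

import tensorflow as tf

print(tf.__version__)
# CUDA / cuDNN versions this TensorFlow build was compiled against (TF >= 2.3)
print(tf.sysconfig.get_build_info())
# GPUs visible at runtime; an empty list usually means a CUDA/cuDNN mismatch
print(tf.config.list_physical_devices("GPU"))
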

dathudeptrai commented 3 years ago

@Ellie1013 Hi, I plan to change the default TF version to tf2.4, but it is not tested yet. Could you tell me if there are any bugs when using tf2.4?

Ellie1013 commented 3 years ago

@dathudeptrai So far it hasn't caused any bugs related to the TF version. I am running tacotron2 and hifiGAN. I have a hifiGAN exploding-gradient problem after 200K steps, so I am trying to fix that, but I don't think the reason is the TF version. Tacotron2 training is working well.

CracKCatZ commented 3 years ago

Hey @Ellie1013,

did you fix the nan loss problem? If yes, could you please tell me how?

Thanks in advance for your reply :)

CracKCatZ commented 3 years ago

Hey @Ellie1013, I am currently running into the same problem with my 3080 :/ Did you fix the problem?

Yes, I fixed it. I changed the setup to CUDA 11, cuDNN 8, tf2.4, and Python 3.8, and then every problem was fixed. I think the cause was that cuDNN was not working, because the 3080 is only supported from CUDA 11.

Sorry, I missed your answer @Ellie1013, but thank you very, very much :) It fixed it for me too :)