NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Tacotron skipping words during inference #469

Open EuphoriaCelestial opened 3 years ago

EuphoriaCelestial commented 3 years ago

Hi, I've trained my model from scratch (my own data, a different language) and it works fairly well, except that it skips random words in a sentence. This happens rarely, but it's still a problem. It also sometimes generates random speech at the end of a sentence. I only trained the Tacotron model; I am using the pretrained WaveGlow model: waveglow_256channels_universal_v5.pt

toanil315 commented 2 years ago

@EuphoriaCelestial Hi, I'm training on Vietnamese speech with batch size 32, and after checkpoint 111k the voice is still bad. How long did it take for your audio to become acceptable? My dataset: https://drive.google.com/file/d/1LNKZNkv4jk4ifnh_U3g39PzKuWz-WLAi/view?usp=sharing Could you share your dataset with me so I can continue training? Thanks

EuphoriaCelestial commented 2 years ago

@toanil315 I am really sorry, but my dataset is not available to share. It belongs to my clients, so I don't have the right to make it public.

toanil315 commented 2 years ago

@EuphoriaCelestial Thanks for the reply, it's fine; sorry for the abruptness. With this model, how many iterations did it take in your case for the sound to become acceptable? For me, at 179k the voice is still bad (it doesn't sound like Vietnamese). What batch size and learning rate did you train with? (I'm using batch_size 32 and learning_rate 3.5e-4.)

EuphoriaCelestial commented 2 years ago

@toanil315 With my dataset, I stopped training at 651,000 iterations, but I got acceptable results from checkpoint 132,000. batch_size depends on the hardware you train on; in my case it was 16. I adjusted the learning rate over the course of training, so it isn't a single value, but I started with 1e-3 as usual.

toanil315 commented 2 years ago

@EuphoriaCelestial Hi, this is my alignment after 179k iterations: [alignment plot attachment] Do I need to change or verify my dataset? Sorry, my English is bad.

EuphoriaCelestial commented 2 years ago

@toanil315 That's the million-dollar question; I don't know the exact answer. But increasing the amount of data may help.

toanil315 commented 2 years ago

@EuphoriaCelestial Thanks, I'll try it. Have a good day.

toanil315 commented 2 years ago

@EuphoriaCelestial Hi, this is my checkpoint using batch_size = 8, and the result is pretty good, but generating audio takes a long time. I'm using the pretrained WaveGlow model to generate audio. This is my checkpoint: https://drive.google.com/file/d/1CnW8pPGMuAoCj9mbIazhpfM-WgU49Zsw/view?usp=sharing and this is my result: https://drive.google.com/file/d/10t2USAuCENRTEV13JaeNCqcDvSofL5lZ/view?usp=sharing Do I need to continue training WaveGlow to increase inference speed? Thanks

EuphoriaCelestial commented 2 years ago

@toanil315 Continued training will not affect inference speed in any way; it only depends on hardware power and the inference code.

toanil315 commented 2 years ago

@EuphoriaCelestial I'm using the inference.ipynb file in the tacotron2 repo; do I need to change it? Maybe I will try different hardware.

EuphoriaCelestial commented 2 years ago

@toanil315 I haven't tried that Jupyter notebook, I used a different one, but I believe it won't make a significant difference. To increase speed without upgrading the hardware, you should try something more serious, like TensorRT, CUDA, porting the code to C++, ...

toanil315 commented 2 years ago

@EuphoriaCelestial Thanks for the advice, I'll try it. Have a good day.

toanil315 commented 2 years ago

@EuphoriaCelestial Hi, I have another question. When sentences have 5-10 words, it works pretty well, but with a short sentence like "cấp 3", the output generates strange language at the end of the sentence. I tried decreasing gate_threshold to 0.1, but sometimes it still happens. How can I fix it?
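(For context, `gate_threshold` is the stop-token probability cutoff defined in the repo's hparams; lowering it makes the decoder end decoding sooner. A minimal sketch of the tweak being described, assuming the repo's `create_hparams` helper:)

```python
from hparams import create_hparams  # helper from the tacotron2 repo

hparams = create_hparams()
hparams.gate_threshold = 0.1  # default is 0.5; lower values stop decoding earlier
```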

EuphoriaCelestial commented 2 years ago

@toanil315 Yeah, short sentences are a big challenge for any TTS system. Like you, I can't get rid of it completely; sometimes it happens, sometimes it doesn't. So my solution is to cut the strange audio at the end if the output audio is longer than expected.

toanil315 commented 2 years ago

> @toanil315 yea, short sentences is a big challenge for any tts system. Like you, I also can not get rid of it completely, sometimes it happens, sometimes it not. So my solution is cutting the strange audio in the end if the output audio is longer than expected

Can you guide me on how to do it? I have no idea how to approach this problem. How can I predict the duration of the audio and then cut off the redundant part?

EuphoriaCelestial commented 2 years ago

After the model finishes inference, read the audio as an array (NumPy or whatever you like). Compare the size of the array with the length of your sentence, and you will know the approximate ratio between the two. For example: sentence "Hello world", array length 1000, so each character in the sentence becomes roughly 100 items in the array. Once you know this ratio, check every output audio; if it runs over the expected length by too much, cut it at the point where you think the audio should have ended. This method is not the best, and it may not cut the audio very accurately.
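The heuristic above can be sketched in a few lines. The function name, the samples-per-character ratio, and the slack factor are illustrative assumptions; the ratio would be estimated from your own known-good outputs:

```python
def trim_overrun(audio, text, samples_per_char, slack=1.3):
    """Cut trailing audio that runs past the expected length.

    samples_per_char is the ratio estimated from known-good outputs
    (roughly 100 in the "Hello world" example above); slack allows some
    natural variation before trimming kicks in.
    """
    expected = int(len(text) * samples_per_char)
    if len(audio) > expected * slack:
        return audio[:expected]
    return audio

# Example: a 2000-sample output for an 11-character sentence is well over
# the expected length, so it gets trimmed back to 1100 samples.
audio = [0.0] * 2000
trimmed = trim_overrun(audio, "Hello world", samples_per_char=100)
print(len(trimmed))  # 1100
```

An output only slightly longer than expected is left untouched, which is why the slack factor matters: cutting every utterance to the exact ratio would clip natural pauses.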

toanil315 commented 2 years ago

> after the model finished the inference process, read the audio as array or numpy, whatever you like. Check the size of the array and compare it with the length of your sentence, you will know the approximate ratio between these two, for example: sentences: "Hello world" array length: "1000" so, each character in the sentence will become 100 item in the array; after knowing this ratio, check all output audio if it go over the expected length too much, if it does, cut it at the point you think the audio have ended this method is not the best and maybe it will not cut the audio very accurate

I'll try it, thanks.

thanhlong1997 commented 2 years ago

One solution for short sentences in Tacotron 2 is: when you train Tacotron 2, also train a duration predictor jointly. This additional network is simply an LSTM or convolution that takes the text sequence as input and outputs the number of mel frames (which you can extract from the ground-truth audio). The loss is the sum of the mel loss and the duration loss. At inference, you compute the duration first, and in the Tacotron decoder's while-true loop you can estimate when to stop looping based on that precomputed duration. Anyway, I switched to a non-autoregressive model like FastSpeech 2 when Tacotron 2 took too much inference time.
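The inference-time stop rule described above can be sketched independently of the model. Here `step_fn` is a hypothetical stand-in for one decoder step (the real loop lives in the repo's `model.py`), and the overrun factor is an illustrative assumption:

```python
def decode_with_duration_cap(step_fn, predicted_frames, overrun=1.2):
    """Run an autoregressive decoder loop, but stop once the externally
    predicted duration (in mel frames) is exceeded, even if the stop
    token never fires (the short-sentence failure mode)."""
    frames = []
    t = 0
    max_steps = int(predicted_frames * overrun)
    while True:
        frame, stop = step_fn(t)  # one decoder step: (mel frame, stop flag)
        frames.append(frame)
        t += 1
        if stop or t >= max_steps:
            break
    return frames

# Toy step function whose stop token never fires: the duration cap
# (10 frames * 1.2 = 12 steps) still ends the loop.
frames = decode_with_duration_cap(lambda t: ([0.0] * 80, False), predicted_frames=10)
print(len(frames))  # 12
```

When the stop token does fire before the cap, the loop ends normally, so the cap only acts as a safety net on runaway outputs.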

EuphoriaCelestial commented 2 years ago

> one solution to resolve short sentence in tacotron2 is: when u training tacotron2, u need to join training duration prediction too. This addition network is simply a lstm or convolution take input as text sequence and output is the number of mel frame (that u can extract from ground truth audio output) the loss is sum of mel loss and duration loss. when inference, u need to caculate duration first and in while true loop of tacotron decoder u can estimate when to stop looping follow that duration caculated before that. Any way, i switch to non auto regressive model like fastspeech 2 when tacotron2 take too much inference time

Correct, but it will require more training resources (data, computing power, training time, ...) and it has to be done from the start. In my case, I have trained the Tacotron model for an extremely long time, so retraining would be a waste. And yes, FastSpeech 2 is way better: it can control duration, pitch, and speed, and it supports phoneme training, which is very good for non-Latin scripts.

toanil315 commented 2 years ago

> one solution to resolve short sentence in tacotron2 is: when u training tacotron2, u need to join training duration prediction too. This addition network is simply a lstm or convolution take input as text sequence and output is the number of mel frame (that u can extract from ground truth audio output) the loss is sum of mel loss and duration loss. when inference, u need to caculate duration first and in while true loop of tacotron decoder u can estimate when to stop looping follow that duration caculated before that. Any way, i switch to non auto regressive model like fastspeech 2 when tacotron2 take too much inference time
>
> correct, but it will require more training resource (data, computing power, train time, ...) and it has to be done at the first place. In my case, I have trained Tacotron model for an extremely long time, so re-train it will be a waste and yes, fastspeech 2 is ways better, it can control duration, pitch, speed and support phonemes training, which is very good for non latin characters

Actually, this is part of a big university project, and I don't have any experience in this area. Maybe I will try @EuphoriaCelestial's approach. Thanks for the advice, @thanhlong1997.