NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Audio always max length #407

Open c8h10n4o2ed opened 4 years ago

c8h10n4o2ed commented 4 years ago
xDuck commented 4 years ago

Do your audio samples have silence at the end? I usually add about 0.3 s of silence to the end of my files to combat that. Too much, though, and it also won't stop.

A stop token (e.g. ~) also helps.

EuphoriaCelestial commented 3 years ago

> Do your audio samples have silence at the end? I usually add about 0.3 s of silence to the end of my files to combat that. Too much, though, and it also won't stop.
>
> A stop token (e.g. ~) also helps.

Can you give more detail about how you do it? I am also facing this problem. Sometimes the audio ends at the right time, sometimes it has a long silence at the end, and other times it speaks some "alien language" after finishing the input sentence. I tried reducing gate_threshold, but it only reduced the error rate; it didn't completely solve the problem.

lucashueda commented 3 years ago

> Can you give more detail about how you do it? I am also facing this problem. Sometimes the audio ends at the right time, sometimes it has a long silence at the end, and other times it speaks some "alien language" after finishing the input sentence. I tried reducing gate_threshold, but it only reduced the error rate; it didn't completely solve the problem.

In my tests the silence at the end makes a big difference: the model starts to learn when to stop. I did it by normalizing silence with librosa's trim effect and then concatenating 3 * hop_length of silence (a vector of zeros). For the stop token, you just need to preprocess your text like this:

If your phrase is "my name is Lucas." you process it to be "<SOS> my name is Lucas <EOS>", where <SOS> is a token indicating that the sentence is starting and <EOS> that it is ending.
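The trim-then-pad recipe described above can be sketched like this (a minimal illustration: the amplitude-threshold trim is a simplified stand-in for librosa.effects.trim, and the function name and defaults are my own, not from this repo):

```python
import numpy as np

def trim_and_pad(wav, hop_length=256, n_hops=3, threshold=1e-4):
    """Strip leading/trailing silence, then append n_hops * hop_length
    zeros so every clip ends in the same short stretch of true silence."""
    # Simplified stand-in for librosa.effects.trim: keep everything
    # between the first and last sample above a small amplitude threshold.
    nz = np.flatnonzero(np.abs(wav) > threshold)
    trimmed = wav[nz[0]:nz[-1] + 1] if nz.size else wav
    # Exact silence (zeros) at the end -- 3 * hop_length by default,
    # as suggested above, so it spans a whole number of mel frames.
    pad = np.zeros(n_hops * hop_length, dtype=wav.dtype)
    return np.concatenate([trimmed, pad])
```

Padding in whole multiples of hop_length means the appended silence maps onto complete, fully silent mel frames, which is what the gate layer actually sees.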

lucashueda commented 3 years ago

Oh man, the tokens didn't appear:

"SOS my name is Lucas EOS"

SOS is the start-of-sequence token and EOS is the end-of-sequence token.

EuphoriaCelestial commented 3 years ago

> Oh man, the tokens didn't appear:
>
> "SOS my name is Lucas EOS"
>
> SOS is the start-of-sequence token and EOS is the end-of-sequence token.

What kind of token can I use? A special symbol? Can I do this: "~ this is a sentence ."? Do they need to be different, or can I use one symbol for both SOS and EOS?

EuphoriaCelestial commented 3 years ago

> In my tests the silence at the end makes a big difference: the model starts to learn when to stop. I did it by normalizing silence with librosa's trim effect and then concatenating 3 * hop_length of silence (a vector of zeros).

I am not clear how long the silence should be (in seconds) and how to add it. My model uses the default hop_length=256. Can I read all my audio data as a NumPy array, add zeros at the end, and rewrite the file?

lucashueda commented 3 years ago

> What kind of token can I use? A special symbol? Can I do this: "~ this is a sentence ."? Do they need to be different, or can I use one symbol for both SOS and EOS?

I didn't run ablations on the token itself, so for now I'm using the default text processing of this repository and it is working.

lucashueda commented 3 years ago

> I am not clear how long the silence should be (in seconds) and how to add it. My model uses the default hop_length=256. Can I read all my audio data as a NumPy array, add zeros at the end, and rewrite the file?

Check out the get_mel and get_audio functions in my repo: I use trim to normalize the initial and ending silences and add 5 * hop_length at the beginning and end of the raw waveform data (just a vector of zeros, as you said): https://github.com/lucashueda/pt_etts/blob/master/data_preparation.py. I got these functions from another issue and just changed them a little for my project.

EuphoriaCelestial commented 3 years ago

> I didn't run ablations on the token itself, so for now I'm using the default text processing of this repository and it is working.

Where is the default text processing file? I can't find anything related to EOS/SOS tokens.

lucashueda commented 3 years ago

> Where is the default text processing file? I can't find anything related to EOS/SOS tokens.

The text processing is here: https://github.com/lucashueda/pt_etts/blob/master/text/__init__.py. In fact it doesn't add EOS or SOS tokens, but if you put a "." at the end of every sentence, I think that is sufficient to make the model work.

As I already said, I'm still new to this field, so my intuition is that the trim helps attention alignment by normalizing the start and end of all audio clips, and that the silence at the end, together with a '.' pattern at every ending, helps the model learn when to finish the decoder steps.

EuphoriaCelestial commented 3 years ago

> The text processing is here: https://github.com/lucashueda/pt_etts/blob/master/text/__init__.py. In fact it doesn't add EOS or SOS tokens, but if you put a "." at the end of every sentence, I think that is sufficient to make the model work.

I also thought about adding "." to every sentence before, but I looked at the example dataset they used for the published model, and not all of their sentences have end symbols, so I haven't tried it. But if you tested it and it worked, I will give it a try. Thanks for your help!

xDuck commented 3 years ago

I chose a pretty arbitrary small amount of silence that seemed to work for me; too much and you'll have other issues. In front I added 150 ms, and at the back 300 ms.

I do not use a start token, but I do use a stop token "~". I modified the text processing code to always include that symbol, so I don't need to manually add it to the training data or during inference.
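One way to wire that in (a sketch; the helper name is mine, and "~" has to be present in the repo's symbols list in text/symbols.py or the cleaner will drop it) is a tiny wrapper applied to every transcript before text_to_sequence:

```python
STOP_TOKEN = "~"  # assumed to be in your symbols list, or it gets filtered out

def append_stop_token(text):
    # Ensure every transcript and inference prompt ends with the stop
    # token exactly once, so the gate sees a consistent ending pattern.
    text = text.rstrip()
    if not text.endswith(STOP_TOKEN):
        text += STOP_TOKEN
    return text
```

Applying the same wrapper at both dataset-loading time and inference time keeps the model's view of sentence endings consistent, which is the point of baking it into the text processing rather than editing transcripts by hand.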


EuphoriaCelestial commented 3 years ago

@xDuck And how did you add silence to the audio files? Can I have the code?

xDuck commented 3 years ago

I don't have the code on hand anymore, but it was a very short Python script using one of the many audio libraries. Just remove all silence at the start and end, then add back silence of the length you want.

I think the library I used was PyDub.
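If PyDub isn't handy, the same trim-then-repad step can be done with the standard library's wave module plus NumPy (a sketch under my own assumptions: 16-bit mono PCM input, and a simple amplitude threshold instead of PyDub's dBFS-based silence detection; the function name and defaults are mine):

```python
import wave

import numpy as np

def retrim_wav(in_path, out_path, lead_ms=150, tail_ms=300, threshold=500):
    """Strip silence from both ends of a 16-bit mono WAV, then add back
    fixed amounts of true silence (defaults match the 150 ms front /
    300 ms back mentioned above)."""
    with wave.open(in_path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1, "expects 16-bit mono PCM"
        sr = w.getframerate()
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    # Keep everything between the first and last sample above the threshold.
    nz = np.flatnonzero(np.abs(pcm.astype(np.int32)) > threshold)
    trimmed = pcm[nz[0]:nz[-1] + 1] if nz.size else pcm

    lead = np.zeros(sr * lead_ms // 1000, dtype=np.int16)
    tail = np.zeros(sr * tail_ms // 1000, dtype=np.int16)
    out = np.concatenate([lead, trimmed, tail])

    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(out.tobytes())
```

Run over a copy of the dataset, this gives every file the same lead-in and tail regardless of how the recordings were originally cut.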


EuphoriaCelestial commented 3 years ago

Thank you, I will try it.