Attention weights with partial flat line (non-english)

SornrasakC commented 3 years ago

Hi, I have been trying to train this model on a Thai dataset (1 speaker, ~5 hours).

After ~80k steps (batch size = 1, ~31 epochs), the attention weights turn out like this:

Is it normal to see partial flat lines like this? All the issues I looked through show either an entirely flat line or a straight diagonal. Or am I being too impatient? It's only 80k steps, after all.
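One way to put a number on "partial flat line" is to look at which text position each decoder frame attends to most, and measure how long that position stays the same. The snippet below is only a minimal sketch of that idea: the function name `longest_flat_run` and the toy attention matrix are hypothetical, and the attention is assumed to be a plain list of rows (one row per decoder frame, one weight per text position), not the tensor format Flowtron actually logs.

```python
def longest_flat_run(attention):
    """Return (start_frame, length, peaks) for the longest run of decoder
    frames whose strongest attention weight stays on the same text position.

    attention: list of rows, one row per decoder frame (assumed layout).
    """
    # Index of the most-attended text position for every decoder frame.
    peaks = [max(range(len(row)), key=row.__getitem__) for row in attention]

    best_start, best_len, run_start = 0, 0, 0
    for i in range(1, len(peaks) + 1):
        # A run ends at the last frame or when the peak position changes.
        if i == len(peaks) or peaks[i] != peaks[run_start]:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = i
    return best_start, best_len, peaks


def one_hot(index, size):
    """Toy attention row that puts all weight on a single text position."""
    return [1.0 if j == index else 0.0 for j in range(size)]


# Toy example: the alignment advances, then stalls on text position 2.
rows = [one_hot(p, 5) for p in [0, 1, 2, 2, 2, 2, 3]]
start, length, peaks = longest_flat_run(rows)
print(f"longest flat run: {length} frames starting at frame {start}")
```

A long run relative to the total number of frames suggests the alignment stalls there; what counts as "too long" depends on the data, so no threshold is proposed here.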

Here's some additional info

(Is this even correct?)

The above result comes from warm starting the model from flowtron_ljs.pt with the flow=1 config file (speaker_embedding.weight ignored).

Things I have done

Trimmed the start and end of each audio clip using Librosa, and filtered out any clips longer than 10 seconds.
Set p_arpabet=0 and, instead of using Thai symbols, converted all my filelists to IPA first, then added those IPA symbols to symbols.py.
Changed the cleaners so that they don't transliterate.
Added this line to ignore embedding.weight during warm start, since it has a different shape.
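The last item, skipping weights whose shape no longer matches after the symbol set changes, can also be done generically instead of naming one layer. The following is a minimal sketch under assumptions: `filter_compatible` is a hypothetical helper, not part of Flowtron, and the stand-in objects and shapes are made up for illustration. It only relies on each entry having a `.shape` attribute, which PyTorch tensors in a `state_dict` do.

```python
from types import SimpleNamespace


def filter_compatible(pretrained_state, model_state):
    """Split pretrained weights into those that fit the new model and those
    that must be skipped (missing in the model, or different shape)."""
    kept, skipped = {}, []
    for name, tensor in pretrained_state.items():
        target = model_state.get(name)
        if target is not None and tuple(tensor.shape) == tuple(target.shape):
            kept[name] = tensor
        else:
            skipped.append(name)
    return kept, skipped


# Stand-ins for tensors; the layer names and shapes are illustrative only.
pretrained = {
    "embedding.weight": SimpleNamespace(shape=(185, 512)),
    "decoder.weight": SimpleNamespace(shape=(4, 4)),
}
new_model = {
    "embedding.weight": SimpleNamespace(shape=(60, 512)),  # new symbol count
    "decoder.weight": SimpleNamespace(shape=(4, 4)),
}

kept, skipped = filter_compatible(pretrained, new_model)
print("loaded:", list(kept))
print("skipped:", skipped)
```

With real PyTorch objects, the `kept` dictionary could then be passed to `model.load_state_dict(kept, strict=False)`, so the skipped layers keep their fresh initialization.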

Additional Questions

Does the above gate output make sense? Does this mean the model thinks all sounds end at around frame 350?
What happens if I train the model with only one flow? Can I still do inference and/or style transfer, just with lower quality? Or would it become unusable due to the absence of a reverse mapping(?)
To my understanding, the process I should follow is: first, train the model with flow=1 until the attention aligns; second, do the same with flow=2; third, turn the attn_prior off so that the model attends to the speaker. What's the sign to look for during the third step? How do I know whether the model has attended?
In the default config, the ctc_loss starts at 10k iters. Do I need to change this? Does starting it earlier or later affect anything?
Does warm starting from flowtron_ljs really help in learning a different language? I'm wondering which parts it helps with. The decoder?
Regarding emotion transfer: if I get this Thai dataset working, can I use an English dataset to transfer emotion into it? Or do I also need a Thai dataset with emotion labels?

Thank you for reading. I would really appreciate any answers or suggestions.

Bahm9919 commented 3 years ago

I think the problem is the representation of your symbols as text. You should get good alignment within 10k steps. If you don't, it means something has gone wrong.

Yes, warm starting helps.

I don't know what's happening, but yes, you can do inference with only 1 flow.

SornrasakC commented 3 years ago

I guess... I will try changing my symbols to ASCII and will post an update soon.

SornrasakC commented 2 years ago

Sorry for a very late reply.

I tried changing my symbols to ASCII tokens such as @A1 and @A2, along with new filelists that are already converted to the added symbols.
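The conversion itself is not shown in the thread. As a rough illustration only, one way to produce tokens of this kind is to assign each distinct symbol a numbered ASCII token in order of first appearance. The helper names `build_token_map` and `encode`, and the `@A` prefix handling, are assumptions for this sketch and may differ from what was actually used.

```python
def build_token_map(symbols, prefix="@A"):
    """Assign each distinct symbol an ASCII token (@A1, @A2, ...) in order
    of first appearance."""
    token_map = {}
    for symbol in symbols:
        if symbol not in token_map:
            token_map[symbol] = f"{prefix}{len(token_map) + 1}"
    return token_map


def encode(symbols, token_map):
    """Rewrite a symbol sequence as space-separated ASCII tokens."""
    return " ".join(token_map[s] for s in symbols)


# Toy sequence of phoneme symbols (already split into units).
phonemes = ["s", "a", "w", "a", "t"]
token_map = build_token_map(phonemes)
print(encode(phonemes, token_map))  # @A1 @A2 @A3 @A2 @A4
```

In practice the map would be built once over the whole corpus and every generated token would also have to be registered in symbols.py, otherwise the text frontend cannot look it up.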

Here are the results after 20k iters, which I honestly think show no difference... I wonder where it went wrong.

SornrasakC commented 2 years ago

It turns out that 20 of the ~2k audio files were just pure noise. Removing them solved the problem.
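How the noisy files were found is not stated. For anyone wanting to screen a dataset automatically, a crude heuristic is the zero-crossing rate: broadband noise changes sign far more often than voiced speech. The sketch below is an assumption-laden illustration, not the method used above. The function names and the 0.3 threshold are hypothetical, samples are assumed to be already decoded into a list of floats, and clips with many fricatives could be flagged by mistake, so flagged files should still be checked by ear.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def looks_like_noise(samples, threshold=0.3):
    """Flag a clip as suspicious when its zero-crossing rate is very high.
    The threshold is a guess and should be tuned on known-good clips."""
    return zero_crossing_rate(samples) > threshold


# Synthetic check: one second of white noise versus a 200 Hz tone at 22050 Hz.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(22050)]
tone = [math.sin(2 * math.pi * 200 * n / 22050) for n in range(22050)]

print("noise flagged:", looks_like_noise(noise))
print("tone flagged:", looks_like_noise(tone))
```

Running this over every file in the filelist and sorting by zero-crossing rate gives a short list of candidates to inspect manually.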

SornrasakC commented 2 years ago

@Bahm9919 Sorry for the @ and for commenting on a closed issue. It seems you were recently able to run inference. Would it be possible for you to kindly share how you did it?

Especially:

Torch version
Any changes in inference.py
Which WaveGlow weight
Does your submodule have this commit? Submodule path 'tacotron2': checked out '6f435f7f29c3e1553cf2dd7ca2daf56903b20c39'

Or anything else surprising? Very much appreciated.

Bahm9919 commented 2 years ago

I see your issue, will answer there.

NVIDIA / flowtron

Attention weights with partial flat line (non-english) #137