NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0
890 stars, 177 forks

Attention weights with partial flat line (non-english) #137

Closed. SornrasakC closed this issue 2 years ago.

SornrasakC commented 3 years ago

Hi, I have been trying to train this model on a Thai dataset (1 speaker, ~5 hours).

After ~80k steps (batch size = 1, ~31 epochs), the attention weights look like this:

image

Is it normal to see partial flat lines like this? All the issues I looked through show either an entirely flat line or a straight diagonal. Or am I just being too impatient? It's only 80k steps, after all.

Here's some additional info image

(Is this even correct?) image

The results above come from warm-starting the model from flowtron_ljs.pt with the flow=1 config file (speaker_embedding.weight ignored).
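For readers unfamiliar with warm-starting while ignoring a layer, here is a minimal sketch of the general idea, not Flowtron's own code. The function name `select_warmstart_weights`, every key other than `speaker_embedding.weight`, and all shapes are hypothetical; the stand-in objects only mimic the `.shape` attribute of real tensors.

```python
from types import SimpleNamespace


def select_warmstart_weights(pretrained, model_state, ignore_keys=()):
    """Pick the pretrained entries that can be copied into the new model.

    An entry is skipped if its key is listed in `ignore_keys`, if the new
    model has no parameter with that name, or if the shapes differ.
    """
    selected, skipped = {}, []
    for key, value in pretrained.items():
        target = model_state.get(key)
        compatible = (
            key not in ignore_keys
            and target is not None
            and tuple(target.shape) == tuple(value.shape)
        )
        if compatible:
            selected[key] = value
        else:
            skipped.append(key)
    return selected, skipped


# Stand-ins for tensors: only `.shape` matters here. Shapes are made up.
pretrained = {
    "speaker_embedding.weight": SimpleNamespace(shape=(1, 128)),
    "encoder.weight": SimpleNamespace(shape=(512, 512)),
    "embedding.weight": SimpleNamespace(shape=(185, 512)),
}
new_model = {
    "speaker_embedding.weight": SimpleNamespace(shape=(1, 128)),
    "encoder.weight": SimpleNamespace(shape=(512, 512)),
    "embedding.weight": SimpleNamespace(shape=(90, 512)),  # different symbol count
}

selected, skipped = select_warmstart_weights(
    pretrained, new_model, ignore_keys={"speaker_embedding.weight"}
)
print("copied:", sorted(selected))
print("skipped:", skipped)
```

With PyTorch, the `selected` dictionary could then be passed to `model.load_state_dict(selected, strict=False)`. Printing the skipped keys is useful because a changed symbol set alters the text embedding shape, so that layer would start from random weights.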

Things I have done

Additional Questions

Thank you for reading. I would really appreciate any answers or suggestions.

Bahm9919 commented 3 years ago

I think the problem is how your symbols are represented in the text. You should get a good alignment within 10k steps. If you don't, it means something has gone wrong.

Yes, warm-starting helps.

I don't know what's happening, but yes, you can do inference with only 1 flow.

SornrasakC commented 3 years ago

I guess... I will try changing my symbols to ASCII and will post an update soon.

SornrasakC commented 2 years ago

Sorry for the very late reply.

I tried changing my symbols to tokens such as @A1 @A2, along with new filelists that are already converted to the added symbols.
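As a rough illustration of that kind of conversion, the sketch below maps each distinct non-ASCII character to an ASCII token and rewrites a filelist line. It is not the poster's actual script. The function names, the `@A` numbering scheme, the space-separated output, and the `path|text|speaker` line layout are all assumptions made for the example.

```python
def build_symbol_table(texts, prefix="@A"):
    """Assign one ASCII token per distinct non-ASCII character,
    numbered in order of first appearance (hypothetical scheme)."""
    table = {}
    for text in texts:
        for ch in text:
            if not ch.isascii() and ch not in table:
                table[ch] = f"{prefix}{len(table) + 1}"
    return table


def encode_text(text, table):
    """Replace mapped characters with their tokens, separated by spaces.

    Simplification: original whitespace is dropped, so word boundaries
    are not preserved.
    """
    return " ".join(table.get(ch, ch) for ch in text if not ch.isspace())


def convert_filelist_line(line, table):
    """Convert the text field of an assumed `path|text|speaker` line."""
    path, rest = line.rstrip("\n").split("|", 1)
    text, speaker = rest.rsplit("|", 1)
    return f"{path}|{encode_text(text, table)}|{speaker}"


# Tiny demo with two Thai characters.
table = build_symbol_table(["กา", "กาก"])
converted = convert_filelist_line("wavs/001.wav|กาก|0", table)
print(table)
print(converted)
```

Whatever scheme is used, every token that appears in the converted filelists also has to exist in the model's symbol list, otherwise the text cannot be encoded.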

Here are the results after 20k iterations, which I honestly think show no difference... I wonder where it went wrong.

image

image

SornrasakC commented 2 years ago

It turns out that 20 of the ~2k audio files were just pure noise; removing them solved the problem.

image
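The thread does not say how the noisy files were found. One simple heuristic for screening a dataset, offered only as a sketch, is the zero-crossing rate: white noise changes sign on roughly half of all adjacent samples, while speech averaged over a whole clip usually sits far lower. The function names and the 0.35 threshold below are assumptions, and the demo uses synthetic signals rather than real recordings.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def flag_noise_like(clips, threshold=0.35):
    """Return the names of clips whose zero-crossing rate exceeds
    `threshold` (a guessed value, to be tuned per dataset)."""
    return [
        name
        for name, samples in clips.items()
        if zero_crossing_rate(samples) > threshold
    ]


# Synthetic demo: one second of a 220 Hz tone versus uniform white noise.
rng = random.Random(0)
sample_rate = 22050
tone = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate)]

flagged = flag_noise_like({"tone.wav": tone, "noise.wav": noise})
print(flagged)
```

Flagged files are only candidates. Listening to them before deleting anything is still advisable, since some legitimate recordings can also score high.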

SornrasakC commented 2 years ago

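@Bahm9919 Sorry for the @ and for commenting on a closed issue. I saw that you were recently able to run inference. Would you kindly share how you did it?

Especially:

  • Torch version
  • Any changes in inference.py
  • Which WaveGlow weights
  • Does your submodule have this commit? Submodule path 'tacotron2': checked out '6f435f7f29c3e1553cf2dd7ca2daf56903b20c39'

Or anything else surprising? Very much appreciated.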

Bahm9919 commented 2 years ago

@Bahm9919 Sorry for the @ and for commenting on a closed issue. I saw that you were recently able to run inference. Would you kindly share how you did it?

Especially:

  • Torch version
  • Any changes in inference.py
  • Which WaveGlow weights
  • Does your submodule have this commit? Submodule path 'tacotron2': checked out '6f435f7f29c3e1553cf2dd7ca2daf56903b20c39'

Or anything else surprising? Very much appreciated.

I see your issue; I will answer there.