NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

detailed work pipeline to train a multi-speaker flowtron model #113

Open JohnHerry opened 3 years ago

JohnHerry commented 3 years ago

Hi all, I am new to this; has anybody tried to train Flowtron in multi-speaker mode? It seems Flowtron needs a TWO-STAGE training, but there is only one config.json file, and I don't know how to modify this config for the two stages. What does "n_flows" mean? Is there a demo for a multi-speaker setup? And if my language is not English, what steps should I follow?

rafaelvalle commented 3 years ago

We provide a checkpoint for LibriTTS with over 2k speakers. Turn the attention prior to True before training. After training for some time, once the model has learned to attend, set it to False and resume training.
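
For the first phase this corresponds to the -p override that train.py accepts (the same form used later in this thread), with the attention prior switched on in data_config:

python train.py -c config.json -p data_config.use_attn_prior=1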

JohnHerry commented 3 years ago

@rafaelvalle Thanks for your help. I am training Flowtron on a language other than English, so I have to train from scratch. There is no pretrained Tacotron 2 model I could use as a text encoder, so do I need to train a Tacotron 2 on my multi-speaker corpus first?

rafaelvalle commented 3 years ago

No, you will not need Tacotron 2. Just make sure to turn the attention prior to True until the model learns attention; it's OK to train 2 steps of flow at once. Then turn the attention prior to False and resume training. https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
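
Once the model attends, the second phase would then be a resume with the prior turned off, along the lines of the sketch below; the checkpoint path is a placeholder, and using train_config.checkpoint_path as the resume field is my assumption, so check the field names in your config.json:

python train.py -c config.json -p data_config.use_attn_prior=0 train_config.checkpoint_path="outdir/model_100000"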

JohnHerry commented 3 years ago

@rafaelvalle My config.json is as follows: [screenshot of config.json]

I changed three values according to my dataset, and I did not set use_attn_prior in the file; instead, I strictly followed the training command from your documentation:

python train.py -c config.json -p data_config.use_attn_prior=1

Our dataset contains 67 hours of speech from 142 speakers.

Should I first set the parameter "n_flows" to 1 until attention is good, then set it to 2 for the second stage, and so on?

How many steps should I train for to get good attention in the first stage?

JohnHerry commented 3 years ago

I have run the first stage from scratch for three days on 6 RTX 3090 GPUs in total, but the attention still looks strange. Is there any problem? [attention plot screenshots]

JohnHerry commented 3 years ago

@rafaelvalle What do the x-ticks and y-ticks mean in the attention plot? I see the attention channels are 640, while in my attention image above the x-ticks go up to 200 and the y-ticks up to 70; what do these mean?

I used config.json with n_texts=200; I saw that only a few samples have a text length over 160, so I removed the samples whose text length is greater than 160. But the attention picture is still not good.

[attention plot screenshot]

Is there any suggestion on how to use the attention plot to diagnose my problems? I think most of these problems are about preprocessing, though.
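
To pin down what the axes track, one option is to dump a single attention matrix during validation and compare its shape against the text length and the number of mel frames; a rough sketch, where the file name attn.npy and the way the matrix is saved are my own assumptions rather than anything in the repo:

import numpy as np
import matplotlib.pyplot as plt

# attn.npy is assumed to hold one attention matrix saved during validation,
# e.g. via np.save("attn.npy", attn[0].cpu().numpy()) inside the training loop.
attn = np.load("attn.npy")
print(attn.shape)  # one dimension should match the text length, the other the mel-frame count

plt.imshow(attn, aspect="auto", origin="lower", interpolation="none")
plt.xlabel(f"columns (attn.shape[1] = {attn.shape[1]})")
plt.ylabel(f"rows (attn.shape[0] = {attn.shape[0]})")
plt.colorbar()
plt.savefig("attention_check.png")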

JohnHerry commented 3 years ago

My corpus has multiple speakers, but my speaker ids are not consecutive integers. There are 142 different speakers, while the speaker ids range from 1 to 240; many speakers in between were removed due to low sample counts. Is this the reason for the bad attention?
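
If the gaps in the ids turn out to be the problem, I could remap them to consecutive integers 0..141 in the filelists before training; whether the gaps actually matter depends on how the data loader indexes speakers, so this is just a way to rule it out. A rough sketch, assuming the Flowtron-style wav_path|text|speaker_id filelist format (the file names are placeholders):

# Rewrite filelists so that speaker ids become consecutive integers 0..141.
# Assumes each line looks like: wav_path|text|speaker_id
def build_id_map(paths):
    ids = set()
    for p in paths:
        with open(p, encoding="utf-8") as f:
            for line in f:
                ids.add(line.rstrip("\n").rsplit("|", 1)[1])
    return {old: new for new, old in enumerate(sorted(ids, key=int))}

def remap(in_path, out_path, id_map):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            head, spk = line.rstrip("\n").rsplit("|", 1)
            fout.write(f"{head}|{id_map[spk]}\n")

if __name__ == "__main__":
    files = ["train_filelist.txt", "val_filelist.txt"]  # placeholder names
    id_map = build_id_map(files)
    for p in files:
        remap(p, p.replace(".txt", "_remapped.txt"), id_map)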