begeekmyfriend / tacotron2

Forked from NVIDIA/tacotron2 and merged with Rayhane-mamah/Tacotron-2
BSD 3-Clause "New" or "Revised" License

How can I train a multi-speaker model? #3

Open bellfive opened 4 years ago

bellfive commented 4 years ago

I read that you can't publish your multi-speaker corpus.

So I want to train a multi-speaker model with my own corpus. How should I set up the dataset for multi-speaker training? Do I put everything into one training_data directory?

And one more question about the train_tacotron2.sh file:

xmly_fanfanli_22050 xmly_xiaoya_22050

Does each of these directories hold the speaker's data? Mel data? Or something else (e.g. a speaker embedding output file)?

BR

begeekmyfriend commented 4 years ago

Sorry, I have not released the complete preprocessing code, which is in fact consistent with Tacotron-2. What you need to do is run preprocess.py as in Tacotron-2 and then move the resulting training_data directory into this project.

As for a multi-speaker corpus, you can download several open corpora such as LJSpeech, TIMIT and so on, put them under the directory the --dataset-path option points to, and name each speaker directory as the --training-anchor-dirs option in train_tacotron2.sh indicates. preprocess.py will then build the training_data directory and do the remaining follow-up jobs for you. Any question is welcome!
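For illustration, here is a minimal sketch of how those two options could fit together. The flag names --dataset-path and --training-anchor-dirs come from this thread, while the parsing code, defaults, and directory handling below are assumptions rather than the repo's actual preprocess.py:

```python
# Hedged sketch, not the repo's actual preprocess.py: it only illustrates how
# --dataset-path and --training-anchor-dirs (named in this thread) relate.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-path', default='corpus',
                    help='root folder holding one sub-directory per speaker')
parser.add_argument('--training-anchor-dirs', nargs='+',
                    default=['xmly_fanfanli_22050', 'xmly_xiaoya_22050'],
                    help='speaker (anchor) directory names under --dataset-path')
args = parser.parse_args()

for anchor in args.training_anchor_dirs:
    src = os.path.join(args.dataset_path, anchor)   # raw corpus for one speaker
    dst = os.path.join('training_data', anchor)     # preprocessed output
    os.makedirs(os.path.join(dst, 'audio'), exist_ok=True)
    os.makedirs(os.path.join(dst, 'mels'), exist_ok=True)
    # ... extract a mel spectrogram per utterance and append a metadata
    # line to dst/train.txt, as preprocess.py is described to do above ...
```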

bellfive commented 4 years ago

Thanks for your kind reply. I got the code from the Tacotron-2 repo and ran preprocess.py.

But I have a problem: what is the '.trn' file in the build_from_path function?

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import os

def build_from_path(hparams, input_dir, wav_dir, mel_dir, n_jobs=12, tqdm=lambda x: x):
    futures = []
    executor = ProcessPoolExecutor(max_workers=n_jobs)

    # Walk the corpus and schedule one preprocessing job per .trn transcript.
    for root, _, files in os.walk(input_dir):
        for f in files:
            print("build from : " + str(f))
            if f.endswith('.trn'):
                trn_file = os.path.join(root, f)
                with open(trn_file) as f:
                    basename = trn_file[:-4]            # strip the '.trn' extension
                    wav_file = basename + '.wav'        # matching audio file
                    basename = basename.split('/')[-1]  # keep only the utterance id
                    text = f.readline().strip()         # first line holds the transcript
                    futures.append(executor.submit(partial(
                        _process_utterance, wav_dir, mel_dir, basename, wav_file, text, hparams)))

    return [future.result() for future in tqdm(futures) if future.result() is not None]
```

I only have wav files.

I found this older version in the file's history. Do I have to use this file?

https://github.com/begeekmyfriend/Tacotron-2/blob/54593a02b73eb36ea0275184bef567e56d0a1b27/datasets/preprocessor.py

begeekmyfriend commented 4 years ago

The abbreviation trn stands for transcript: it holds Chinese Pinyin, or the corresponding transcript symbols for other languages. There are three kinds of files in my corpus structure: wav for raw audio, txt for text, and trn for transcription. While the mel spectrograms are extracted as targets, the transcript file records the source text for training and is added into train.txt as metadata to be fed as input.
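For example, a raw speaker directory before preprocessing might look like this (the file names here are hypothetical, just to illustrate the wav/txt/trn triplets):

speaker_22050
├── 000001.wav   (raw audio)
├── 000001.txt   (plain text)
├── 000001.trn   (transcript, e.g. Pinyin, read by build_from_path)
├── 000002.wav
├── 000002.txt
└── 000002.trn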

bellfive commented 4 years ago

Thank you for your reply. I am training a Korean Tacotron model.

I think the model is not training well.

[alignment plot]

Can you give me a hint about the hyperparameters, or any advice?

begeekmyfriend commented 4 years ago

I am using my own Chinese multi-speaker corpus with speakers of different genders, and the model learned within 10 epochs; in my opinion, 4~5 epochs are enough for alignment. The only modification you need to make is in script/train_tacotron2.sh. I am afraid you will have to find your own way with your own resources, such as adapting the symbols for Korean and so on.

[alignment plot: align_0010_7190]
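As a starting point for adapting the symbols, here is a minimal sketch of a Korean symbol inventory; the layout mimics the text/symbols.py convention common in Tacotron implementations, but the names and ranges below are my assumptions, not this repo's actual code:

```python
# Hedged sketch of a Korean symbol set (an assumption, not this repo's code).
# Decomposing Hangul syllables into jamo keeps the inventory small.
_pad = '_'
_eos = '~'
_leads  = [chr(c) for c in range(0x1100, 0x1113)]  # 19 leading consonants
_vowels = [chr(c) for c in range(0x1161, 0x1176)]  # 21 vowels
_tails  = [chr(c) for c in range(0x11A8, 0x11C3)]  # 27 trailing consonants
symbols = [_pad, _eos] + _leads + _vowels + _tails
```

Input text would then need to be decomposed into these jamo before being mapped to symbol ids.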

begeekmyfriend commented 4 years ago

By the way, if you suspect there is a problem in this project, you can try the TensorFlow multi-speaker repo https://github.com/begeekmyfriend/Tacotron-2. If it still fails, I am afraid the issue may well be on your side.

bellfive commented 4 years ago

@begeekmyfriend Thank you very much ^^ I'll check my model and report the result.

bellfive commented 4 years ago

@begeekmyfriend

I got a good result with the Korean Tacotron2 model. I think it depends on the dataset.

I have one female voice actor's dataset with 12,843 wav files (10 hours), while the other datasets are only a few minutes each, about 100 wav files.

The result of training on all 3 datasets (the female voice actor plus the 2 small ones) is here:

[alignment plot: trained on all 3 datasets]

And the result of training on only one dataset (the female voice actor) is here:

[alignment plot: trained on the female voice actor only]

So I have a few questions.

1. What is your dataset structure? I can see you use 4 datasets (xmly_fanfanli_22050, xmly_xiaoya_22050, xmly_jinhua_22050, xmly_qiuyixin_22050).

Are all the datasets of similar length, and how long is each one?

2. Adding a new speaker: I see the anchor feature provides the multi-speaker function, but after training on 3 datasets and then adding 1 more dataset, I can't reuse the trained model, because the input embedding shape is

[args.n_symbols * speaker_num, args.symbols_embedding_dim]

Do I have to retrain on all datasets whenever a new speaker is added?

Thank you for your kind and detailed answer.

begeekmyfriend commented 4 years ago

The structure of my dataset directories is as follows, and you can add new anchors to the --training-anchor-dirs option in script/train_tacotron2.sh.

tacotron2
└── training_data
     ├── xmly_fanfanli_22050
     │     ├── audio/*.npy
     │     ├── mels/*.npy
     │     └── train.txt
     ├── xmly_xiaoya_22050
     │     ├── audio/*.npy
     │     ├── mels/*.npy
     │     └── train.txt
     ├── xmly_jinhua_22050
     │     ├── audio/*.npy
     │     ├── mels/*.npy
     │     └── train.txt
     └── xmly_qiuyixin_22050
            ├── audio/*.npy
            ├── mels/*.npy
            └── train.txt

begeekmyfriend commented 4 years ago

As for the length of each clip, you can set the maximum mel and text lengths in preprocessor.py. The numbers of clips per anchor are close to each other.
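For instance, a length filter could take the following shape; the limit names and values are illustrative assumptions, not necessarily what this repo's preprocessor.py uses:

```python
# Hedged sketch of clip-length filtering (names and values are assumptions).
MAX_MEL_FRAMES = 1000   # drop utterances whose mel spectrogram is longer
MAX_TEXT_LENGTH = 300   # drop utterances whose transcript is longer

def keep_utterance(mel_frames: int, text: str) -> bool:
    # Skip clips that exceed either limit so batch sizes stay manageable.
    return mel_frames <= MAX_MEL_FRAMES and len(text) <= MAX_TEXT_LENGTH
```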

Each anchor has his or her own symbol embedding.
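The shape [args.n_symbols * speaker_num, args.symbols_embedding_dim] quoted above suggests an indexing scheme like the following sketch; this is a guess at the mechanism, not this repo's actual model code:

```python
# Hedged sketch of a per-anchor symbol embedding (an assumption, not this
# repo's code): each speaker owns a contiguous block of n_symbols rows.
import torch
import torch.nn as nn

n_symbols, n_speakers, embedding_dim = 80, 4, 512  # illustrative values
embedding = nn.Embedding(n_symbols * n_speakers, embedding_dim)

def embed(text_ids: torch.LongTensor, speaker_id: int) -> torch.Tensor:
    # Offsetting by speaker_id * n_symbols selects that speaker's block, which
    # is why adding a speaker changes the table shape and forces retraining.
    return embedding(text_ids + speaker_id * n_symbols)
```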

bellfive commented 4 years ago

Thank you for your kind explanation.

I want to know: how long is your dataset, and how many sentences does each speaker's data contain?

I would really appreciate it if you could share approximate figures, even if they are not exact.

And I wonder whether the datasets are all the same size.

What we want to know is whether we can have one large dataset and then train by appending some smaller datasets (multi-speaker).

Thank you

begeekmyfriend commented 4 years ago

Each anchor in my dataset has at least 5,000 clips, and the sizes vary (the smallest is only 5,000 and the largest is up to 9,000). Maybe I should try an ASR corpus with fewer clips per anchor to find out the minimum clip count this model needs. I will do that later.

begeekmyfriend commented 4 years ago

My fault: https://github.com/begeekmyfriend/tacotron2/issues/5

kkokdari commented 4 years ago

@bellfive Hi! Have you resolved your problem? Recently I have had a similar problem: a single-speaker dataset gets normal alignment, but multi-speaker datasets lead to abnormal alignment, as follows:

[alignment plot]