bellfive opened 4 years ago
Sorry that I have not released the complete preprocessing, which is in fact consistent with Tacotron-2. All you need to do is run preprocess.py as you would for Tacotron-2 and then move the resulting training_data directory into this project.
As for a multi-speaker corpus, you can download several open corpora such as LJSpeech, TIMIT, and so on, put them under the directory that the --dataset-path option indicates, and name each speaker directory as the --training-anchor-dirs option in train_tacotron2.sh indicates. preprocess.py will then build the training_data directory and do the remaining follow-up jobs for you. Any question is welcome!
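The layout described above might be sketched as follows (a minimal sketch: the directory names are just the example corpora mentioned, and `corpus` stands in for whatever --dataset-path points at):

```python
# Sketch of the expected corpus layout before running preprocess.py.
# 'corpus' plays the role of --dataset-path; the anchor names are the
# ones you would list in --training-anchor-dirs in train_tacotron2.sh.
import os

dataset_path = 'corpus'
anchors = ['LJSpeech', 'TIMIT']   # illustrative; use your own speaker dirs
for anchor in anchors:
    os.makedirs(os.path.join(dataset_path, anchor), exist_ok=True)

print(sorted(os.listdir(dataset_path)))
```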
Thanks for your kind reply. I got the code from the Tacotron-2 repo and ran preprocess.py, but I have a problem: what is the '.trn' file in the build_from_path function?
```python
import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def build_from_path(hparams, input_dir, wav_dir, mel_dir, n_jobs=12, tqdm=lambda x: x):
    futures = []
    executor = ProcessPoolExecutor(max_workers=n_jobs)
    for root, _, files in os.walk(input_dir):
        for f in files:
            print("build from : " + str(f))
            if f.endswith('.trn'):
                trn_file = os.path.join(root, f)
                with open(trn_file) as trn:  # renamed to avoid shadowing the loop variable f
                    basename = trn_file[:-4]
                    wav_file = basename + '.wav'
                    basename = basename.split('/')[-1]
                    text = trn.readline().strip()
                futures.append(executor.submit(partial(
                    _process_utterance, wav_dir, mel_dir, basename, wav_file, text, hparams)))
    return [future.result() for future in tqdm(futures) if future.result() is not None]
```
I have only wav files. Looking at the history of this file, do I have to use .trn files?
The abbreviation trn stands for transcript: it holds Chinese Pinyin, or other transcript symbols for other languages. There are three kinds of files in the structure of my corpus: .wav for raw audio, .txt for text, and .trn for the transcription. Along with the mel spectrograms extracted as targets, the transcript file records the source text for training and will be added into train.txt as metadata to feed as inputs.
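If you only have .wav plus .txt pairs, generating the .trn files is straightforward. A minimal sketch, assuming each utterance ships as foo.wav + foo.txt with the transcript on the first line (the demo directory and file names below are made up):

```python
# Create a .trn next to every .txt, copying its first line, so that
# build_from_path() above can pick the utterances up.
import os

def write_trn_files(corpus_dir):
    """Write foo.trn for every foo.txt in corpus_dir."""
    for name in os.listdir(corpus_dir):
        if name.endswith('.txt'):
            base = name[:-4]
            with open(os.path.join(corpus_dir, name)) as src:
                text = src.readline().strip()
            with open(os.path.join(corpus_dir, base + '.trn'), 'w') as dst:
                dst.write(text + '\n')

# Tiny demo corpus:
os.makedirs('demo_corpus', exist_ok=True)
with open(os.path.join('demo_corpus', 'utt001.txt'), 'w') as f:
    f.write('an example transcript\n')
write_trn_files('demo_corpus')
```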
Thank you for your reply. I am training a Korean Tacotron model, but I don't think it is training well. Can you give me a hint on the hyperparameters, or any advice?
I am using my Chinese multi-speaker corpus with different genders, and the model has learned within 10 epochs; in my opinion 4~5 epochs are enough for alignment. The only modification you need to make is in script/train_tacotron2.sh. I am afraid you will need to work out the rest on your own resources, such as adapting the symbols to Korean and so on.
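Adapting the symbol inventory might look like the following minimal sketch (the variable names and the small jamo range are my own illustration, not this project's actual symbols module):

```python
# Extend the symbol inventory for a new language, e.g. Korean jamo.
# Only a handful of lead consonants are listed here for illustration;
# a real Korean front end would cover the full jamo (or grapheme) set.
_pad = '_'
_punct = "!'(),.:;? "
_jamo = [chr(c) for c in range(0x1100, 0x1113)]  # 19 lead consonants

symbols = [_pad] + list(_punct) + _jamo
symbol_to_id = {s: i for i, s in enumerate(symbols)}

print(len(symbols))
```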
By the way, if you suspect there is a problem with this project, you can try the TensorFlow multi-speaker repo https://github.com/begeekmyfriend/Tacotron-2. If that still fails, I am afraid the issue may well be on your side.
@begeekmyfriend Thank you very much ^^ I'll check my model and report the result.
@begeekmyfriend
I got a good result with the Korean Tacotron2 model. I think it depends on the dataset.
I have data from one female voice actor, 12,843 wav files (10 hrs), while each of my other datasets has only a few minutes, around 100 wav files.
The result of training on 3 datasets (the female voice actor plus 2 small datasets) is here,
and the result of training on only the female voice actor dataset is here.
So I have a few questions.
1. What is your dataset structure? I can see you use 4 datasets (xmly_fanfanli_22050, xmly_xiaoya_22050, xmly_jinhua_22050, xmly_qiuyixin_22050). Are all datasets of similar length, and how long are they?
2. A problem with adding speakers: I see the anchor feature enables multi-speaker training, but after training on 3 datasets and then adding 1 more, I can't reuse the trained model, because the input shape is
`[args.n_symbols * speaker_num, args.symbols_embedding_dim]`.
Do I have to retrain on all datasets whenever a new speaker is added?
Thank you for your kind and detailed answer.
The structure of my dataset directories is as follows. You can add new anchors via the --training-anchor-dirs option in script/train_tacotron2.sh.
```
tacotron2
└── training_data
    ├── xmly_fanfanli_22050
    │   ├── audio/*.npy
    │   ├── mels/*.npy
    │   └── train.txt
    ├── xmly_xiaoya_22050
    │   ├── audio/*.npy
    │   ├── mels/*.npy
    │   └── train.txt
    ├── xmly_jinhua_22050
    │   ├── audio/*.npy
    │   ├── mels/*.npy
    │   └── train.txt
    └── xmly_qiuyixin_22050
        ├── audio/*.npy
        ├── mels/*.npy
        └── train.txt
```
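A quick way to confirm a training_data tree matches that layout is a small checker like this (the helper name and the stand-in anchor names are mine, not the project's):

```python
# Verify each anchor directory has the audio/, mels/ and train.txt pieces.
import os

def check_training_data(root, anchors):
    """Return a list of (anchor, missing_piece) pairs; empty means OK."""
    missing = []
    for anchor in anchors:
        d = os.path.join(root, anchor)
        for sub in ('audio', 'mels'):
            if not os.path.isdir(os.path.join(d, sub)):
                missing.append((anchor, sub))
        if not os.path.isfile(os.path.join(d, 'train.txt')):
            missing.append((anchor, 'train.txt'))
    return missing

# Build a tiny stand-in layout and check it:
for anchor in ('spk_a', 'spk_b'):
    for sub in ('audio', 'mels'):
        os.makedirs(os.path.join('training_data', anchor, sub), exist_ok=True)
    open(os.path.join('training_data', anchor, 'train.txt'), 'w').close()

print(check_training_data('training_data', ['spk_a', 'spk_b']))
```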
As for the length of each clip, you can set the largest mel and text lengths in preprocessor.py. The numbers of clips per anchor are close to each other.
Each anchor has his or her own symbol embedding.
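The per-anchor symbol embedding can be pictured with a toy table of shape [n_symbols * speaker_num, symbols_embedding_dim], where each speaker owns a contiguous slice of rows (the lookup helper below is my illustration, not the project's code):

```python
# Toy per-speaker symbol-embedding table: speaker s owns rows
# [s * n_symbols, (s + 1) * n_symbols). Adding a speaker grows the
# first dimension, which is why a trained model cannot simply be
# reused with a larger speaker count.
n_symbols = 5        # toy symbol inventory size
embedding_dim = 3
n_speakers = 2

# Rows are filled with their own index so lookups are easy to inspect.
table = [[float(row)] * embedding_dim for row in range(n_symbols * n_speakers)]

def lookup(symbol_id, speaker_id):
    """Return the embedding row for a symbol as seen by one speaker."""
    return table[speaker_id * n_symbols + symbol_id]

# The same symbol maps to a different row for each speaker:
print(lookup(2, 0))  # row 2
print(lookup(2, 1))  # row 7
```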
Thank you for your kind explanation.
I want to know: how long is your dataset, and how many sentences are in each part? I would really appreciate even approximate figures.
I also wonder whether the datasets are all the same size.
What we want to know is whether we can train on one large dataset and then extend it by appending some smaller datasets (multi-speaker).
Thank you
The clip count of each anchor in my dataset is above 5,000, and the sizes vary (the smallest is only 5,000 and the largest is up to 9,000). Maybe I should try an ASR corpus with fewer clips per anchor to find out the least number of clips this model needs; I will do that later.
@bellfive Hi! Have you resolved your problem? Recently I have had a similar problem: a single-speaker dataset gets normal alignment, but multi-speaker datasets lead to abnormal alignment, as follows:
I read that you can't publish your multi-speaker corpus, so I want to train a multi-speaker model with my own corpus. How should I set up the dataset for multi-speaker training? Do I put everything under one training_data directory?
One more question: in the train_tacotron2.sh file, what do the directories like xmly_fanfanli_22050 and xmly_xiaoya_22050 contain? The speaker's data? Mel data? Or something else (e.g. speaker embedding output files)?
BR