jayavanth opened this issue 6 years ago
Also, does anyone have the text? It seems wasteful to run Google Speech recognition on the dataset every time.
I am currently training on the LJ dataset (in single-speaker mode, but I think multi-speaker will work just fine).
Here is the result after 120k steps. (It's not that good for some words/phrases, but overall it's okay.) https://www.dropbox.com/s/we2sb4j1x056nkn/lj-test.wav?dl=0
However, due to some modifications made for the Korean dataset, several changes are required (especially in the tokenization part) to use it for English (or other) languages. I will make a pull request after I validate my modifications (e.g. by trying them in multi-speaker mode).
For the texts, you can retain your recognition.json file per dataset. After you finish generating your dataset (as described in section 2-2-5), you only need to run section 3 each time. If you need Korean data, you can check older repos for some of it (I could find moon, park and yuinna).
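To make the caching concrete, here is a minimal sketch of the idea (not code from this repo; `recognize_clip` is a hypothetical stand-in for whatever calls the Google Speech API): the transcripts are written to recognition.json once, and subsequent runs reuse the file instead of hitting the API again.

```python
import json
import os

def load_or_recognize(dataset_dir, audio_paths, recognize_clip):
    """Reuse recognition.json if present; otherwise run recognition once.

    `recognize_clip` is a hypothetical callable that sends one audio file
    to the speech API and returns its transcript.
    """
    cache_path = os.path.join(dataset_dir, "recognition.json")
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)  # {audio_path: transcript}

    results = {path: recognize_clip(path) for path in audio_paths}
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```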
Thanks! I didn't test this code on an English model, so any PR related to it is welcome.
Well, I trained until the 152k step and got a result better than keithito/tacotron's LJ pretrained model. It seems my modifications work (at least for single speaker).
My result: https://www.dropbox.com/s/tal53hyyqzbugog/lj-152k-test.wav?dl=0 From keithito/tacotron (tacotron-20170720): https://www.dropbox.com/s/xmemr6fei4ncwid/lj-keithito-pretrained.wav?dl=0
I will now try testing this in multi-speaker mode, and if it goes okay, I will open the PR as soon as possible (in a few days, I hope).
@engiecat The examples sound amazing! Looking forward to your PR.
I tried multi-speaker, but my settings seem to be a bit problematic, which caused the model to train really slowly (it only became remotely intelligible after 500k steps). I tried different hparams and it works well now.
However, this does not seem to be related to this PR, so don't worry. (I had to modify some hparams because of my hardware limitations.)
Anyways, sorry for the delay!
Does anyone have a trained English model for download?
@johnbie Here is the single-speaker model trained on the LJ dataset for 120k steps (shown in https://github.com/carpedm20/multi-speaker-tacotron-tensorflow/issues/6#issuecomment-356831556): https://www.dropbox.com/s/l65s3b0xdtde6as/LJ_120kstep_20180111.zip?dl=0
Enjoy!
@engiecat Is it possible to turn this model into a multi-speaker one? I'm trying it out with the file you shared, but I was unable to work it out. Also, thanks for the file.
@johnbie Nope, the hyperparameters are different. It is just for single speaker.
Is training a multi-speaker English model straightforward? I'm planning to use VCTK, LJ and/or Blizzard.
Yes, I think it is straightforward (though you have to change some hparams; they are commented inside). The hparams in the repo are for English single-speaker. I am currently training LJ together with a proprietary audiobook.
Cool. How are the results coming out?
Also, what do you have to change in the hyperparameters? Just the model type section?
@johnbie Well, it works with the DeepVoice2 VCTK dataset setting, but not that well (it is very metallic): https://www.dropbox.com/s/5l6iez5std8dqxa/lj-deepvoiceVCTK-114k.wav?dl=0
I changed the hparams to use batch_size=32 (from 16), as the previous setting had a terrible training speed. To change the language, you should set cleaners to 'english_cleaners' in the hparams (check out my repo for an example).
I will probably try the DeepVoice2 audiobook setting later; I am currently testing the VCTK setting with CMUdict.
Alignment seems okay, though.
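For reference, the two changes above boil down to something like this in hparams.py (a sketch; the surrounding `basic_params` dict is illustrative, but the key names match the hyperparameter dump posted later in this thread):

```python
# Sketch of the relevant hparams.py entries; the repo's defaults
# target Korean, so these are the values to change for English.
basic_params = {
    'cleaners': 'english_cleaners',  # text normalization for English input
    'batch_size': 32,                # raised from 16 for better training speed
    'model_type': 'deepvoice',       # multi-speaker; 'single' for single speaker
}
```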
I was training a multi-speaker model with the default dataset settings, batch size 32 and the English cleaner.
Is it possible to switch to different dataset settings midway?
@johnbie Yes; instead of hparams.py, you should modify the params.json file (e.g. for batch_size).
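To illustrate, resuming with a changed setting means editing the stored values in the log directory's params.json directly. A minimal fragment (only the fields being changed are shown; the full file mirrors the hyperparameter dump later in this thread):

```json
{
    "batch_size": 32,
    "cleaners": "english_cleaners"
}
```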
Naturally, if I switch the parameters from the single-speaker configuration to the Deep Voice audiobook configuration, I get an "Assign requires shapes of both tensors to match" error (lhs shape= [512,1025], rhs shape= [256,1025]). Would I have to retrain from the beginning for multi-speaker?
Also, how significant is switching from batch size 16 to 32, or from 32 to 64? How many iterations did it cut for you?
This is the result from using the single-speaker dataset settings, but with deepvoice and english_cleaners. Should I continue training this?
https://www.dropbox.com/s/33hwk6dk7gtscy2/result-for-step-127000.zip?dl=0
@johnbie
I don't think using the single-speaker setting with deepvoice will work; you should start hearing some words by around the 35,000 step with batch_size 32.
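As an aside, the "Assign requires shapes of both tensors to match" error comes from tf.train.Saver trying to restore a checkpoint variable into a differently-shaped tensor in the new graph. If you really want to carry weights across configurations, one generic TF1 workaround (a sketch, not something this repo does out of the box) is to restore only the variables whose shapes still match and let the rest retrain from scratch:

```python
import tensorflow as tf

def build_partial_saver(checkpoint_path):
    """Build a Saver over only those graph variables whose shapes match
    the checkpoint, so mismatched tensors (e.g. resized embeddings)
    keep their fresh initialization instead of crashing the restore."""
    reader = tf.train.NewCheckpointReader(checkpoint_path)
    ckpt_shapes = reader.get_variable_to_shape_map()
    compatible = [
        var for var in tf.global_variables()
        if var.op.name in ckpt_shapes
        and var.get_shape().as_list() == ckpt_shapes[var.op.name]
    ]
    return tf.train.Saver(var_list=compatible)
```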
How are the DeepVoice2 audiobook settings going? I'm currently testing them myself, and I have yet to confirm whether they're working or not.
This is a plot of the training I did with the DeepVoice2 settings. It has three voices, and training seems to be slow. I started vaguely hearing the first syllables from around 20k iterations, but there hasn't been a significant result yet. I'm using english_cleaners and batch size 32.
Does the training plot look okay?
https://www.dropbox.com/s/hrra5hkkgi7p0lm/Multispeech_3.0.zip?dl=0
@johnbie I successfully trained DeepVoice2 with the VCTK setting (batch_size 96 and 32, on a Tesla P40 (which is emptying my wallet rather quickly) and a GTX 1060 6GB respectively).
For the first training, I used two copies of the LJ Speech dataset, trained until the 114k step (batch size 32). Below is the result (very croaky): https://www.dropbox.com/s/5l6iez5std8dqxa/lj-deepvoiceVCTK-114k.wav?dl=0 I am now testing with cmudict-0.7b (batch size 96) and it works relatively better (less croaky): https://www.dropbox.com/s/qwsx05mm9jujjpw/lj-cmudict-41k.wav?dl=0
Then I tried to add another voice by continuing training from the former model (the one trained until the 114k step). This time, I used a GoT audiobook alongside LJ.
These are the results after training until the 93k step (alignment for LJ was restored by the 3.5k step, and the GoT audiobook voice was working from around the 35-40k step). The results are somewhat croaky :(. This one is the audiobook: https://www.dropbox.com/s/1ycwbu1eoaptloo/got-93k-test.wav?dl=0 And this one is LJ: https://www.dropbox.com/s/tbok5xouzeh0r8i/lj-got-93k-test.wav?dl=0
Additionally, I observe that the audiobook-specific pronunciations (such as "Daenerys" and "Arya") come out correctly, even with the LJ Speech voice. https://www.dropbox.com/s/3l4yo4te30dd4n2/got-93k-test2.wav?dl=0 https://www.dropbox.com/s/uc4gjsln6etpzgw/lj-got-93k-test2.wav?dl=0
PS: I checked out your zip file, but the training looks a little bit strange. Can you post your hparams.py? (Here is mine.) It is also possible that the audio files have inconsistent silence at the beginning. By the way, your voice seems less croaky; I am curious how you did it :)
For now, I am trying to use the LJ+LJ pretrained model (with CMUdict) to train with another dataset.
Here are my hyperparameters. As I mentioned, I trained with the Deep Voice audiobook configuration. I loaded three voices into it: the first is LJSpeech, and the other two are audio data of celebrities I selected. https://www.dropbox.com/s/b39pys80dei02mk/hparams.py?dl=0
And here's the latest model for it. It's still in its infancy, and I fear it's staying there too long. Maybe you can find what's wrong with it. It's likely because of the audio data, though. https://www.dropbox.com/s/hrra5hkkgi7p0lm/Multispeech_3.0.zip?dl=0
If possible, can I take a look at your model? Maybe I can find out why it's so croaky.
@johnbie Here is my model, trained on LJ and LJ with the VCTK hparams until step 114000. The test set sounds okay, but when I try it with my own sentences, the sound gets croaky. https://www.dropbox.com/s/cg2jstd9g7wi6ul/lj%2Blj-VCTK-noCMUDict-114000.zip?dl=0
By the look of your hparams, there seems to be no problem. I have never used the audiobook hparams, though (I am training one currently, but it is too early to say whether it works).
Dataset preparation may be an issue. In my case, bad alignment between sound and transcript definitely caused wrong audio-length predictions and made the model deteriorate after some stage of training. Also, #4 (in Korean, though) reported that the inclusion of non-voice elements (e.g. music) causes overall noise, which may explain the "croakiness".
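On the silence point, a quick way to normalize clips before generating the dataset is to trim leading and trailing silence. A minimal sketch using librosa and soundfile (not necessarily what this repo's preprocessing does):

```python
import librosa
import soundfile as sf

def trim_silence(in_path, out_path, top_db=30, sr=22050):
    """Trim leading/trailing silence so every clip starts near speech onset."""
    y, _ = librosa.load(in_path, sr=sr)
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, y_trimmed, sr)
```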
Could you send me the params file? I can't seem to get it to work...
@johnbie params - 복사본.txt ("복사본" means "copy")
This is the params.json file.
This is the error I got before I received the new params, and I still get it:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
@johnbie Can you post the full traceback? And can you compare my params.json with yours?
It doesn't work in the audiobook setup (batch size 16, reduction factor 5, meh). Here is the output:
['datasets/LJSpeech-1.1', 'datasets/LJSpeech-1.1']
========================================
[!] Detect non-krbook dataset. May need to set sampling rate from 22050 to 20000
========================================
[*] Checkpoint path: logs/LJ-LJ_multispeak\model.ckpt
[*] Loading training data from: ['datasets/LJSpeech-1.1\\data', 'datasets/LJSpeech-1.1\\data']
[*] Using model: logs/LJ-LJ_multispeak
Hyperparameters:
adam_beta1: 0.9
adam_beta2: 0.999
attention_size: 256
attention_state_size: 256
attention_type: bah_mon
batch_size: 32
cleaners: english_cleaners
dec_layer_num: 2
dec_prenet_sizes: [256, 128]
dec_rnn_size: 256
decay_learning_rate_mode: 1
dropout_prob: 0.8
embedding_size: 256
enc_bank_channel_size: 128
enc_bank_size: 16
enc_highway_depth: 4
enc_maxpool_width: 2
enc_prenet_sizes: [256, 128]
enc_proj_sizes: [128, 128]
enc_proj_width: 3
enc_rnn_size: 128
frame_length_ms: 50
frame_shift_ms: 12.5
griffin_lim_iters: 60
ignore_recognition_level: 1
initial_data_greedy: True
initial_learning_rate: 0.001
initial_phase_step: 8000
main_data: ['']
main_data_greedy_factor: 0
max_iters: 200
min_iters: 30
min_level_db: -100
min_tokens: 30
model_type: deepvoice
num_freq: 1025
num_mels: 80
post_bank_channel_size: 256
post_bank_size: 8
post_highway_depth: 4
post_maxpool_width: 2
post_proj_sizes: [256, 80]
post_proj_width: 3
post_rnn_size: 256
power: 1.5
preemphasis: 0.97
prioritize_loss: False
recognition_loss_coeff: 0.2
reduction_factor: 5
ref_level_db: 20
sample_rate: 22050
skip_inadequate: False
speaker_embedding_size: 16
use_fixed_test_inputs: False
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:20<00:00, 624.27it/s]
[datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
[datasets/LJSpeech-1.1\data] Max length: 810
[datasets/LJSpeech-1.1\data] Min length: 150
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:20<00:00, 625.20it/s]
[datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
[datasets/LJSpeech-1.1\data] Max length: 810
[datasets/LJSpeech-1.1\data] Min length: 150
========================================
{'datasets/LJSpeech-1.1\\data': 1.0}
========================================
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:21<00:00, 612.87it/s]
[datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
[datasets/LJSpeech-1.1\data] Max length: 810
[datasets/LJSpeech-1.1\data] Min length: 150
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:21<00:00, 621.50it/s]
[datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
[datasets/LJSpeech-1.1\data] Max length: 810
[datasets/LJSpeech-1.1\data] Min length: 150
========================================
{'datasets/LJSpeech-1.1\\data': 1.0}
========================================
Traceback (most recent call last):
File "train.py", line 336, in <module>
main()
File "train.py", line 332, in main
train(config.model_dir, config)
File "train.py", line 160, in train
is_randomly_initialized=is_randomly_initialized)
File "F:\Projects\TTS\multi-speaker-tacotron-tensorflow\models\tacotron.py", line 49, in initialize
speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py", line 294, in embedding_lookup
transform_fn=None)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py", line 120, in _embedding_lookup_and_transform
ids = ops.convert_to_tensor(ids, name="ids")
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 611, in convert_to_tensor
as_ref=False)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 676, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 121, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 364, in make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
After making some code changes, I get a different set of errors as well.
Caused by op 'save/Assign_36', defined at:
File "train.py", line 337, in <module>
main()
File "train.py", line 333, in main
train(config.model_dir, config)
File "train.py", line 184, in train
saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1140, in __init__
self.build()
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1172, in build
filename=self._filename)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 688, in build
restore_sequentially, reshape)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 419, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 274, in assign
validate_shape=validate_shape)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 43, in assign
use_locking=use_locking, name=name)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
[[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]
Traceback (most recent call last):
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
[[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 201, in train
saver.restore(sess, restore_path)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1560, in restore
{self.saver_def.filename_tensor_name: save_path})
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
[[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]
Caused by op 'save/Assign_36', defined at:
File "train.py", line 337, in <module>
main()
File "train.py", line 333, in main
train(config.model_dir, config)
File "train.py", line 184, in train
saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1140, in __init__
self.build()
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1172, in build
filename=self._filename)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 688, in build
restore_sequentially, reshape)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 419, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 274, in assign
validate_shape=validate_shape)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 43, in assign
use_locking=use_locking, name=name)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
[[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]
Hi @johnbie, thank you for posting your error. Did you figure it out? I faced the same error.
I figured it out: just change num_speaker=1 to 2.
Has anyone trained this with English datasets? Also, section 2.2 in the wiki is for generating text for the speech dataset, right?