jayavanth commented 6 years ago

Has anyone trained this with english datasets? Also, section 2.2 in the wiki is for generating text for the speech dataset, right?

jayavanth commented 6 years ago

Also, does anyone have the text? It seems wasteful to run Google speech on the dataset every time.

engiecat commented 6 years ago

I am currently training on the LJ dataset (in single-speaker mode, but I think multi-speaker will work just fine).

Here is the result after 120k steps. (It's not that good for some words/phrases, but overall it's okay.) https://www.dropbox.com/s/we2sb4j1x056nkn/lj-test.wav?dl=0

However, because of some modifications made for the Korean dataset, several changes are required (especially in the tokenization part) to use it for English (or other) languages. I will make a pull request after I validate my modifications (e.g. by trying them in multi-speaker mode).

For texts, you can retain your recognition.json file per dataset. After you have finished generating your dataset (mentioned in section 2-2-5), you just need to run section 3 every time. If you need Korean data, you can check older repos for some of the data. (I could find moon, park and yuinna.)

carpedm20 commented 6 years ago

Thanks! I didn't test this code on an English model, so any PR related to it is welcome.

engiecat commented 6 years ago

Well, I trained until step 152k and got a result better than keithito/tacotron's LJ pretrained model. It seems my modifications work (at least for single speaker).

My result: https://www.dropbox.com/s/tal53hyyqzbugog/lj-152k-test.wav?dl=0
From keithito/tacotron (tacotron-20170720): https://www.dropbox.com/s/xmemr6fei4ncwid/lj-keithito-pretrained.wav?dl=0

I will now try testing this in multi-speaker mode, and if it goes okay, I will open a PR as soon as possible (in a few days, I hope).

jayavanth commented 6 years ago

@engiecat The examples sound amazing! Looking forward to your PR.

engiecat commented 6 years ago

PR #7 made for English datasets.

~~Tried on multispeaker, but my setting seems a little bit problematic, which caused the model to train really slowly. (It became remotely intelligible after 500k steps.)~~ Tried another set of hparams and it works well.

However, this does not seem to be related to this PR, so don't worry. (I had to modify some hparams because of my H/W limitations.)

Anyways, sorry for the delay!

johnbie commented 6 years ago

Does anyone have a trained English model for download?

engiecat commented 6 years ago

@johnbie Here is a single-speaker model trained on the LJ dataset for 120k steps (shown in https://github.com/carpedm20/multi-speaker-tacotron-tensorflow/issues/6#issuecomment-356831556): https://www.dropbox.com/s/l65s3b0xdtde6as/LJ_120kstep_20180111.zip?dl=0

Enjoy!

johnbie commented 6 years ago

@engiecat Is it possible to turn this model into a multi-speaker model? I'm trying it out on the file you shared, but was unable to work it out. Also, thanks for the file.

engiecat commented 6 years ago

@johnbie Nope. The hyperparameters are different. It is just for single speaker.

jayavanth commented 6 years ago

Is training a multi-speaker English model straightforward? I'm planning to use VCTK, LJ and/or Blizzard.

engiecat commented 6 years ago

Yes, I think it is straightforward (though you have to change some hparams, but that is commented inside). The hparams in the repo are for English single-speaker. I am currently training LJ with a proprietary audiobook.

johnbie commented 6 years ago

Cool. How are the results coming out?

johnbie commented 6 years ago

Also, what do you have to change in the hyperparameters? Just the model type section?

engiecat commented 6 years ago

@johnbie Well, it works with the DeepVoice2 VCTK dataset setting, but not that well (it sounds very metallic): https://www.dropbox.com/s/5l6iez5std8dqxa/lj-deepvoiceVCTK-114k.wav?dl=0

I changed the hparams to use batch_size=32 (from 16), as the previous setting had terrible training speed. To change the language, you should set cleaners to 'english_cleaners' in the hparams. (Check out my repo for an example.)

engiecat commented 6 years ago

I will probably try the Deepvoice2 audiobook setting later; I am currently testing the VCTK setting with CMU_dict.

engiecat commented 6 years ago

Alignment seems okay though.

johnbie commented 6 years ago

I was training a multi-speaker model with the default dataset settings. I had batch size 32 and english_cleaners.

Is it possible to switch to different dataset settings midway?

engiecat commented 6 years ago

@johnbie Yes. Instead of hparams.py, you should edit the params.json file to change it (e.g. batch_size).
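A minimal sketch of that edit, for anyone who prefers to script it. It assumes params.json is a flat JSON object of hyperparameters saved next to the checkpoint; the helper name `update_params` and the example path are hypothetical, not part of the repo.

```python
import json
from pathlib import Path


def update_params(params_path, overrides):
    """Rewrite selected keys of a params.json-style file and return the new dict."""
    path = Path(params_path)
    params = json.loads(path.read_text(encoding="utf-8"))

    # Refuse keys the saved run never had, so a typo is not silently added.
    unknown = sorted(set(overrides) - set(params))
    if unknown:
        raise KeyError("not present in saved params: {}".format(unknown))

    params.update(overrides)
    path.write_text(json.dumps(params, indent=4, sort_keys=True), encoding="utf-8")
    return params
```

Usage would look like `update_params("logs/LJ-LJ_multispeak/params.json", {"batch_size": 32})`. This only makes sense for values that do not change tensor shapes, such as batch_size; changing layer sizes or the model type on an existing checkpoint leads to the shape-mismatch errors discussed in this thread.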

johnbie commented 6 years ago

Naturally, if I switch parameters from the single-speaker configuration to the Deep Voice audiobook configuration, I get an "Assign requires shapes of both tensors to match" error. Would I have to retrain from the beginning for multi-speaker?

lhs shape= [512,1025] rhs shape= [256,1025]

johnbie commented 6 years ago

Also, how significant is it switching from batch size 16 to 32, and 32 to 64? How many iterations did you cut from it?

johnbie commented 6 years ago

This is the result from using single-speaker dataset settings, but with deepvoice and english_cleaners. Should I continue training this?

https://www.dropbox.com/s/33hwk6dk7gtscy2/result-for-step-127000.zip?dl=0

engiecat commented 6 years ago

@johnbie

See my comment in #16. Single-speaker and multi-speaker configurations are not compatible with each other, and neither are multi-speaker configurations with different numbers of speakers. (The dimensions of the tensors are different.)
For batch size, 16->32 caused a significant decrease in the number of iterations (about 30% of the previously required iterations). I didn't try 64; trying 128 did work (on a Tesla P40), but the batch preparation time became very long (more than 100 sec).

I don't think that using the single-speaker setting with deepvoice will work. You should hear some words by step 35000 with batch_size 32.
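To see which variables block a restore, it can help to compare the shapes stored in the checkpoint with the shapes the current hparams would build. The sketch below is plain Python over two name-to-shape dictionaries; how those dictionaries are collected is left out, the variable names are hypothetical, and the mismatching shapes are the ones from the error message quoted above.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return (name, checkpoint_shape, model_shape) for variables that cannot be restored as-is."""
    mismatches = []
    for name, model_shape in sorted(model_shapes.items()):
        saved_shape = checkpoint_shapes.get(name)
        if saved_shape is None:
            # The variable exists in the new graph but not in the checkpoint.
            mismatches.append((name, None, tuple(model_shape)))
        elif tuple(saved_shape) != tuple(model_shape):
            mismatches.append((name, tuple(saved_shape), tuple(model_shape)))
    return mismatches


# Hypothetical variable names; [256, 1025] vs [512, 1025] comes from the error above.
checkpoint_shapes = {"post_proj/kernel": (256, 1025), "encoder/kernel": (128, 128)}
model_shapes = {"post_proj/kernel": (512, 1025), "encoder/kernel": (128, 128)}

for name, saved, wanted in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print("{}: checkpoint {} vs model {}".format(name, saved, wanted))
```

Any entry in the result means the checkpoint was trained with different hparams (model type, layer sizes, or number of speakers), so training has to restart from scratch or the hparams have to be reverted.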

johnbie commented 6 years ago

How are the Deepvoice2 audio settings going? I'm currently testing them myself, and I have yet to confirm whether they work or not.

johnbie commented 6 years ago

This is a plot of the training I did with Deepvoice2 settings. It has three voices, and training seems to be slow. I started vaguely hearing the first syllables from around 20k iterations, but there hasn't been a significant result yet. I'm using english_cleaners and batch size 32.

Does the plot of the training seem okay?

https://www.dropbox.com/s/hrra5hkkgi7p0lm/Multispeech_3.0.zip?dl=0

engiecat commented 6 years ago

@johnbie I successfully trained deepvoice2 with the VCTK setting (batch_size 96 and 32, on a Tesla P40, which is emptying my wallet rather quickly, and a GTX1060 6GB, respectively).

For the first training, I used two copies of the LJ Speech dataset until step 114k (batch size 32). Below is the result (very croaky): https://www.dropbox.com/s/5l6iez5std8dqxa/lj-deepvoiceVCTK-114k.wav?dl=0 I am now testing with cmudict-0.7b (with batch size 96) and it works relatively better (less croaky): https://www.dropbox.com/s/qwsx05mm9jujjpw/lj-cmudict-41k.wav?dl=0

Then, I tried to add another voice by continuing to train the former model (the one trained until step 114k). This time, I used:

LJSpeech-1.0 Dataset
Commercial audiobook dataset, very naively homebrewed (Game of Thrones by GRRM, read by Roy Dotrice (Audible)). Severe misalignment is present; I am trying to fix it. About 8.41 hrs of data were actually used for training.

These are the results after training until step 93k. (Alignment for LJ was restored at step 3.5k, and the GoT audiobook voice was working from step 35~40k.) The results are somewhat croaky :(. This one is the audiobook: https://www.dropbox.com/s/1ycwbu1eoaptloo/got-93k-test.wav?dl=0 And this one is LJ: https://www.dropbox.com/s/tbok5xouzeh0r8i/lj-got-93k-test.wav?dl=0

Additionally, I do observe that the audiobook-specific pronunciations (such as "Daenerys", "Arya") are corrected, even for the LJ Speech voice. https://www.dropbox.com/s/3l4yo4te30dd4n2/got-93k-test2.wav?dl=0 https://www.dropbox.com/s/uc4gjsln6etpzgw/lj-got-93k-test2.wav?dl=0

PS. I checked out your zip file, and it seems that the training gets a little bit strange. Can you post the hparams.py? (Here is mine) Or it may be that the audio files have inconsistent silence at the beginning. Btw, your voice seems less croaky; I am curious how you did it :)
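One way to rule out the silence issue is to trim the leading silence from every clip before building the dataset. Below is a minimal amplitude-threshold sketch over a plain list of samples in the range -1.0 to 1.0. It is not the repo's preprocessing; the function name, the threshold value and the `keep` parameter are assumptions for illustration only.

```python
def trim_leading_silence(samples, threshold=0.01, keep=0):
    """Drop samples before the first one whose absolute value exceeds threshold.

    keep is the number of quiet samples to retain just before the first loud one.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return samples[max(0, index - keep):]
    # Nothing exceeded the threshold: the clip is treated as entirely silent.
    return samples[:0]
```

A fixed amplitude threshold is crude; noisy recordings usually need a higher threshold or an energy measure over short windows, so the value should be checked by listening to a few trimmed clips.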

engiecat commented 6 years ago

For now, I am trying to use the LJ+LJ pretrained model (with cmudict) to train with another dataset.

johnbie commented 6 years ago

Here are my hyperparameters. As I mentioned, I trained with the Deep Voice audiobook configuration. I loaded three voices into it: the first is LJSpeech, and the other two are audio data of my selected celebrities. https://www.dropbox.com/s/b39pys80dei02mk/hparams.py?dl=0

johnbie commented 6 years ago

And here's the latest model for it. It's still in its infantile stage, and I fear that it's staying there too long. Maybe you can find what's wrong with it. It's likely because of the audio data though. https://www.dropbox.com/s/hrra5hkkgi7p0lm/Multispeech_3.0.zip?dl=0

If possible, can I take a look at your model? Maybe I can find out why it's so croaky.

engiecat commented 6 years ago

@johnbie Here is my model, trained with LJ and LJ, with VCTK hparams till 114000. The testing set seems okay, but when I test it with my own words, the sounds get croaky. https://www.dropbox.com/s/cg2jstd9g7wi6ul/lj%2Blj-VCTK-noCMUDict-114000.zip?dl=0

And by the look of the hparams, there seems to be no problem. I have never used the audiobook hparams, though. (I am training one currently, but it is too early to say whether it works.)

Dataset preparation may be an issue. In my case, bad alignment between sound and transcript definitely caused wrong audio length prediction, and caused the model to deteriorate after some stage of training. Also, #4 (in Korean, though) reported that including non-voice elements (e.g. music) causes overall noise, which may explain the 'croakiness'.

johnbie commented 6 years ago

Could you send me the params file? I can't seem to get it to work...

engiecat commented 6 years ago

@johnbie params - 복사본.txt

engiecat commented 6 years ago

this is params.json file

johnbie commented 6 years ago

This is the error I got before I got the new params, and I still get it...

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]

engiecat commented 6 years ago

@johnbie Can you post the full traceback? And can you compare my params.json with yours?

engiecat commented 6 years ago

Doesn't work in audiobook setup. (batch size 16, reduction factor 5 meh.)

johnbie commented 6 years ago

['datasets/LJSpeech-1.1', 'datasets/LJSpeech-1.1']
========================================
 [!] Detect non-krbook dataset. May need to set sampling rate from 22050 to 20000
========================================

 [*] Checkpoint path: logs/LJ-LJ_multispeak\model.ckpt
 [*] Loading training data from: ['datasets/LJSpeech-1.1\\data', 'datasets/LJSpeech-1.1\\data']
 [*] Using model: logs/LJ-LJ_multispeak
Hyperparameters:
    adam_beta1: 0.9
    adam_beta2: 0.999
    attention_size: 256
    attention_state_size: 256
    attention_type: bah_mon
    batch_size: 32
    cleaners: english_cleaners
    dec_layer_num: 2
    dec_prenet_sizes: [256, 128]
    dec_rnn_size: 256
    decay_learning_rate_mode: 1
    dropout_prob: 0.8
    embedding_size: 256
    enc_bank_channel_size: 128
    enc_bank_size: 16
    enc_highway_depth: 4
    enc_maxpool_width: 2
    enc_prenet_sizes: [256, 128]
    enc_proj_sizes: [128, 128]
    enc_proj_width: 3
    enc_rnn_size: 128
    frame_length_ms: 50
    frame_shift_ms: 12.5
    griffin_lim_iters: 60
    ignore_recognition_level: 1
    initial_data_greedy: True
    initial_learning_rate: 0.001
    initial_phase_step: 8000
    main_data: ['']
    main_data_greedy_factor: 0
    max_iters: 200
    min_iters: 30
    min_level_db: -100
    min_tokens: 30
    model_type: deepvoice
    num_freq: 1025
    num_mels: 80
    post_bank_channel_size: 256
    post_bank_size: 8
    post_highway_depth: 4
    post_maxpool_width: 2
    post_proj_sizes: [256, 80]
    post_proj_width: 3
    post_rnn_size: 256
    power: 1.5
    preemphasis: 0.97
    prioritize_loss: False
    recognition_loss_coeff: 0.2
    reduction_factor: 5
    ref_level_db: 20
    sample_rate: 22050
    skip_inadequate: False
    speaker_embedding_size: 16
    use_fixed_test_inputs: False
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:20<00:00, 624.27it/s]
 [datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
 [datasets/LJSpeech-1.1\data] Max length: 810
 [datasets/LJSpeech-1.1\data] Min length: 150
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:20<00:00, 625.20it/s]
 [datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
 [datasets/LJSpeech-1.1\data] Max length: 810
 [datasets/LJSpeech-1.1\data] Min length: 150
========================================
{'datasets/LJSpeech-1.1\\data': 1.0}
========================================
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:21<00:00, 612.87it/s]
 [datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
 [datasets/LJSpeech-1.1\data] Max length: 810
 [datasets/LJSpeech-1.1\data] Min length: 150
filter_by_min_max_frame_batch: 100%|########################################################################################################################################| 13100/13100 [00:21<00:00, 621.50it/s]
 [datasets/LJSpeech-1.1\data] Loaded metadata for 12829 examples (23.87 hours)
 [datasets/LJSpeech-1.1\data] Max length: 810
 [datasets/LJSpeech-1.1\data] Min length: 150
========================================
{'datasets/LJSpeech-1.1\\data': 1.0}
========================================
Traceback (most recent call last):
  File "train.py", line 336, in <module>
    main()
  File "train.py", line 332, in main
    train(config.model_dir, config)
  File "train.py", line 160, in train
    is_randomly_initialized=is_randomly_initialized)
  File "F:\Projects\TTS\multi-speaker-tacotron-tensorflow\models\tacotron.py", line 49, in initialize
    speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py", line 294, in embedding_lookup
    transform_fn=None)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py", line 120, in _embedding_lookup_and_transform
    ids = ops.convert_to_tensor(ids, name="ids")
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 611, in convert_to_tensor
    as_ref=False)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 676, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 121, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 364, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

johnbie commented 6 years ago

After making some code changes, I get a different set of errors as well.

Caused by op 'save/Assign_36', defined at:
  File "train.py", line 337, in <module>
    main()
  File "train.py", line 333, in main
    train(config.model_dir, config)
  File "train.py", line 184, in train
    saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1140, in __init__
    self.build()
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1172, in build
    filename=self._filename)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 688, in build
    restore_sequentially, reshape)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 274, in assign
    validate_shape=validate_shape)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 43, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
         [[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]

Traceback (most recent call last):
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
    return fn(*args)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
    status, run_metadata)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
         [[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 201, in train
    saver.restore(sess, restore_path)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
    options, run_metadata)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
         [[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]

Caused by op 'save/Assign_36', defined at:
  File "train.py", line 337, in <module>
    main()
  File "train.py", line 333, in main
    train(config.model_dir, config)
  File "train.py", line 184, in train
    saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1140, in __init__
    self.build()
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1172, in build
    filename=self._filename)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 688, in build
    restore_sequentially, reshape)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\state_ops.py", line 274, in assign
    validate_shape=validate_shape)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 43, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "C:\Users\mwc2018\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [80,256] rhs shape= [65,256]
         [[Node: save/Assign_36 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/embedding"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](model/inference/embedding, save/RestoreV2_36/_1343)]]

quhb2455 commented 3 years ago

Hi @johnbie, thank you for posting your error. Did you figure it out? I faced the same error.

quhb2455 commented 3 years ago

I figured it out.

Just change num_speaker=1 to 2.

carpedm20 / multi-speaker-tacotron-tensorflow

Alignment and English datasets #6
