TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Feature Request Thread #467

Closed dathudeptrai closed 2 years ago

dathudeptrai commented 3 years ago

Don't hesitate to tell me what features you want in this repo :)))

unparalleled-ysj commented 3 years ago

@dathudeptrai What do you think of voice cloning?

mikus commented 3 years ago

I would like to see better componentization. There are similar blocks (groups of layers) implemented multiple times, like positional encoding, speaker encoding, or the postnet. Others rely on configuration specific to one particular network, like the self-attention block used in FastSpeech. With a little rework to make those blocks more generic, it would be easier to create new network types. The same goes for losses, e.g. the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN. Moreover, most of the training and inference scripts look quite similar, and I believe they could be refactored too so that, once again, the final solution is composed from more generic components.

And BTW, I really appreciate your work and think you did a great job! :)

dathudeptrai commented 3 years ago

the HiFi-GAN training contains a lot of code duplicated from MB-MelGAN

Hmm, in that case users just need to read and understand HiFi-GAN without reading MB-MelGAN.

ZDisket commented 3 years ago

@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

unparalleled-ysj commented 3 years ago

@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?

For example, given a short segment of the target speaker's voice, the model can synthesize speech in that speaker's timbre without being retrained, e.g. by using voiceprint technology to extract a speaker embedding and training a multi-speaker TTS model with it.

ZDisket commented 3 years ago

@unparalleled-ysj That's what I was thinking of. Relatedly, @dathudeptrai, I saw https://github.com/dipjyoti92/SC-WaveRNN; could an SC-MB-MelGAN be possible?

luan78zaoha commented 3 years ago

@unparalleled-ysj @ZDisket That is also what I'm working on. I'm trying to train a multi-speaker FastSpeech2 model, replacing the current hard-coded speaker ID with a bottleneck feature extracted by a voiceprint model. The continuous, soft-coded bottleneck feature represents a speaker-related space; if an unknown voice is similar to a voice in the training space, voice cloning may be realized. But judging from the results of current open-source projects, it is a difficult problem and certainly not as simple as I described. Do you have any good ideas?
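
For illustration, here is a minimal sketch (not TensorFlowTTS's existing multi-speaker code) of the conditioning idea: project the continuous voiceprint/bottleneck embedding to the encoder's hidden size and broadcast it over the encoder outputs, in place of a hard-coded speaker-ID embedding lookup. The dimensions and the `condition_encoder_outputs` helper are hypothetical.

```python
import tensorflow as tf

# hypothetical sizes: 256-dim encoder hidden state, 192-dim voiceprint embedding
speaker_projection = tf.keras.layers.Dense(256, name="speaker_projection")

def condition_encoder_outputs(encoder_outputs, speaker_embedding):
    """Broadcast a continuous speaker embedding onto every encoder frame,
    replacing a hard-coded speaker-ID embedding lookup."""
    spk = speaker_projection(speaker_embedding)   # [batch, 256]
    spk = tf.expand_dims(spk, axis=1)             # [batch, 1, 256]
    return encoder_outputs + spk                  # broadcasts over the time axis

# toy shapes only, to show the broadcasting
enc = tf.random.normal([2, 100, 256])   # fake encoder outputs [batch, time, dim]
spk = tf.random.normal([2, 192])        # fake voiceprint / x-vector embedding
print(condition_encoder_outputs(enc, spk).shape)  # (2, 100, 256)
```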

mikus commented 3 years ago

One possible option for better support of multiple speakers or styles would be to add a Variational Auto-Encoder (VAE) that automatically extracts this voice/style "fingerprint".
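
A rough sketch of what such a variational reference encoder could look like (layer sizes are illustrative and not taken from this repo): encode a reference mel spectrogram to a mean and log-variance, sample a latent style vector with the reparameterization trick, and add the KL term to the TTS loss.

```python
import tensorflow as tf

class ReferenceVAE(tf.keras.Model):
    """Illustrative variational reference encoder: mel spectrogram -> latent style vector."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])
        self.to_mean = tf.keras.layers.Dense(latent_dim)
        self.to_logvar = tf.keras.layers.Dense(latent_dim)

    def call(self, mel):  # mel: [batch, frames, n_mels]
        h = self.encoder(mel)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * logvar) * eps  # reparameterization trick
        kl = -0.5 * tf.reduce_mean(1.0 + logvar - tf.square(mean) - tf.exp(logvar))
        return z, kl  # z conditions the decoder; kl is added to the training loss

vae = ReferenceVAE()
z, kl = vae(tf.random.normal([2, 200, 80]))  # fake reference mels
print(z.shape, float(kl))
```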

abylouw commented 3 years ago

LightSpeech https://arxiv.org/abs/2102.04040

nmfisher commented 3 years ago

@abylouw early version of LightSpeech here https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech

Training pretty well on a Mandarin dataset so far (~30k steps) but haven't validated formally against LJSpeech (to be honest, I don't think I'll get time, so would prefer someone else to help out).

This is just the final architecture mentioned in the paper (so I haven't implemented any NAS).

Also the paper only mentioned the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too.

Some samples @ 170k (decoded with pre-trained MB-MelGan):

https://github.com/nmfisher/lightspeech_samples/tree/main/v1_170k

Noticeably worse quality than FastSpeech 2 at the same number of training steps, and it's falling apart on longer sequences.

dathudeptrai commented 3 years ago

@abylouw early version of LightSpeech here https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech

Training pretty well on a Mandarin dataset so far (~30k steps) but haven't validated formally against LJSpeech (to be honest, I don't think I'll get time, so would prefer someone else to help out).

This is just the final architecture mentioned in the paper (so I haven't implemented any NAS).

Also the paper only mentioned the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too.

Great! :D How about the number of parameters in LightSpeech?

nmfisher commented 3 years ago

My early version of LightSpeech is: [parameter-count screenshot attached]

By comparison, FastSpeech 2 (v1) is: [parameter-count screenshot attached]

But given the paper claims 1.8M parameters for LightSpeech (vs 27M for FastSpeech 2), my implementation obviously still isn't 100% accurate. Feedback from the authors will help clarify the number of attention heads (and also the hidden size of each head).

Also I think the paper didn't implement PostNet, so removing that layer immediately eliminates ~4.3M parameters.

luan78zaoha commented 3 years ago

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), shrinking parameters in this order of impact: encoder dim > 1-D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to cut the model size. For the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and PostNet's share of the total model size grew from 27% to 62%. Interestingly, on the Baker dataset the results do not get worse after deleting PostNet at inference time, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

dathudeptrai commented 3 years ago

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), shrinking parameters in this order of impact: encoder dim > 1-D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to cut the model size. For the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and PostNet's share of the total model size grew from 27% to 62%. Interestingly, on the Baker dataset the results do not get worse after deleting PostNet at inference time, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

Yeah, PostNet is only for faster convergence; we can ignore it after training.
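
For reference, a sketch of how that looks with the published inference API: the FastSpeech2 inference call returns both the pre-PostNet and post-PostNet mels, so skipping PostNet at inference is just a matter of decoding `mel_before` instead of `mel_after` (model names below assume the pretrained LJSpeech checkpoints and may differ).

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

input_ids = processor.text_to_sequence("Postnet can be skipped at inference time.")
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# decode the pre-PostNet mel; per the discussion above, quality barely changes
audio = mb_melgan.inference(mel_before)[0, :, 0]
```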

dathudeptrai commented 3 years ago

@nmfisher 6M params is small enough. Did you get a good result with LightSpeech? How fast is it?

luan78zaoha commented 3 years ago

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), shrinking parameters in this order of impact: encoder dim > 1-D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to cut the model size. For the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and PostNet's share of the total model size grew from 27% to 62%. Interestingly, on the Baker dataset the results do not get worse after deleting PostNet at inference time, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

Yeah, PostNet is only for faster convergence; we can ignore it after training.

I'm sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small-size FastSpeech and LightSpeech? @nmfisher

dathudeptrai commented 3 years ago

I'm sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small-size FastSpeech and LightSpeech? @nmfisher

@luan78zaoha LightSpeech uses SeparableConvolution :D.

luan78zaoha commented 3 years ago

@dathudeptrai I used TF-Lite for inference on an x86 Linux platform. The result: the RTFs of the 45M and 10M models were 0.018 and 0.01, respectively.
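
For anyone reproducing these numbers, RTF here is synthesis time divided by the duration of the generated audio. A hedged sketch of the measurement with the TF-Lite interpreter (the model file name, dummy inputs, and the 22050 Hz sample rate / 256 hop size are assumptions that depend on the exported model and config):

```python
import time
import numpy as np
import tensorflow as tf

# hypothetical file name for the exported TF-Lite acoustic model
interpreter = tf.lite.Interpreter(model_path="fastspeech2.tflite")
interpreter.allocate_tensors()

# feed dummy data of the right dtype/shape to every input
# (a real benchmark would feed actual phoneme IDs, speaker IDs, ratios, ...)
for detail in interpreter.get_input_details():
    shape = [dim if dim > 0 else 1 for dim in detail["shape"]]
    dummy = np.ones(shape, dtype=detail["dtype"])
    interpreter.set_tensor(detail["index"], dummy)

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start

# assume output 0 is the mel spectrogram [1, frames, n_mels];
# audio length = frames * hop_size / sample_rate (here 256 / 22050, config-dependent)
mel = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
audio_seconds = mel.shape[1] * 256 / 22050
print(f"RTF: {elapsed / audio_seconds:.4f}")
```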

dathudeptrai commented 3 years ago

@dathudeptrai I used TF-Lite for inference on an x86 Linux platform. The result: the RTFs of the 45M and 10M models were 0.018 and 0.01, respectively.

Let's wait for @luan78zaoha to report the LightSpeech RTF :D.

nmfisher commented 3 years ago

@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not including the PostNet module), shrinking parameters in this order of impact: encoder dim > 1-D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to cut the model size. For the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and PostNet's share of the total model size grew from 27% to 62%. Interestingly, on the Baker dataset the results do not get worse after deleting PostNet at inference time, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.

Yeah, PostNet is only for faster convergence; we can ignore it after training.

I'm sorry, I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small-size FastSpeech and LightSpeech? @nmfisher

As @dathudeptrai mentioned, LightSpeech uses SeparableConvolution in place of regular Convolution, but then also passes various FastSpeech2 configurations through neural architecture search to determine the best configuration of kernel sizes/attention heads/attention dimensions. Basically they use NAS to find the smallest configuration that performs as well as FastSpeech2.
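
To make the SeparableConvolution point concrete, here is a quick parameter-count comparison at FastSpeech-like sizes (384 channels, kernel size 9; numbers are illustrative only, not the actual LightSpeech configuration):

```python
import tensorflow as tf

channels, kernel = 384, 9
inputs = tf.keras.Input(shape=(None, channels))

regular = tf.keras.Model(
    inputs, tf.keras.layers.Conv1D(channels, kernel, padding="same")(inputs))
separable = tf.keras.Model(
    inputs, tf.keras.layers.SeparableConv1D(channels, kernel, padding="same")(inputs))

# Conv1D:          9 * 384 * 384 + 384       = 1,327,488 params
# SeparableConv1D: 9 * 384 + 384 * 384 + 384 =   151,296 params (~9x smaller)
print("Conv1D params:         ", regular.count_params())
print("SeparableConv1D params:", separable.count_params())
```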

debasish-mihup commented 3 years ago

@dathudeptrai @Xuefeng Can you help me implement HiFi-GAN with FastSpeech2 on Android? I have tried to implement it by using a pretrained model from https://github.com/tulasiram58827/TTS_TFLite/tree/main/models and changing the line
https://github.com/TensorSpeech/TensorFlowTTS/blob/9a107d98bd20e8030c07e03a5857f62f36d69270/examples/android/app/src/main/java/com/tensorspeech/tensorflowtts/module/FastSpeech2.java#L73 to handle the model's input data shape, but the output is pure noise.

StuartIanNaylor commented 3 years ago

Not really a request, just wondering about the use of librosa. I have been playing around with https://github.com/google-research/google-research/tree/master/kws_streaming, which uses internal methods for MFCC. The one it uses is the python.ops one, but tf.signal was also quite a performance boost over using librosa.

Is there any reason to prefer librosa over, say, tf.signal.stft and tf.signal.linear_to_mel_weight_matrix? They seem extremely performant.
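
For reference, a sketch of the tf.signal path being suggested, with common TensorFlowTTS-style settings hard-coded (22050 Hz, 1024-point FFT, hop 256, 80 mel bins; the repo's actual preprocessing currently goes through librosa, so treat this as an illustration rather than a drop-in replacement):

```python
import tensorflow as tf

def log_mel_spectrogram(audio, sample_rate=22050, n_fft=1024, hop=256, n_mels=80):
    """Log-mel spectrogram computed with tf.signal only (no librosa)."""
    stft = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop, fft_length=n_fft)
    magnitude = tf.abs(stft)                      # [frames, n_fft // 2 + 1]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=n_fft // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=0.0,
        upper_edge_hertz=sample_rate / 2,
    )
    mel = tf.matmul(magnitude, mel_matrix)        # [frames, n_mels]
    return tf.math.log(mel + 1e-6)

audio = tf.random.normal([22050])         # 1 second of fake audio
print(log_mel_spectrogram(audio).shape)   # (frames, 80)
```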

Collin-Budrick commented 3 years ago

@dathudeptrai What do you think of voice cloning?

I have no doubt this project would work wonders on voice cloning.

StuartIanNaylor commented 3 years ago

With the FastSpeech TFLite model, is it possible to convert it to run on an Edge TPU? If so, are there any examples of how?

zero15 commented 3 years ago

Will Tacotron2 support full-integer quantization in TFLite? Full-integer quantization of the current model fails with "pybind11::init(): factory function returned nullptr." It's likely because the model has multiple subgraphs.
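
For context on the two TFLite questions above, the usual full-integer post-training quantization path looks roughly like this sketch (the SavedModel path and the `representative_inputs` generator are hypothetical; Tacotron2's dynamic, multi-subgraph graph is exactly where this tends to fail, and an Edge TPU additionally requires the resulting int8 model to be passed through Google's edgetpu_compiler):

```python
import numpy as np
import tensorflow as tf

def representative_inputs():
    """Hypothetical calibration data: a few phoneme-ID sequences."""
    for _ in range(10):
        ids = np.random.randint(1, 100, size=(1, 50), dtype=np.int32)
        yield [ids]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_inputs
# force full-integer kernels; float fallback is not allowed on the Edge TPU
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```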

ZDisket commented 3 years ago

@dathudeptrai Can you help with implementing a forced-alignment attention loss for Tacotron2, like in this paper? I've managed to turn MFA durations into alignments and put them in the dataloader, but replacing the regular guided attention loss only makes the attention learning worse, both when fine-tuning and when training from scratch, according to eval results after 1k steps, whereas in the paper the PAG version should be winning.

dathudeptrai commented 3 years ago

@ZDisket let me read the paper first :D.

ZDisket commented 3 years ago

@dathudeptrai Since that post I discovered that an MAE loss between the generated and forced attention works to guide it, but it's so strong that it ends up hurting performance. That could be fixed with a low enough multiplier, like 0.01, although I haven't tested it extensively, as I abandoned it in favor of training a universal vocoder with a trick. [attention alignment plots attached]
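
For concreteness, the down-weighted MAE term described above could look like this sketch (names are illustrative; `forced_alignment` would come from MFA durations expanded into a [batch, decoder_steps, encoder_steps] matrix):

```python
import tensorflow as tf

def forced_attention_loss(predicted_attention, forced_alignment, weight=0.01):
    """MAE between predicted attention weights and a forced alignment,
    scaled down so it guides attention rather than dominating the total loss.

    predicted_attention, forced_alignment: [batch, decoder_steps, encoder_steps]
    """
    mae = tf.reduce_mean(tf.abs(predicted_attention - forced_alignment))
    return weight * mae

# toy usage: this term would be added to the usual Tacotron2 losses
pred = tf.random.uniform([2, 300, 60])
forced = tf.random.uniform([2, 300, 60])
print(float(forced_attention_loss(pred, forced)))
```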

alexdemartos commented 3 years ago

This looks really interesting:

https://arxiv.org/pdf/2106.03167v2.pdf

ZDisket commented 3 years ago

@tts-nlp That looks like an implementation of Algorithm 1. For the second and third, they mention a shift-time transform:

In order to obtain the shift-time transform, the convolution technique was applied after obtaining a DFT matrix or a Fourier basis matrix in most implementations. OLA was applied to obtain the inverse transform.

rgzn-aiyun commented 3 years ago

@dathudeptrai What do you think of voice cloning?

Hey, I've seen a project about voice cloning recently.

rgzn-aiyun commented 3 years ago

This looks really interesting:

https://github.com/KuangDD/zhrtvc

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

abylouw commented 2 years ago

Anybody working on VQTTS?

qxde01 commented 1 year ago

I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, about 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker embedding model, so I then used a pretrained wenet-wespeaker model (Chinese) to extract the speaker embedding vectors, but that also works badly. Has anyone tried it?

In addition, the TensorFlowTTS project is not very active; it has not been updated for more than a year.

StuartIanNaylor commented 1 year ago

I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, about 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker embedding model, so I then used a pretrained wenet-wespeaker model (Chinese) to extract the speaker embedding vectors, but that also works badly. Has anyone tried it?

In addition, the TensorFlowTTS project is not very active; it has not been updated for more than a year.

I've just been looking at wenet but haven't really made an appraisal; so far it seems 'very Kaldi' :)