Open rishikksh20 opened 6 months ago
This architecture supports zero-shot text-to-speech (TTS) capabilities. However, its primary design goal is to achieve a lightweight and fast system. Therefore, the performance for unseen speakers cannot be guaranteed.
If we were to scale the model up to 1 billion parameters and train it on a dataset exceeding ten million hours, as NaturalSpeech 3 does, its zero-shot performance might improve.
Completely agree with you. Just one more thing: how are the samples coming out so far?
Thank you for sharing this project!
I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?
> Completely agree with you. Just one more thing: how are the samples coming out so far?
I will release the pretrained checkpoints within one to two weeks.
Currently, I am fine-tuning the network structure in flow matching. I've discovered that substituting some of the DiT blocks with convolutional layers yields better results at a smaller parameter count and significantly accelerates convergence.
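As a back-of-the-envelope illustration of why such a substitution shrinks the parameter count: the block structures below are generic assumptions (a standard pre-norm DiT block and a plain two-layer conv block), not StableTTS's exact layers.

```python
# Rough per-block parameter counts, ignoring biases and norms.
# Assumed structures (generic, not StableTTS's exact layers):
#   DiT block:  self-attention (q, k, v, out projections) + MLP with 4x expansion
#   Conv block: two 1D convolutions over the hidden channels, kernel size k

def dit_block_params(h: int) -> int:
    attn = 4 * h * h        # four h-by-h projections
    mlp = 2 * (h * 4 * h)   # up- and down-projection at 4x expansion
    return attn + mlp       # 12 * h^2 total

def conv_block_params(h: int, k: int = 3) -> int:
    return 2 * (k * h * h)  # two conv1d layers, kernel size k

h = 256
print(dit_block_params(h))      # 786432
print(conv_block_params(h, 3))  # 393216, half the DiT block
```

With a kernel size of 3, each conv block costs roughly half the parameters of the assumed DiT block at the same width, which is consistent with the "better results under a smaller parameter count" observation.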
> Thank you for sharing this project! I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
> Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?
In addition to expanding the dataset, scaling up model parameters involves increasing both the width and depth of the model. This can be achieved by modifying the `ModelConfig` in `config.py`. For example, you could set `hidden_channels` to 1024, `filter_channels` to 2048, and `n_layers` to 12.
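As a sketch, the change might look like the following. The field names are the ones mentioned above, but the dataclass layout and default values are assumptions; the actual `ModelConfig` in StableTTS's `config.py` may contain other fields.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_channels: int = 256   # model width (defaults here are illustrative)
    filter_channels: int = 1024  # feed-forward / filter width
    n_layers: int = 6            # model depth

# Scaled-up configuration, per the suggestion above:
big = ModelConfig(hidden_channels=1024, filter_channels=2048, n_layers=12)
print(big)
```

Width scales the per-layer cost roughly quadratically, while depth scales it linearly, so these three values together determine most of the parameter count.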
@KdaiP I have also been training a slightly bigger model (72M params) on LibriTTS (English) + our own Hindi dataset (around 800 hours total) at a batch size of 8 without gradient accumulation. Up to 72k steps I get decent results, at least listenable and understandable, but my main interest is unseen zero-shot synthesis and emotion + prosody transfer from the prompt.
For me, what matters most is how well the model performs multilingually and how well it captures prosody from reference audio, especially cross-lingual prosody: how well it transfers one language's speaker prosody to another. The speaker component and timbre are not that important to me, since we can make any TTS zero-shot by applying a VC model.
@KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
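The frame-level mapping this idea describes can be sketched as follows. All names and dimensions here are hypothetical, and the single linear map is a stand-in for a trained network (e.g. a flow-matching model); the point is only that the output length equals the input length, so no duration model is needed.

```python
import random

# Hypothetical dimensions (illustrative, not StableTTS's actual sizes):
T, d_sem, d_spk, d_voc = 50, 8, 4, 6   # frames, semantic / speaker / Vocos dims

semantic = [[random.random() for _ in range(d_sem)] for _ in range(T)]  # (T, d_sem)
speaker = [random.random() for _ in range(d_spk)]                      # (d_spk,)

# Condition every frame on the same speaker latent.
cond = [frame + speaker for frame in semantic]                         # (T, d_sem + d_spk)

# Stand-in for a trained network: one linear map to the Vocos latent space.
W = [[random.gauss(0, 0.01) for _ in range(d_voc)] for _ in range(d_sem + d_spk)]
vocos_latent = [[sum(c[i] * W[i][j] for i in range(len(c))) for j in range(d_voc)]
                for c in cond]

# One-to-one frame mapping: T frames in, T frames out.
print(len(vocos_latent), len(vocos_latent[0]))  # 50 6
```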
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
Thank you for your interest in StableTTS! DDSP6.0 and ReFlow-VAE-SVC have already used reflow (which is very similar to flow matching) to do voice conversion and have achieved decent results. I recommend checking out these two repositories for more information (≧▽≦).
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please share the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
>
> Could you please share the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.
Hi, we have released a new 31M model with bug fixes and audio quality improvements. It is much better than the 78M model I mentioned previously.
Hi @KdaiP, nice work! I'd just like to know whether this architecture is intended to support zero-shot TTS or a normal multi-speaker kind of TTS.