Open rishikksh20 opened 6 months ago
This architecture supports zero-shot text-to-speech (TTS) capabilities. However, its primary design goal is to achieve a lightweight and fast system. Therefore, the performance for unseen speakers cannot be guaranteed.
If we were to scale the model up to 1 billion parameters and train it on a dataset exceeding ten million hours, as NaturalSpeech 3 does, its zero-shot performance might improve.
Completely agree with you. Just one more thing: how are the samples coming out so far?
Thank you for sharing this project!
I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?
> Completely agree with you. Just one more thing: how are the samples coming out so far?
I will release the pretrained checkpoints within one to two weeks.
Currently, I am fine-tuning the network structure in flow matching. I've discovered that substituting some of the DiT blocks with convolutional layers yields better results at a smaller parameter count and significantly accelerates convergence.
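As a back-of-the-envelope illustration of why such a substitution shrinks the parameter count: the block structures below are generic assumptions (a standard pre-norm DiT block and a plain two-layer conv block), not StableTTS's exact layers.

```python
# Rough per-block parameter counts, ignoring biases and norms.
# Assumed structures (generic, not StableTTS's exact layers):
#   DiT block:  self-attention (q, k, v, out projections) + MLP with 4x expansion
#   Conv block: two 1D convolutions over the hidden channels, kernel size k

def dit_block_params(h: int) -> int:
    attn = 4 * h * h        # four h-by-h projections
    mlp = 2 * (h * 4 * h)   # up- and down-projection at 4x expansion
    return attn + mlp       # 12 * h^2 total

def conv_block_params(h: int, k: int = 3) -> int:
    return 2 * (k * h * h)  # two conv1d layers, kernel size k

h = 256
print(dit_block_params(h))      # 786432
print(conv_block_params(h, 3))  # 393216, half the DiT block
```

With a kernel size of 3, each conv block costs roughly half the parameters of the assumed DiT block at the same width, which is consistent with the "better results under a smaller parameter count" observation.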
> Thank you for sharing this project! I have a model training on a limited dataset and I'm getting decent results after a few hours of training. You mentioned "If we were to scale up the model to 1 billion parameters...".
> Can you elaborate on how to scale up the model parameters? Does that just mean a larger dataset?
In addition to expanding the dataset, scaling up model parameters involves increasing both the width and depth of the model. This can be achieved by modifying the `ModelConfig` in `config.py`. For example, you could set `hidden_channels` to 1024, `filter_channels` to 2048, and `n_layers` to 12.
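As a sketch, the change might look like the following. The field names are the ones mentioned above, but the dataclass layout and default values are assumptions; the actual `ModelConfig` in StableTTS's `config.py` may contain other fields.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_channels: int = 256   # model width (defaults here are illustrative)
    filter_channels: int = 1024  # feed-forward / filter width
    n_layers: int = 6            # model depth

# Scaled-up configuration, per the suggestion above:
big = ModelConfig(hidden_channels=1024, filter_channels=2048, n_layers=12)
print(big)
```

Width scales the per-layer cost roughly quadratically, while depth scales it linearly, so these three values together determine most of the parameter count.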
@KdaiP I have also been training a slightly bigger model (72M params) on LibriTTS (English) + our own Hindi dataset (around 800 hours total) at a batch size of 8 without gradient accumulation. Up to 72k steps I get decent results, at least listenable and understandable, but my main interest is unseen zero-shot synthesis and emotion + prosody transfer from the prompt.
For me, what matters most is how well the model performs multilingually and how well it captures prosody from reference audio, especially cross-lingual prosody: how well it transfers one language's speaker prosody to another. The speaker component and timbre are not that important to me, since we can make any TTS zero-shot by applying a VC model.
@KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
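The frame-level mapping this idea describes can be sketched as follows. All names and dimensions here are hypothetical, and the single linear map is a stand-in for a trained network (e.g. a flow-matching model); the point is only that the output length equals the input length, so no duration model is needed.

```python
import random

# Hypothetical dimensions (illustrative, not StableTTS's actual sizes):
T, d_sem, d_spk, d_voc = 50, 8, 4, 6   # frames, semantic / speaker / Vocos dims

semantic = [[random.random() for _ in range(d_sem)] for _ in range(T)]  # (T, d_sem)
speaker = [random.random() for _ in range(d_spk)]                      # (d_spk,)

# Condition every frame on the same speaker latent.
cond = [frame + speaker for frame in semantic]                         # (T, d_sem + d_spk)

# Stand-in for a trained network: one linear map to the Vocos latent space.
W = [[random.gauss(0, 0.01) for _ in range(d_voc)] for _ in range(d_sem + d_spk)]
vocos_latent = [[sum(c[i] * W[i][j] for i in range(len(c))) for j in range(d_voc)]
                for c in cond]

# One-to-one frame mapping: T frames in, T frames out.
print(len(vocos_latent), len(vocos_latent[0]))  # 50 6
```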
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
Thank you for your interest in StableTTS! DDSP6.0 and ReFlow-VAE-SVC have already used reflow (which is very similar to flow matching) to do voice conversion and have achieved decent results. I recommend checking out these two repositories for more information (≧▽≦).
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.

Could you please share the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.
> @KdaiP the model seems powerful from my understanding; I have trained it on 1k hours of multilingual data with 78M params and it worked decently. I wonder whether this model could be transformed to do speech-to-speech voice conversion, where we give input semantic tokens and a target speaker latent, and it converts the semantics to the target speaker's Vocos latent. We wouldn't require any duration modeling at all; it's a one-to-one mapping.
>
> Could you please share the detailed configuration of your 78M model, such as the parameters in StableTTS/config.py? Much appreciated.
Hi, we have released a new 31M model with bug fixes and audio quality improvements. It is much better than the 78M model I mentioned previously.
Hi @KdaiP, nice work! I'd just like to know whether this architecture is intended to support zero-shot TTS or a normal multi-speaker kind of TTS.