ghost closed this issue 3 years ago
Hey guys, just went through this thread quickly.
I indeed removed the ReLU layer in the voice encoder we use at Resemble.AI. I think the model on Resemblyzer still has it. I plan to release a new one which, among other things, won't be trained on data that has silences at the start and end of each clip.
I don't think I'll update the code in this repo, but I should update the code on Resemblyzer when the new model's released.
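For context on why that ReLU matters: in a GE2E-style encoder, the final hidden state is projected and L2-normalized into the speaker embedding. A minimal numpy sketch (illustrative shapes and names, not the repo's actual code) shows that with a ReLU before normalization, every embedding component is non-negative, confining embeddings to one orthant of the unit hypersphere:

```python
import numpy as np

def embed(h, W, b, use_relu=True):
    """Project a final hidden state h into an L2-normalized embedding.

    Shapes are illustrative: h (hidden,), W (emb_dim, hidden), b (emb_dim,).
    """
    e = W @ h + b
    if use_relu:
        e = np.maximum(e, 0.0)  # ReLU clamps negative components to zero
    return e / np.linalg.norm(e)

rng = np.random.default_rng(0)
h = rng.standard_normal(256)
W = rng.standard_normal((64, 256)) / 16.0
b = np.zeros(64)

with_relu = embed(h, W, b, use_relu=True)
without_relu = embed(h, W, b, use_relu=False)

# With the ReLU, no embedding component can be negative, so a large part
# of the unit hypersphere is unreachable. Without it, signs are preserved.
print((with_relu >= 0).all())   # True
print((without_relu < 0).any())
```

Both variants still produce unit-norm embeddings; the difference is only in which directions of the embedding space are reachable.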
Hi, we are making a group effort to build a new dataset, curated and cleaned: quality over quantity.
VoxCeleb is rejected due to its horrible quality. VCTK might eventually have some bits in it. CommonVoice will be part of it. LibriTTS 100 / 360 / 500 will mostly be the base. (1st iteration)
Join the Slack for more info.
@mbdash what slack?
Anyone who wants to contribute in some way to the RTVC project is welcome to join the Slack. Leave a comment in #474 and we will provide an invite link.
> VoxCeleb is rejected due to its horrible quality.

The idea is to have a dataset with low-quality audio, though.
Mozilla TTS is also developing a speaker encoder in https://github.com/mozilla/TTS/issues/512. I am inviting Mozilla TTS contributors to this discussion to see if we can decide on a common model structure, and to share thoughts on datasets and preprocessing techniques. In the best case, we could even share the model.
A small update for anyone watching this thread. @steven and @blue-fish are doing some experimentation with training. I am currently cleaning up datasets to remove noise and artifacts from the source data used to train the models.
I saw @CorentinJ 's comment.
> VoxCeleb is rejected due to its horrible quality.

The idea is to have a dataset with low-quality audio, though.
We are just playing around, pooling our resources and experimenting to improve audio output quality, as well as adding some punctuation support.
I am done with LibriTTS train-clean-100 and progressing through train-clean-360.
@mbdash I believe corentin was referring to low quality audio (background noise, static, etc) being important while training the encoder. Clean audio is important for synthesis.
In the SV2TTS paper it is stated that "the audio quality [for encoder training] can be lower than for TTS training" but between this and the GE2E paper I have not seen a statement that it should be of lower quality. The noise might train the network to distinguish based on features that humans can perceive as opposed to subtler differences that can be found in clean audio. But I think it's worth running the experiment to see if this is truly the case.
@blue-fish I completely agree, running the experiment is the best option.
If anyone is wondering, training is still paused while @mbdash is denoising datasets and @steven850 is doing trial runs to determine best hparams for the encoder. It's a slow process. We plan to swap out the ReLU for a Tanh activation, but will try to match the model structure of the updated Resemblyzer encoder if it is released.
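On the planned ReLU-to-Tanh swap: the two activations shape the final projection very differently. A tiny numpy illustration (not the repo's code; in the actual PyTorch model this would be `torch.nn.Tanh` in place of `torch.nn.ReLU`):

```python
import numpy as np

# Hypothetical pre-activation outputs of the final projection layer.
raw = np.array([-2.0, -0.5, 0.5, 2.0])

relu_out = np.maximum(raw, 0.0)  # negative components collapse to 0: sign lost
tanh_out = np.tanh(raw)          # sign preserved, values squashed into (-1, 1)

print(relu_out)  # [0.  0.  0.5 2. ]
print(tanh_out)
```

Tanh keeps the sign information that ReLU discards while also bounding each component, which is the motivation for trying it here.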
It's going to be a while before we get back to encoder training. I'm going to close this issue for now. Will reopen when we restart.
- VoxCeleb1: Dev A - D as well as the metadata file (extract as `VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`)
- VoxCeleb2: Dev A - H (extract as `VoxCeleb2/dev`)
Hmm, which of these do I need to download? The site has both VoxCeleb metadata and audio files.
In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.
Instructions

- Download the datasets:
  - LibriSpeech: train-other-500 (extract as `LibriSpeech/train-other-500`)
  - VoxCeleb1: Dev A - D as well as the metadata file (extract as `VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`)
  - VoxCeleb2: Dev A - H (extract as `VoxCeleb2/dev`)
- Change `model_hidden_size` to 768 in `encoder/params_model.py`
- Run `python encoder_preprocess.py <datasets_root>`
- Start `visdom`
- Run `python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder`
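Put together, the steps above amount to the following session (`~/datasets` is a placeholder for wherever you extracted the datasets; the hparam edit is done by hand first):

```shell
# Assumes LibriSpeech, VoxCeleb1 and VoxCeleb2 are extracted under ~/datasets
# and that model_hidden_size has been set to 768 in encoder/params_model.py.

# Preprocess all encoder datasets into the SV2TTS layout:
python encoder_preprocess.py ~/datasets

# Start the visdom server in the background for training visualizations:
visdom &

# Train a new encoder model from scratch on the preprocessed data:
python encoder_train.py new_model_name ~/datasets/SV2TTS/encoder
```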