[MMS TTS] - Can we change the speaker's voice (not language), without fine-tuning? Any controllable parameters, or seed?

facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

MIT License


[MMS TTS] - Can we change the speaker's voice (not language), without fine-tuning? Any controllable parameters, or seed? #5198

Open QaisarRajput opened 1 year ago

QaisarRajput commented 1 year ago

❓ Questions and Help

Before asking:

search the issues.
search the docs.

What is your question?

I am using MMS TTS and it's amazing. So far, for one language (eng) there is only one speaker's voice. Are there any parameters or random seeds that can be changed to get an entirely different person's voice, without fine-tuning? Even if we can't control emotion or, say, voice pitch, is there a way to just get a random new, natural-sounding speaker?

Code

What have you tried?

MMS TTS and Hugging Face mms-tts

What's your environment?

fairseq Version (e.g., 1.0 or main): main
PyTorch Version (e.g., 1.0): 1.13
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): pip
Build command you used (if compiling from source):
Python version: 3.10

QaisarRajput commented 1 year ago

JFYI, for now the sampling rate is the only thing that can tune this a little: a higher value gives a deeper (slower) voice, while a lower value gives a thinner (faster) voice.
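
To make the sampling-rate trick concrete, here is a minimal sketch of the arithmetic, assuming the trick means saving the unchanged waveform with a different rate in the file header. The function name `relabel_effect` and the 16 kHz native rate are illustrative assumptions, not something taken from the MMS code. Under this reading a lower declared rate plays slower and deeper, and a higher one faster and thinner; the direction flips if the audio is instead resampled to the new rate and then played back at the original one.

```python
import math


def relabel_effect(native_rate: int, declared_rate: int, num_samples: int):
    """Effect of playing samples generated at `native_rate` as if they
    had been recorded at `declared_rate` (hypothetical helper).

    Returns (speed_factor, duration_seconds, pitch_shift_semitones).
    """
    speed = declared_rate / native_rate      # >1 means faster playback
    duration = num_samples / declared_rate   # seconds of audio heard
    semitones = 12 * math.log2(speed)        # >0 means higher pitch
    return speed, duration, semitones


# Assumed native rate for illustration; check the model config for the real one.
NATIVE_RATE = 16_000
NUM_SAMPLES = 2 * NATIVE_RATE  # two seconds of audio at the native rate

for declared in (12_000, 16_000, 20_000):
    speed, duration, semitones = relabel_effect(NATIVE_RATE, declared, NUM_SAMPLES)
    print(f"{declared} Hz: x{speed:.2f} speed, {duration:.2f} s, {semitones:+.1f} semitones")
```

Speed and pitch move together here, which is why this only shifts the existing voice rather than producing a different speaker.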

chevalierNoir commented 1 year ago

@QaisarRajput For now, controllable generation (e.g., changing gender, emotion, etc.) is not supported. You could consider cascading the MMS TTS model with an off-the-shelf voice cloning model to achieve this.

CopyNinja1999 commented 1 year ago

> @QaisarRajput For now, controllable generation (e.g., changing gender, emotion, etc.) is not supported. You could consider cascading the MMS TTS model with an off-the-shelf voice cloning model to achieve this.

Could you please name a VITS-based voice cloning repo that can achieve this? I found that directly fine-tuning the Korean model gives very poor results.

chevalierNoir commented 1 year ago

Not sure how this would work, but here is one example for voice conversion.

khof312 commented 5 months ago

I suggest looking into Coqui, which has recipes for using MMS-TTS (Fairseq) alongside voice cloning; I've used it successfully to change the voice's gender.
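
For reference, a rough sketch of what that can look like with Coqui's Python API. It assumes the `TTS` package is installed and that `tts_models/<lang>/fairseq/vits` is the naming scheme for the MMS (Fairseq) voices, as shown in Coqui's README. The helper names and file paths below are hypothetical, and the sketch makes no claim about output quality for any particular language.

```python
def fairseq_model_name(lang_code: str) -> str:
    """Build the Coqui model id for an MMS (Fairseq) voice.

    `lang_code` is an ISO 639-3 code such as "eng" (hypothetical helper).
    """
    return f"tts_models/{lang_code}/fairseq/vits"


def speak_in_target_voice(text: str, lang_code: str, speaker_wav: str, out_path: str) -> str:
    """Synthesize `text` with the MMS voice, then convert it toward the
    voice heard in `speaker_wav` (hypothetical wrapper around Coqui)."""
    # Imported lazily so the helper above works without Coqui installed.
    from TTS.api import TTS

    tts = TTS(fairseq_model_name(lang_code))
    # Runs TTS first, then voice conversion using the reference recording.
    tts.tts_with_vc_to_file(text, speaker_wav=speaker_wav, file_path=out_path)
    return out_path


# Example call (paths are placeholders; models are downloaded on first use):
# speak_in_target_voice("Hello there.", "eng", "target_speaker.wav", "output.wav")
```

The reference recording only needs to be a short clip of the target speaker; the text itself is still spoken by the single MMS voice before conversion.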

Regarding emotion and similar controls, Bark looks promising, but I haven't tested it yet.

sansmoraxz commented 1 week ago

Bark seems to be very slow, albeit more powerful.