Enhancement: RVC .pth support

dnhkng / GlaDOS

This is the Personality Core for GLaDOS, the first steps towards a real-life implementation of the AI from the Portal series by Valve.

MIT License


Enhancement: RVC .pth support #53

Closed darkstorm2150 closed 5 months ago

darkstorm2150 commented 5 months ago

I'm wondering if it's possible to add support for RVC, which uses a .pth file for a voice. It would be a game changer, since custom voices can be trained quickly. Converting .pth to .onnx is a bit technical, unless there is a one-click converter somewhere that I missed?

dnhkng commented 5 months ago

Could you link to the repo for training?

dnhkng commented 5 months ago

I looked it up, and it seems RVC is speech-to-speech, not text-to-speech, so it's not usable for this project.

MithrilMan commented 5 months ago

I'm wondering how much latency it would add to chain TTS + RVC.
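One rough way to answer the latency question is to time each stage separately. The sketch below is only an illustration: `fake_tts` and `fake_rvc` are hypothetical stand-ins, not part of GLaDOS or RVC, and would need to be swapped for real inference calls before the numbers mean anything.

```python
import time
from typing import Callable


def timed(stage: Callable, payload):
    """Run one pipeline stage and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = stage(payload)
    return result, time.perf_counter() - start


def fake_tts(text: str) -> list:
    # Hypothetical stand-in for a TTS model: 100 silent samples per character.
    return [0.0] * (len(text) * 100)


def fake_rvc(samples: list) -> list:
    # Hypothetical stand-in for voice conversion: returns a copy of the audio.
    return [s * 1.0 for s in samples]


def measure_pipeline(text: str, tts: Callable = fake_tts, vc: Callable = fake_rvc) -> dict:
    """Time TTS and voice conversion separately, then report the total."""
    audio, tts_s = timed(tts, text)
    converted, vc_s = timed(vc, audio)
    return {
        "tts_s": tts_s,
        "vc_s": vc_s,
        "total_s": tts_s + vc_s,
        "samples": len(converted),
    }


if __name__ == "__main__":
    print(measure_pipeline("The cake is a lie."))
```

The `vc_s` value is the extra latency the conversion step would add on top of plain TTS for a given sentence.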

MithrilMan commented 5 months ago

I found this Reddit post: https://www.reddit.com/r/RASPBERRY_PI_PROJECTS/comments/1ciadap/make_any_voice_including_rvc_voices_into_an/

which led to this repo: https://github.com/domesticatedviking/TextyMcSpeechy

I haven't had time to look into it yet.

dnhkng commented 5 months ago

I understand what this does (lets you generate a training set, and then train a VITS model), but it's too far out of scope for GLaDOS.

If you want to work on this, and can extract out the core code, I would be OK with a PR that adds in this functionality though.

MithrilMan commented 5 months ago

From your experience, how long does it take to train a new language? When you trained GLaDOS, did you start from a checkpoint? I have a 3090 Ti and I'm wondering how long it would take. Maybe it's better to rent a VM for training, so as not to risk burning out the GPU.

Btw, I don't think training a voice should be part of this repo; maybe it belongs in a separate one.

dnhkng commented 5 months ago

Training a new voice on a 3090 will take about a day, so it's no risk to your GPU. It will probably be pretty good overnight.

fonix232 commented 5 months ago

@dnhkng how much audio material does one need for a well-trained model?

I've been thinking of making an alternative voice for SARAH, but unlike GLaDOS there's no "clean" audio, so most of it would be clippings from the TV show, meaning lots of interference from the soundtrack and such.

dnhkng commented 5 months ago

Hard to say. GLaDOS doesn't convey much emotion in her voice, so it was probably easier than a more human voice. I would start with about 30 minutes. Make sure it's varied and has no background noise.

The first time, I didn't remove contaminated samples, and the results were much worse. Cutting about 15% of my training set produced a better final result. Quality over quantity!
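As a rough illustration of that cleanup step, the sketch below drops clips flagged as contaminated and reports how much audio is left. The `Clip` record and its `contaminated` flag are assumptions made for this example (for instance, set by hand while listening through the clips); this is not the tooling that was actually used for GLaDOS.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str            # hypothetical file name
    duration_s: float    # clip length in seconds
    contaminated: bool   # assumed to be set during manual review


def filter_clips(clips):
    """Return (kept clips, kept minutes, fraction of audio removed)."""
    kept = [c for c in clips if not c.contaminated]
    total_s = sum(c.duration_s for c in clips)
    kept_s = sum(c.duration_s for c in kept)
    removed_fraction = 0.0 if total_s == 0 else (total_s - kept_s) / total_s
    return kept, kept_s / 60, removed_fraction


if __name__ == "__main__":
    clips = [
        Clip("a.wav", 60.0, False),
        Clip("b.wav", 30.0, True),   # e.g. soundtrack bleeding into the voice
        Clip("c.wav", 90.0, False),
    ]
    kept, minutes, removed = filter_clips(clips)
    print(f"kept {len(kept)} clips, {minutes:.1f} min, removed {removed:.0%}")
```

Comparing the kept minutes against the suggested starting point of about 30 minutes shows whether more source audio is needed after cleaning.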