erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low-VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Integration with RVC #229

Closed: gshawn3 closed this issue 1 month ago

gshawn3 commented 2 months ago

First of all, thank you for this wonderful project. I've been playing around with it this past week, both in standalone mode and as a text-generation-webui extension, and it's all working very well. The documentation is top-notch as well!

I noticed some lines of code mentioning "RVC Injection" here: https://github.com/erew123/alltalk_tts/blob/7c7cb72c134d66fbca2eea3445f8548d9c3bd985/system/st_files/index.js#L475

Is this working currently, or is that a feature that is still being worked on? I would really love to pass the generated audio through RVC because it makes voices sound 10x times more accurate, even after finetuning the XTTS model. If that's not currently possible, please consider adding this functionality in the future. Thanks, and good luck with the V2 update!

erew123 commented 2 months ago

Hi @gshawn3

That file is actually part of SillyTavern; it's only there because I had to submit it with the original AllTalk SillyTavern PR. In other words, the code you are pointing at has nothing to do with AllTalk itself.

As for RVC, do you mean using a Retrieval-based Voice Conversion model (as opposed to XTTS models etc.)?

Thanks

gshawn3 commented 2 months ago

Ah, that makes sense. Sorry, I should probably have taken a closer look at the code before asking the question.

And yes, that's right! A common pipeline for inferencing AI voices is to generate a sample with a finetuned XTTS / Tortoise / etc. model, and then route that sample through RVC. It increases the likeness of any cloned voice by a huge amount. See, for example, the first 30 seconds of https://www.youtube.com/watch?v=IcpRfHod1ic.
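In code terms, the two-stage pipeline looks roughly like this. A minimal sketch only: the Coqui TTS calls below are that library's documented API, but rvc_convert() is a hypothetical stand-in, since every RVC front end exposes a different interface.

from TTS.api import TTS

# Stage 1: synthesize with XTTS, cloning from a reference clip.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Testing the two-stage pipeline.",
    speaker_wav="reference_voice.wav",   # voice-clone reference (placeholder path)
    language="en",
    file_path="stage1_xtts.wav",
)

# Stage 2: pass the synthesized wav through RVC for extra voice likeness.
def rvc_convert(in_wav, out_wav, rvc_model_pth, index_path=None):
    """Hypothetical stand-in: call your RVC install of choice here."""
    raise NotImplementedError

rvc_convert("stage1_xtts.wav", "stage2_rvc.wav", "my_voice.pth")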

bobcate commented 2 months ago

I just set up RVC so I could have Expression-Based Dynamic Voice, and only then did I wonder: I'm using AllTalk with both a narrator and a character, so how is it even going to work with more than one voice? 😄

And it worked! But the narrator and character voices were combined into one. It's still good, though: the voice coming from RVC belongs to the same person (model), but the intonation differs between narration text and dialogue text, so it's very much usable: rvc-io.zip

erew123 commented 2 months ago

@gshawn3 @bobcate I think it looks like something I could include. It mostly just seems to be another layer of transcoding to add, and I've already added transcoding to five audio types into v2... so I think it would be possible.

As you are both using it already, I'm just trying to wrap my head around a few bits, as it would help me code/build something in the future. Are you actually using a custom RVC model, i.e. a finetuned one for the conversions you do? Or is it using the base models (these ones, I think: hubert_base.pt and rmvpe.pt)? And/or is there a need to select different RVC models when doing this process?

I was intending on adding just the base RVC models anyway, but obviously what you are describing above is a pipeline, which is slightly different from loading a model in.

Thanks

bobcate commented 2 months ago

I'm using the base model(s). There is hubert_base.pt (renamed to hubert_rvc.pt, 185 MB) and rmvpe.pt (176 MB). I can't speak to the technical details; this is just from my own observation.

I hope you can add it so we can use RVC on both voices (narrator + character) separately. Even if that doesn't work, it would still be better to have it integrated into AllTalk, because we would be using one less extension and wouldn't need the ST-extras server.

Dolyfin commented 2 months ago

I have trained an XTTSv2 model using alltalk_tts as a replacement for a regular TTS > RVC2 setup.

If you have the capability of finetuning on two medium-to-large datasets, you can theoretically get optimal results with two or even more voices by just swapping out the reference clone audio.

Finetune the model with datasets for characters 1 and 2, then inference with an audio sample of character 1 or 2 (setting them as character or narrator), as sketched below.
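A rough sketch of that reference-swapping idea, using the stock Coqui TTS API to load a single finetuned checkpoint (all paths here are placeholders):

from TTS.api import TTS

# One finetuned XTTS checkpoint; only the voice-clone reference changes per role.
tts = TTS(model_path="finetuned_xtts/", config_path="finetuned_xtts/config.json").to("cuda")

tts.tts_to_file(text="'Hello there,' she said.", speaker_wav="character1.wav",
                language="en", file_path="character_line.wav")
tts.tts_to_file(text="The door creaked open.", speaker_wav="character2.wav",
                language="en", file_path="narrator_line.wav")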

The above isn't a solution for the majority that can't train/finetune their own models (TTS or RVC). However, if you use AllTalk for the API, you can likely just pipe the AllTalk API output into the RVC gradio API in the webui to achieve similar results; the latency will be huge, though. A rough sketch of that chaining follows.
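A minimal sketch of that chaining, assuming a local AllTalk server. The /api/tts-generate endpoint and form fields follow AllTalk's documented API but should be verified against your install; the RVC leg is left as a stub because every RVC webui exposes a different API surface.

import requests

# Stage 1: ask a running AllTalk server for a generation.
resp = requests.post(
    "http://127.0.0.1:7851/api/tts-generate",
    data={
        "text_input": "Round-trip pipeline test.",
        "character_voice_gen": "female_01.wav",
        "narrator_enabled": "false",
        "language": "en",
        "output_file_name": "pipeline_test",
    },
    timeout=120,
)
result = resp.json()
print(result)  # the JSON includes where the generated wav was written

# Stage 2: hand that wav to an RVC front end.
def rvc_convert(wav_path):
    """Stub: POST wav_path to your RVC webui / Applio API of choice here."""
    raise NotImplementedError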

Mixomo commented 2 months ago

I'm joining this RVC request for standalone mode! 🙏🙏🙏 Looking forward to this thread; thanks in advance!

gshawn3 commented 1 month ago

> @gshawn3 @bobcate I think it looks like something I could include. It mostly just seems to be another layer of transcoding to add, and I've already added transcoding to five audio types into v2... so I think it would be possible.
>
> As you are both using it already, I'm just trying to wrap my head around a few bits, as it would help me code/build something in the future. Are you actually using a custom RVC model, i.e. a finetuned one for the conversions you do? Or is it using the base models (these ones, I think: hubert_base.pt and rmvpe.pt)? And/or is there a need to select different RVC models when doing this process?
>
> I was intending on adding just the base RVC models anyway, but obviously what you are describing above is a pipeline, which is slightly different from loading a model in.
>
> Thanks

Sorry for the late reply. Personally, I do use a finetuned model. Training a voice in RVC only takes 15-20 minutes on consumer hardware, and it makes a huge difference in the quality of the output. I actually used the exact same dataset to train RVC that I used when finetuning XTTSv2 with AllTalkTTS. The output from XTTSv2 + RVC is basically indistinguishable from my real voice. Thanks again for looking into this!

erew123 commented 1 month ago

I have a good news/bad news type scenario. Probably best to start with the good news...

So I looked further into RVC, and I've written enough code to, in theory, load a model and process audio with RVC. Writing that and figuring it out took a good while... that's the good news.

The bad news is that there is a compatibility issue between Fairseq (a package you need to install to load the hubert model) and various versions of Python.

I found someone who has written an updated version of Fairseq, https://github.com/VarunGumma/fairseq, but it won't compile (I'll explain more in a moment).

Also, there is a feature request with the RVC-Project, https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/2036, to move to Fairseq2. (It may be possible I can do this in some way myself, but I'll need a bit more time on this.)

So, back to the updated Fairseq I found: it won't compile (at least on Windows) because:

OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\useraccount\AppData\Local\Temp\pip-build-env-gitiuk5h\overlay\Lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

Which is one hell of a rabbit hole....

https://github.com/pytorch/pytorch/issues/125109 https://github.com/facebookresearch/fairseq/issues/5012

etc........

The long and short of it is that somewhere in the Python versions from 3.11 upwards, a bug was introduced with certain versions of PyTorch: when compiling things that use or need the MKL libraries, the build doesn't always look in the correct places for the required files. (A quick diagnostic is sketched below.)
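For what it's worth, a quick, hedged way to see which side of the bug a given environment is on (nothing more than a probe):

import sys
print(sys.version)

# On an affected Python/PyTorch combination this import itself can raise
# OSError [WinError 126], because importing torch loads the DLLs in torch\lib
# (including shm.dll) on Windows.
import torch
print(torch.__version__, torch.__file__)

# If the import above succeeds but the fairseq build still fails, the failure
# is inside pip's isolated build environment (which downloads its own torch);
# "pip install --no-build-isolation <package>" makes the build reuse this
# working environment instead.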

From my research, it's something the PyTorch developers are looking at, but it is not resolved yet. That leaves a few questions: which versions of Python will they fix it for? What will the corresponding PyTorch version be, as that can result in other dependency conflicts (things like DeepSpeed)? When might other things like text-generation-webui upgrade to a working PyTorch version (and how will they deal with any Python versions that may not be supported)? And a whole host of other questions.

So I'm not saying this is dead in the water; after all, I have written the base code to at least attempt it. But what I am saying is that I'll take a look at Fairseq2, and if there's no dice there, then it's really down to PyTorch: when their developers say they have fixed it, and what the resulting mess looks like.

As such, I'm going to close the ticket for now, move it to Feature Requests, and it's going to be one of those I keep an eye on.

erew123 commented 1 month ago

Scrub the Fairseq2 option...

[screenshot: Fairseq2 platform support table]

No Windows support, and a mess on other OSes.

gshawn3 commented 1 month ago

Thank you for looking into it.

I'm not sure if this could be helpful, but here is another TTS project that somewhat recently added RVC integration. It looks like they just import it as a submodule, and after that there's not a ton of code needed for inference:

https://github.com/JarodMica/ai-voice-cloning/commit/b7879cc8dc9b426087d43d33985cd3f64068f35f

Mixomo commented 1 month ago

Hey, here is another XTTS project with a webui and RVC:

https://github.com/daswer123/xtts-webui

It includes many very useful features, such as a model switcher, among others...

erew123 commented 1 month ago

Hi Everyone..

Bad news/good news.....

Bad news: I looked over lots of other projects that use RVC, and they all have the same problems with Python versions, PyTorch versions, Fairseq, compiling the loader, and so on. Honestly, the whole thing is a rabbit hole! Even some of the premier projects out there are dropping back to Python 3.9... there is literally no way around it, and it's a mess...

Unless you spend 20-ish hours re-writing ALL the model loaders, handling, etc. (and the code is damn complicated!). But...

[screenshot]

It will probably take another 20+ hours to tidy up all the code and actually integrate it, set up downloaders/model management, etc., and I hope I can get all the requirements files to work correctly! What you see on the screen below is just a test interface that I was using to debug and try to make it work.

And just to be clear, this will be an AllTalk v2 feature... I still have a few bits to do before I can release that as a BETA... so watch this space.

It handles all the methods though.

[screenshot: test interface]

erew123 commented 1 month ago

Apologies for my spelling mistakes in that last post... I've been looking at code far too long to spell correctly!

Mixomo commented 1 month ago

@erew123 Ooh, don't worry, and of course, no rush! You're doing all of this out of love! Do you have a tip jar? I'd love to support you on this wonderful project as soon as I can!

311-code commented 1 month ago

Edit: Just noticed your screenshot and comments at the bottom of the post above; not sure if my comment below is still relevant, but I can try to help out with the coding stuff if you need it, just let me know. Here is a direct link to the repo that converts XTTS to RVC

Original comment: Not sure if this would help, but it seems like this guy is manually feeding trained Coqui 2.0.2 output into an RVC webui at the end of the video. Would it be possible to have the alltalk extension somehow just send the .mp3 to a separate alltalk RVC extension based on the code from the repo he is using?

It uses Python 3.10.6, I think. I'm guessing you are already trying to do this, as you mentioned the autoloader stuff; just making sure. RVC does seem to really improve the results at the end of that video.

Maybe after alltalk_tts makes the .mp3, it could use some of the code from this repo, plus additional code, to automatically pull the .mp3 and run it through an "alltalk RVC" of sorts. I could try to help with this and pick up where you left off. Did you upload the files somewhere? I'm also not sure which specific files you were working on.

erew123 commented 1 month ago

@Mixomo If you want to tip, you are welcome to. There is a Ko-fi link on the right-hand side of the GitHub front page.

@brentjohnston You would be welcome to go through V2 when I have it up. I'm happy to take criticism or additions to the code :) The re-write I've done of RVC may, in theory, allow it to work on Python 3.12 and 3.13 now... though you'll need to manually compile fairseq for it, but it shouldn't bug out or error like all other versions do past Python 3.11... Though I am still mid-development, so who knows, something yet may happen to make me swallow my words!

Still, I've at least made a settings interface for RVC now! Now on with the chunk of work to make the rest of it function: model downloaders/management, API calls, logic trees for identifying when/where to run RVC calls, making the narrator function work with it, and a large tidy-up of the code. And that's just RVC.

[screenshot: RVC settings interface]

311-code commented 1 month ago

That looks great! Yeah, I trained the Star Trek TNG computer voice and made an interface for it; I was fairly happy with the voice and the AllTalk finetune. But after putting the .mp3s through an extra RVC finetune, I'm now mind-blown by the accuracy.

It beats my ElevenLabs v2 version for sure. I'm using Dragon NaturallySpeaking to type for me, and I have a Dragon "custom voice command" where I just say "make it so" and it presses Enter for me.

https://old.reddit.com/r/Oobabooga/comments/1bj7tx4/guide_the_easiest_way_to_modify_oobabooga_colors/ lcars-oobabooga

Can't wait to see RVC added, thanks for working on this.

erew123 commented 1 month ago

@brentjohnston Hey that's awesome work there! I bet that took some time?

You asked about merging models... Well, hmm. It's not something Coqui supported, or that I can find anyone ever having done. But, saying that, in theory, if you had two finetunes and they were built from the same base model (e.g. 2.0.2 or 2.0.3), it should be possible. I just have no idea what the resulting outcome would be! And the merged model (if it works) may need a very small, very-low-learning-rate extra finetune with both sets of training data afterwards, just to bump the merged weights for the finetuned data back up.

So a VERY, VERY hypothetical script would look like this (I don't think I can stress VERY, VERY enough there):

import os
from pathlib import Path

import torch
import gradio as gr
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

try:
    import deepspeed
    deepspeed_enabled = True
except ImportError:
    deepspeed_enabled = False
    pass

device = "cuda" if torch.cuda.is_available() else "cpu"

def merge_models(model1, model2, merge_type='weighted_avg', alpha=0.5):
    """
    Merge two XTTS models using different merging techniques.

    Args:
        model1 (Xtts): The first loaded XTTS model.
        model2 (Xtts): The second loaded XTTS model.
        merge_type (str): The type of merging technique to use. Options: 'weighted_avg', 'interpolate'.
        alpha (float): The weight factor for the weighted average or interpolation.

    Returns:
        Xtts: The merged XTTS model.
    """
    merged_model = Xtts(model1.config)  # Create a new instance of the Xtts class

    if merge_type == 'weighted_avg':
        # Perform weighted average of model parameters
        for param1, param2, param_merged in zip(model1.parameters(), model2.parameters(), merged_model.parameters()):
            param_merged.data = alpha * param1.data + (1 - alpha) * param2.data
    elif merge_type == 'interpolate':
        # Perform interpolation of model parameters
        for param1, param2, param_merged in zip(model1.parameters(), model2.parameters(), merged_model.parameters()):
            param_merged.data = torch.lerp(param1.data, param2.data, alpha)
    else:
        raise ValueError(f"Invalid merge type: {merge_type}")

    return merged_model

def merge_models_interface(model1_path, model2_path, merge_type, alpha, output_path):
    # Gradio textboxes pass plain strings; convert to Path so the "/" joins below work
    model1_path = Path(model1_path)
    model2_path = Path(model2_path)

    # Load model1
    config1 = XttsConfig()
    config1_path = model1_path / "config.json"
    vocab1_path_dir = model1_path / "vocab.json"
    checkpoint1_dir = model1_path
    config1.load_json(str(config1_path))
    model1 = Xtts.init_from_config(config1)
    model1.load_checkpoint(
        config1,
        checkpoint_dir=str(checkpoint1_dir),
        vocab_path=str(vocab1_path_dir),
        use_deepspeed=deepspeed_enabled,
    )
    model1.to(device)

    # Load model2
    config2 = XttsConfig()
    config2_path = model2_path / "config.json"
    vocab2_path_dir = model2_path / "vocab.json"
    checkpoint2_dir = model2_path
    config2.load_json(str(config2_path))
    model2 = Xtts.init_from_config(config2)
    model2.load_checkpoint(
        config2,
        checkpoint_dir=str(checkpoint2_dir),
        vocab_path=str(vocab2_path_dir),
        use_deepspeed=deepspeed_enabled,
    )
    model2.to(device)

    # Merge models
    merged_model = merge_models(model1, model2, merge_type=merge_type, alpha=alpha)

    # Create the output folder if it doesn't exist (output_path is treated as a
    # folder, matching the default value in the interface below)
    os.makedirs(output_path, exist_ok=True)

    # Save the merged model
    merged_model.save_checkpoint(output_path)

    return f"Merged model saved at: {output_path}"

iface = gr.Interface(
    fn=merge_models_interface,
    inputs=[
        gr.components.Textbox(label="Path to Model 1", value="c:\mymodels\mymodelfolder_1"),
        gr.components.Textbox(label="Path to Model 2", value="c:\mymodels\mymodelfolder_2"),
        gr.components.Radio(["weighted_avg", "interpolate"], value="weighted_avg", label="Merge Type"),
        gr.components.Slider(minimum=0, maximum=1, step=0.1, value=0.5, label="Alpha"),
        gr.components.Textbox(label="Output Path", value="c:\mymodels\mymodeloutputfolder"),
    ],
    outputs=gr.components.Textbox(label="Result"),
    title="Merge XTTS Models",
    description="Merge two XTTS models using different merging techniques.",
)

if __name__ == "__main__":
    iface.launch()

This would obviously have to be run in the AllTalk Python environment (or one that has TTS installed). FYI, I have not debugged, tested, or otherwise tried this script. I honestly cannot say what the resulting model would be like, or whether it will work at all.

But if you wanted to give it a go... you could. I would copy some model folders somewhere safe and play around that way. It will create the output folder path specified, if it doesn't exist.

No idea what merging ratios would be good, but 0.5 (50/50) would probably be the most sensible.

It's highly possible, though, that you end up with a model that just doesn't work. It could take a lot of testing to figure out whether this process works, or whether there are other things that need doing to make it work. And who knows, there may be some real quirks of Coqui's models that require a really deep dive into their code to figure out.

This is the start of the source code for XTTS:

https://docs.coqui.ai/en/latest/_modules/TTS/tts/models/xtts.html#

311-code commented 1 month ago

Thanks for the info! I gave it a shot, and this is what I came up with, but sadly there are some errors I can't get past (I'm on my phone and forgot the specific error). I even tried brute-force interpolating the keys to see if it was theoretically possible to blend a 2.0.2 model with 2.0.3 somehow. Not sure if the code below is useful to you; I'm going to go through your code tonight and try some of it out.

The reason I wanted to try this out is that the gradio RVC interface I'm using has a merge-model area, and blending two trained RVC models definitely increased the quality a lot for me. I also noticed this with SDXL finetunes when merging two trained subjects, so I figured it might be a similar thing.

Thanks again Brent

Merge.py

import os
import torch
import json
from TTS.tts.models.xtts import Xtts
from TTS.tts.configs.xtts_config import XttsConfig

# Define paths
model1_path = "C:/models/xtts2/model.pth"
config1_path = "C:/models/xtts2/config.json"
model2_path = "C:/models/xtts/model.pth"
config2_path = "C:/models/xtts/config.json"

def load_config(config_path):
    with open(config_path, 'r') as f:
        return json.load(f)

def align_and_interpolate_configs(config1, config2):
    merged_config = {}
    for key in set(config1.keys()).union(config2.keys()):
        if key in config1 and key in config2:
            value1, value2 = config1[key], config2[key]
            if isinstance(value1, (int, float)) and isinstance(value2, (int, float)):
                merged_config[key] = (value1 + value2) / 2
            else:
                merged_config[key] = value1 if key in config1 else value2
        elif key in config1:
            merged_config[key] = config1[key]
        else:
            merged_config[key] = config2[key]
    return merged_config

def merge_weights(model1, model2):
    merged_state_dict = model1.state_dict()
    for key in merged_state_dict.keys():
        if key in model2.state_dict():
            merged_state_dict[key] = (model1.state_dict()[key] + model2.state_dict()[key]) / 2
        else:
            print(f"Warning: Key {key} missing in model2, using value from model1.")
    return merged_state_dict

def save_merged_model(model, config, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    model_path = os.path.join(save_dir, "merged_model.pth")
    config_path = os.path.join(save_dir, "merged_config.json")

    torch.save(model.state_dict(), model_path)
    with open(config_path, 'w') as f:
        json.dump(config, f)

def main():
    config1 = load_config(config1_path)
    config2 = load_config(config2_path)

    merged_config_dict = align_and_interpolate_configs(config1, config2)

    # Correctly initialize the merged configuration
    merged_config = XttsConfig()
    for key, value in merged_config_dict.items():
        setattr(merged_config, key, value)

    try:
        model1 = Xtts(merged_config)
        model2 = Xtts(merged_config)
    except TypeError as e:
        print(f"Error in model initialization: {e}")
        print("Are you trying to merge models with significantly different configurations?")
        return

    model1.load_state_dict(torch.load(model1_path, map_location=torch.device('cpu')))
    model2.load_state_dict(torch.load(model2_path, map_location=torch.device('cpu')))

    merged_state_dict = merge_weights(model1, model2)

    merged_model = Xtts(merged_config)
    merged_model.load_state_dict(merged_state_dict)

    save_merged_model(merged_model, merged_config_dict, "C:/models/merged")

if __name__ == "__main__":
    main()



311-code commented 1 month ago

I mistyped; I meant to say increased likeness in SDXL, i.e. when training the same subject on two different SDXL models and merging them.



erew123 commented 1 month ago

@brentjohnston I'm not sure a 2.0.2 model would blend with a 2.0.3 model. The reason is that they introduced three entirely new languages in 2.0.3 that didn't exist in 2.0.2 (one of those languages, Hindi, is undocumented). So the underlying dataset of the models isn't the same, potentially to a large extent, along with the models' config.json files. As such, I don't know how well a merge of two finetuned models would work out; I'm not sure it would be able to pick out just the finetuned parts as the differences to merge. At least, those are my rough thoughts. In other words, it might be like trying to merge an SD 1.5 model with an SD 2.0 model: they are just too far apart for it to work.
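For anyone tempted to try anyway, here is a hedged little probe (assumptions: plain torch checkpoints, possibly with weights nested under a "model" key) that checks whether two checkpoints are even shape-compatible before attempting a merge:

import torch

def state_dict_of(path):
    # XTTS checkpoints may nest weights under a "model" key; otherwise treat
    # the file as a raw state dict.
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "model" in ckpt:
        return ckpt["model"]
    return ckpt

def merge_compatible(path_a, path_b):
    a, b = state_dict_of(path_a), state_dict_of(path_b)
    if a.keys() != b.keys():
        print("Parameter names differ, e.g.:", sorted(a.keys() ^ b.keys())[:5])
        return False
    mismatched = [k for k in a if a[k].shape != b[k].shape]
    if mismatched:
        # New languages mean new tokens, so embedding/output layers can grow.
        print("Shape mismatches, e.g.:", mismatched[:5])
        return False
    return True

print(merge_compatible("C:/models/xtts_2.0.2/model.pth", "C:/models/xtts_2.0.3/model.pth"))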

erew123 commented 1 month ago

Everyone will be happy to hear that RVC is now working with the narrator function. I can tell you all, it kicked my ass getting this working. Beyond the initial re-write of RVC to get it working with Python 3.11+, I then spent 8-10 hours figuring out how to deal with variable model index sizes and get the best quality-vs-performance balance (aka lots of complicated maths I'm not too sure I understand). However, I got there in the end, and I've actually set up a whole new RVC feature: it gives you a trade-off between performance and quality.

[screenshot]

And after that, integrating it with the narrator and dealing with the coding fallout from that (plus one small line of code that screwed things up for an hour before I found and deleted it) took me the best part of 7 hours. (I hate having to re-write and work with the narrator on things; it gets so damn complicated.) But... it's done... it works... I'm going to make a backup of the code before some accident happens.

So that's a very big thing out of the way. Hopefully lots of the other bits I want to do go smoothly, and I can get a beta out soon (it will probably have been tested only on Windows by that point).

[screenshot]

Dolyfin commented 1 month ago

Thank you for the hard work. Although I can't fathom how much VRAM this would need locally with just one GPU, as I remember it using 5GB+ in the RVC webui. It might even be possible to quantise the models, although I've yet to see anyone do that with XTTS and RVC.

Would love to see some early latency testing and VRAM usage.

erew123 commented 1 month ago

@Dolyfin As I have so much code to punch through at the moment, for now I'm allocating my time to that and to getting something out for people to test/try. That said, RVC seemed to add about an extra 1GB to 1.3GB of VRAM use. The first RVC generation (see the images above) is slower if it's loading a new .pth file into VRAM, which adds a short load time; a minimal sketch of that per-voice caching idea is below. The training index size setting I have introduced impacts generation time (you can see the index size used listed in the image above). As best I understand it, different .pth files and their associated *.index files (if you are using index files) have indexes of different lengths: some could be 20,000 entries long, some 80,000. The more of the index you use, the better the sample generation, but the more processing time is required; hence I've given everyone the option to set what they like, and that will affect latency.
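A minimal sketch of that first-generation loading cost and why later generations avoid it (a hypothetical helper, not AllTalk's actual code):

import torch

_rvc_cache = {}

def get_rvc_weights(pth_path, device="cuda"):
    # First request for a given .pth pays the disk-read + VRAM-upload cost
    # (the slower "first RVC generation"); later requests reuse the weights.
    if pth_path not in _rvc_cache:
        _rvc_cache[pth_path] = torch.load(pth_path, map_location=device)
    return _rvc_cache[pth_path]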

Hope that gives some information.

Thanks

erew123 commented 1 month ago

@Dolyfin I've managed to find a brief moment to put something together to show you how the indexing affects generation times.

I know the index on this file is about 76,000 entries (the most it can be indexed at); others I have are around 40,000, and it varies by file. The indexing function I've introduced means "use at most this much of the index", hence setting it at 20,000, 40,000, 60,000 and 80,000 (the last being over the amount this file can be indexed by). You can see a fairly linear relationship: the more of the index you use, the longer it takes, but the higher the quality of the end result. I'm not yet sure if there is a good middle ground, but above 30,000 I'm not noticing much difference unless you want studio-quality audio.

[screenshot: generation times at different index sizes]
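For intuition on why those timings scale roughly linearly, here is a self-contained benchmark in the same spirit. Synthetic data only: RVC's real index files are FAISS indexes over speech features, and the dimension below is just a placeholder.

import time
import numpy as np
import faiss  # pip install faiss-cpu

d = 256                                            # placeholder feature dimension
query = np.random.rand(500, d).astype("float32")   # pretend: 500 audio frames

for rows in (20_000, 40_000, 60_000, 76_000):
    xb = np.random.rand(rows, d).astype("float32")
    index = faiss.IndexFlatL2(d)
    index.add(xb)
    t0 = time.perf_counter()
    index.search(query, 8)   # top-8 nearest training vectors per frame
    print(f"{rows:>6} rows searched: {time.perf_counter() - t0:.3f}s")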

ibrah3m commented 4 weeks ago

Are there any updates regarding this, or a workaround like XTTS-RVC?

Couldn't we make a plugin that pipelines into Applio (RVC)? They did it with ElevenLabs by using the API to generate the TTS and then running inference with Applio RVC.

Is there a Discord server for this project? I really like it and hope to find a workaround ASAP.

Also, I couldn't really finetune on my audio; the progress wasn't moving.

gshawn3 commented 4 weeks ago

@ibrah3m Indeed there are. RVC integration has been implemented in the upcoming AllTalk TTS V2. I had a chance to test it briefly, and it works great. Check out #245.