Deep Clone and generation longer than 12s

Camb-ai / MARS5-TTS

MARS5 speech model (TTS) from CAMB.AI

https://www.camb.ai

GNU Affero General Public License v3.0


Deep Clone and generation longer than 12s #24

Open skirdey opened 1 week ago

skirdey commented 1 week ago

Currently, when using deep cloning (and possibly without it), the model starts producing artifacts after 12 seconds of total new audio generation. Is this expected for the current model checkpoint, or does it need further troubleshooting?

akshhack commented 1 week ago

@skirdey adding a few updates in 2 days that should help with stability. Thanks

RF5 commented 2 days ago

Hi @skirdey, please check the latest checkpoint release; it should be quite a bit more stable for pronunciation when using deep clone. Hope that helps!

skirdey commented 1 day ago

@RF5 sadly the quality got quite a bit worse; now the audio comes with artifacts after just 5-6 seconds. output_mandatory_9_20240624_232539_0312d69b5cd84cb1877ddc100046cd83.zip

skirdey commented 1 day ago

Here is the code I use for generation

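import torch

# Load the MARS5 model and its inference config class via torch.hub
def load_mars5_model():
    # force_reload=True forces a fresh download of the repo code instead of using the torch.hub cache
    mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True, force_reload=True)
    return mars5, config_class

# Generate new audio using MARS5
# Note: deep cloning relies on the reference transcript, so pass ref_transcript when deep_clone=True
def generate_audio(mars5, config_class, text, reference_audio, ref_transcript=None, deep_clone=True):
    cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                      top_k=100, temperature=0.7, freq_penalty=3)
    # tts returns the autoregressive codes and the generated waveform; only the waveform is needed here
    ar_codes, output_audio = mars5.tts(text, reference_audio, ref_transcript, cfg=cfg)
    return output_audio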

With the latest checkpoint, the quality degradation consistently appears after 3-4 seconds across all attempts. With the previous checkpoint, the degradation started at around the 12-second mark.

The length of the original sample that is being used for deep clone is 10s.

RF5 commented 1 day ago

Hmm, this is a bit hard to debug, as we haven't seen such degradation in our evaluations. Could you send us the audio reference and text that are giving you trouble so we can look into it?

skirdey commented 1 day ago

Sure, I attached a reference archive. It has a shorter clip for cloning, the result of the cloning (degradation at 12s), and the text that the original speaker says. test.zip

thhung commented 7 hours ago

Yes, I can confirm. The quality varies a lot. When the audio is good, it does not follow the voice reference. When it follows the voice reference, the audio has many artifacts and words are not even pronounced properly. As for the code, I simply used your colab notebook.
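
Since the reports above tie the artifacts to the length of the generated audio, one possible workaround (not something the maintainers suggest in this thread) is to split long text into short sentence groups and generate each group separately. The sketch below is a minimal illustration under that assumption. `split_into_chunks`, `generate_long_audio`, `tts_fn`, and the `max_chars` value are hypothetical names and settings introduced here, not part of MARS5.

```python
import re


def split_into_chunks(text, max_chars=120):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    max_chars is an illustrative knob, not a MARS5 parameter.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def generate_long_audio(tts_fn, text, max_chars=120):
    """Call tts_fn once per chunk and return the outputs in order.

    tts_fn is a hypothetical callable taking one text chunk, for example
    a wrapper around the generate_audio function posted earlier in the thread.
    """
    return [tts_fn(chunk) for chunk in split_into_chunks(text, max_chars)]
```

With the code posted earlier, `tts_fn` could be something like `lambda t: generate_audio(mars5, config_class, t, reference_audio, ref_transcript)`, and the resulting pieces could then be joined (for instance with `torch.cat`, assuming each output is a 1D waveform tensor). Whether chunking actually avoids the artifacts reported here is untested.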