Open skirdey opened 1 week ago
@skirdey adding a few updates in 2 days that should help with stability. Thanks
Hi @skirdey, please check the latest checkpoint release; it should be quite a bit more stable for pronunciation when using deep clone. Hope that helps!
@RF5 sadly the quality got quite a bit worse; now the audio comes with artifacts after just 5-6 seconds. output_mandatory_9_20240624_232539_0312d69b5cd84cb1877ddc100046cd83.zip
Here is the code I use for generation
import torch

# Load the MARS5 model
def load_mars5_model():
    mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True, force_reload=True)
    return mars5, config_class

# Generate new audio using MARS5
def generate_audio(mars5, config_class, text, reference_audio, ref_transcript=None, deep_clone=True):
    # Sampling settings used for every generation attempt
    cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                       top_k=100, temperature=0.7, freq_penalty=3)
    ar_codes, output_audio = mars5.tts(text, reference_audio, ref_transcript, cfg=cfg)
    return output_audio
With the latest checkpoint, the quality degradation consistently starts after 3-4 seconds across all attempts. With the previous checkpoint, the degradation started at around the 12-second mark.
The length of the original sample that is being used for deep clone is 10s.
Hmm, this is a bit hard to debug, as we haven't seen such degradation in our evaluations. Could you send us the audio reference and text that are giving you trouble so we can look into it?
Sure, I've attached a reference archive. It contains a shorter clip for cloning, the result of the cloning (degradation at 12s), and the text that the original speaker says. test.zip
Yes, I can confirm. The quality varies a lot. When the audio is good, it does not follow the voice reference. When it follows the voice reference, the audio has many artifacts and does not even pronounce words properly. For the code, I simply used your Colab notebook.
Currently, when using deep cloning (and maybe without it), the model starts producing artifacts after 12 seconds of total newly generated audio. Is this expected for the current model checkpoint, or does it need further troubleshooting?
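While this is being looked into, one possible workaround is to keep each generation call short by splitting the input text into sentence-sized chunks. The snippet below is only a minimal sketch of that idea, not something confirmed by the maintainers: `split_text_for_tts` is a hypothetical helper, and the `words_per_second` speaking rate and the 10-second budget are assumptions chosen to stay under the 12-second mark reported above.

```python
import re

def split_text_for_tts(text, max_seconds=10.0, words_per_second=2.5):
    """Group sentences into chunks whose estimated spoken length stays under max_seconds.

    The duration estimate is a rough assumption (word count / words_per_second),
    not a measurement of what the model will actually produce.
    """
    max_words = max(1, int(max_seconds * words_per_second))
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the word budget
        if current and count + n_words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        # A single sentence longer than the budget is kept whole as its own chunk
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Each chunk could then be passed to a separate `mars5.tts` call with the same reference audio, and the resulting waveforms joined afterwards. Whether the voice stays consistent across chunks is untested here.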