Description:
I've created my own model for ai-voice-cloning, which I previously used successfully with MRQ's repository (and this one). I was trying to find the right balance between computing time and output quality, but encountered some anomalies that I can't explain.
Steps to Reproduce:
Trained a model using my custom dataset on MRQ's repository.
Tested the updated version of the repository (including hifigan).
Conducted several tests with different settings.
Observed inconsistent results and anomalies.
Observations:
Initial Success: In my tests, I achieved satisfactory results.
Inconsistent Quality: When reverting to the same settings that initially worked, the results were consistently worse.
Varying Generation Times: Compared to before, low iteration counts now generate faster and high iteration counts take longer than expected. With 400 iterations, generation now takes 50% longer (150 seconds now vs 100 seconds before), but with only 50 iterations it takes about 70% less time (30 seconds now vs 110 seconds before).
Model Effectiveness: Effective results were observed with the normal autoregressive model, but using my custom model did not seem to make a significant difference.
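To make the timing figures above easy to check, here is a small Python sketch of the arithmetic. The helper name percent_change is my own; the inputs are the rounded timings quoted above plus the exact "time" values logged for Audio 2 and Audio 3 below.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# 400 iterations: roughly 100 s before, 150 s now
print(round(percent_change(100, 150), 1))         # 50.0
# 50 iterations: roughly 110 s before, 30 s now
print(round(percent_change(110, 30), 1))          # -72.7
# 50 iterations, using the logged "time" values of Audio 2 and Audio 3
print(round(percent_change(113.622, 33.843), 1))  # -70.2
```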
Audio Examples:
https://voca.ro/152YP9PCNgYl - Audio 1: Original voice to be recreated
https://voca.ro/1cuIxoocxzuT - Audio 2: Autoregressive example 1
{ "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.", "delimiter": "\\n", "emotion": "None", "prompt": "", "voice": "klaasje", "mic_audio": null, "voice_latents_chunks": 2, "candidates": 1, "seed": null, "num_autoregressive_samples": 4, "diffusion_iterations": 50, "temperature": 0.2, "diffusion_sampler": "DDIM", "breathing_room": 8, "cvvp_weight": 0, "top_p": 0.8, "diffusion_temperature": 1, "length_penalty": 1, "repetition_penalty": 2, "cond_free_k": 2, "experimentals": [ "Conditioning-Free" ], "voice_latents_original_ar": false, "voice_latents_original_diffusion": false, "time": "113.622", "datetime": "2024-06-12T11:31:40.068903", "model": "./models/tortoise/autoregressive.pth", "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88" }
After testing, reverting to old settings:
https://voca.ro/1o3QG7V314OF - Audio 3: Autoregressive example 3
{ "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.", "delimiter": "\\n", "emotion": "None", "prompt": "", "voice": "klaasje", "mic_audio": null, "voice_latents_chunks": 2, "candidates": 1, "seed": 1718264120, "num_autoregressive_samples": 4, "diffusion_iterations": 50, "temperature": 0.2, "diffusion_sampler": "DDIM", "breathing_room": 8, "cvvp_weight": 0, "top_p": 0.8, "diffusion_temperature": 1, "length_penalty": 1, "repetition_penalty": 2, "cond_free_k": 2, "experimentals": [ "Conditioning-Free" ], "voice_latents_original_ar": false, "voice_latents_original_diffusion": false, "time": "33.843", "datetime": "2024-06-13T07:35:54.454854", "model": "./models/tortoise/autoregressive.pth", "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88" }
https://voca.ro/1c7jEQQ7cxYy - Audio 4: Own model example 1
{ "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.", "delimiter": "\\n", "emotion": "None", "prompt": "", "voice": "klaasje", "mic_audio": null, "voice_latents_chunks": 2, "candidates": 1, "seed": 1718265643, "num_autoregressive_samples": 4, "diffusion_iterations": 50, "temperature": 0.2, "diffusion_sampler": "DDIM", "breathing_room": 8, "cvvp_weight": 0, "top_p": 0.8, "diffusion_temperature": 1, "length_penalty": 1, "repetition_penalty": 2, "cond_free_k": 2, "experimentals": [ "Conditioning-Free" ], "voice_latents_original_ar": false, "voice_latents_original_diffusion": false, "time": "40.140", "datetime": "2024-06-13T08:01:19.384412", "model": "./models/finetunes/420_gpt.pth", "model_hash": "fa2abaecea4eeb28d2a460cd83fedb4d" }
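To compare runs, the logged settings can be diffed key by key. The following is a minimal Python sketch; the helper name diff_settings is my own, and the two dictionaries are abbreviated copies of the Audio 2 and Audio 3 settings above (only a subset of the keys is reproduced).

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated copy of the Audio 2 settings (JSON null becomes None)
audio2 = {
    "seed": None,
    "num_autoregressive_samples": 4,
    "diffusion_iterations": 50,
    "diffusion_sampler": "DDIM",
    "time": "113.622",
    "datetime": "2024-06-12T11:31:40.068903",
    "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88",
}

# Abbreviated copy of the Audio 3 settings
audio3 = {
    "seed": 1718264120,
    "num_autoregressive_samples": 4,
    "diffusion_iterations": 50,
    "diffusion_sampler": "DDIM",
    "time": "33.843",
    "datetime": "2024-06-13T07:35:54.454854",
    "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88",
}

changed = diff_settings(audio2, audio3)
for key, (before, after) in changed.items():
    print(f"{key}: {before!r} -> {after!r}")
```

For these two runs, the only logged fields that differ are seed, time and datetime; the generation parameters and the model hash are identical.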
Could you please help identify any possible issues that might be causing these inconsistencies? Have I potentially missed anything really obvious? Thank you, it's greatly appreciated.