JarodMica / ai-voice-cloning

GNU General Public License v3.0
Inconsistent Results #125

Open Inthemorningsir opened 5 months ago

Inthemorningsir commented 5 months ago

Description: I trained my own model for ai-voice-cloning and previously used it successfully with both MRQ's repository and this one. While looking for the right balance between computing time and output quality, I ran into some anomalies that I can't explain.

Steps to Reproduce:

  1. Trained a model using my custom dataset on MRQ's repository.
  2. Tested the updated version of the repository (including hifigan).
  3. Conducted several tests with different settings.
  4. Observed inconsistent results and anomalies.

Observations:

  1. Initial Success: My first tests produced satisfactory results.
  2. Inconsistent Quality: After reverting to the same settings that initially worked, the results were consistently worse.
  3. Varying Generation Times: Low iteration counts now generate faster than before, while high iteration counts take longer than expected. At 400 iterations, generation takes 50% longer (150 seconds now vs 100 before). At only 50 iterations, it takes about 70% less time (30 seconds now vs 110 before).
  4. Model Effectiveness: The default autoregressive model gave effective results, but switching to my custom model did not seem to make a significant difference.
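For reference, the percentages quoted for the generation times can be checked with a few lines of Python. This is only a sketch of the arithmetic; `percent_change` is a helper name made up for this example, and the numbers are the rounded timings from the observations.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rounded timings from the observations, in seconds: iterations -> (before, now).
timings = {
    400: (100.0, 150.0),
    50: (110.0, 30.0),
}

changes = {iters: percent_change(b, a) for iters, (b, a) in timings.items()}

for iters, change in changes.items():
    print(f"{iters} iterations: {change:+.1f}%")
```

The 400-iteration case comes out at exactly +50%, and the 50-iteration case at roughly -73%, which is in line with the "70% less" estimate.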

Audio Examples:

https://voca.ro/152YP9PCNgYl - Audio 1: Original voice to be recreated

https://voca.ro/1cuIxoocxzuT - Audio 2: Autoregressive example 1

    {
      "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.",
      "delimiter": "\\n",
      "emotion": "None",
      "prompt": "",
      "voice": "klaasje",
      "mic_audio": null,
      "voice_latents_chunks": 2,
      "candidates": 1,
      "seed": null,
      "num_autoregressive_samples": 4,
      "diffusion_iterations": 50,
      "temperature": 0.2,
      "diffusion_sampler": "DDIM",
      "breathing_room": 8,
      "cvvp_weight": 0,
      "top_p": 0.8,
      "diffusion_temperature": 1,
      "length_penalty": 1,
      "repetition_penalty": 2,
      "cond_free_k": 2,
      "experimentals": [ "Conditioning-Free" ],
      "voice_latents_original_ar": false,
      "voice_latents_original_diffusion": false,
      "time": "113.622",
      "datetime": "2024-06-12T11:31:40.068903",
      "model": "./models/tortoise/autoregressive.pth",
      "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88"
    }

After testing, reverting to old settings:

https://voca.ro/1o3QG7V314OF - Audio 3: Autoregressive example 3

    {
      "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.",
      "delimiter": "\\n",
      "emotion": "None",
      "prompt": "",
      "voice": "klaasje",
      "mic_audio": null,
      "voice_latents_chunks": 2,
      "candidates": 1,
      "seed": 1718264120,
      "num_autoregressive_samples": 4,
      "diffusion_iterations": 50,
      "temperature": 0.2,
      "diffusion_sampler": "DDIM",
      "breathing_room": 8,
      "cvvp_weight": 0,
      "top_p": 0.8,
      "diffusion_temperature": 1,
      "length_penalty": 1,
      "repetition_penalty": 2,
      "cond_free_k": 2,
      "experimentals": [ "Conditioning-Free" ],
      "voice_latents_original_ar": false,
      "voice_latents_original_diffusion": false,
      "time": "33.843",
      "datetime": "2024-06-13T07:35:54.454854",
      "model": "./models/tortoise/autoregressive.pth",
      "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88"
    }

https://voca.ro/1c7jEQQ7cxYy - Audio 4: Own model example 1

    {
      "text": "While the emphasis on individual psychological conditioning was prevalent during the mid-20th century, what shapes how people feel is the society around them.",
      "delimiter": "\\n",
      "emotion": "None",
      "prompt": "",
      "voice": "klaasje",
      "mic_audio": null,
      "voice_latents_chunks": 2,
      "candidates": 1,
      "seed": 1718265643,
      "num_autoregressive_samples": 4,
      "diffusion_iterations": 50,
      "temperature": 0.2,
      "diffusion_sampler": "DDIM",
      "breathing_room": 8,
      "cvvp_weight": 0,
      "top_p": 0.8,
      "diffusion_temperature": 1,
      "length_penalty": 1,
      "repetition_penalty": 2,
      "cond_free_k": 2,
      "experimentals": [ "Conditioning-Free" ],
      "voice_latents_original_ar": false,
      "voice_latents_original_diffusion": false,
      "time": "40.140",
      "datetime": "2024-06-13T08:01:19.384412",
      "model": "./models/finetunes/420_gpt.pth",
      "model_hash": "fa2abaecea4eeb28d2a460cd83fedb4d"
    }
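To see at a glance which settings actually changed between the runs above, the settings can be diffed field by field. The sketch below is not part of the repository: `diff_settings` is a hypothetical helper written for this example, and only a subset of the fields from the settings above is copied in. The `time` and `datetime` fields are ignored because they always differ.

```python
def diff_settings(a: dict, b: dict, ignore=("time", "datetime")) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k), b.get(k))
        for k in keys
        if k not in ignore and a.get(k) != b.get(k)
    }


# Fields shared by all three runs (subset copied from the settings above).
common = {
    "voice": "klaasje",
    "num_autoregressive_samples": 4,
    "diffusion_iterations": 50,
    "temperature": 0.2,
    "diffusion_sampler": "DDIM",
    "top_p": 0.8,
    "cond_free_k": 2,
}
default_model = {
    "model": "./models/tortoise/autoregressive.pth",
    "model_hash": "d1f7923277a74e6f1a293fc6c25ecb88",
}
custom_model = {
    "model": "./models/finetunes/420_gpt.pth",
    "model_hash": "fa2abaecea4eeb28d2a460cd83fedb4d",
}

audio_2 = {**common, **default_model, "seed": None, "time": "113.622"}
audio_3 = {**common, **default_model, "seed": 1718264120, "time": "33.843"}
audio_4 = {**common, **custom_model, "seed": 1718265643, "time": "40.140"}

print("Audio 2 vs Audio 3:", diff_settings(audio_2, audio_3))
print("Audio 3 vs Audio 4:", diff_settings(audio_3, audio_4))
```

On this subset, Audio 2 and Audio 3 differ only in `seed`, while Audio 3 and Audio 4 differ in `seed` as well as in the model. If the seed influences sampling (an assumption on my part), none of these comparisons is strictly like-for-like, so rerunning with one fixed seed might help separate the effect of the model from run-to-run variation.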

Could you please help identify any possible issues that might be causing these inconsistencies? Have I potentially missed anything really obvious? Thank you, it's greatly appreciated.