152334H / DL-Art-School

TorToiSe fine-tuning with DLAS
GNU Affero General Public License v3.0

Clipped ending or doubled ending #61

Closed: demonauthor closed this issue 1 year ago

demonauthor commented 1 year ago

I'm having awesome results with fine-tuning datasets, but I am running into a couple issues:

  1. If I enter more than one sentence of text, the audio of the second line repeats itself. It's not a big deal and easily editable, but is there a setting/config change that might fix this?
  2. The last word of the dialogue is clipped short. The repetition above actually helps, since I can delete the repeated phrase, but when it's just one sentence there's nothing to trim from. Is there a setting that can be changed to fix this?
xenotropic commented 1 year ago

1. See issue 237 in the original tortoise repo; there are params you can try (I haven't had time to experiment yet).
2. I've also noticed that; it seems to be a failing of tortoise generally, and I'm not aware of any possible fixes.

I'd be interested to hear what you've done to get awesome results: what dataset size did you have, how many epochs, what other hyperparameters? I have not yet managed to get awesome results.

demonauthor commented 1 year ago

> 1. See issue 237 in the original tortoise repo; there are params you can try (I haven't had time to experiment yet).
> 2. I've also noticed that; it seems to be a failing of tortoise generally, and I'm not aware of any possible fixes.
>
> I'd be interested to hear what you've done to get awesome results: what dataset size did you have, how many epochs, what other hyperparameters? I have not yet managed to get awesome results.

I have been looking at the AI-Voice-Cloning setup. There is a setting in there called "pause time" which gets rid of the clipped last word. That project seems to be at a much earlier stage of development and is very slow, but it's coming along nicely. I'm still getting much better results with DLAS and Ozen.

The test I did was trying to clone Vincent Price's voice. I used three different audiobook readings he did. They are fairly clean and his speech is consistent. That yielded about 500 clips, using Ozen to create the dataset. Then I did 200 steps in DLAS and clicked the Auto Settings button. I have a separate set of clips I made for the voices folder that I can swap in to get different types of readings (specific emotions, rasp, voice pitch). It doesn't always work, but most of the time I get great results. I'm going to try 300 steps and see if the quality is any cleaner.

I've done 5 other tests with similarly recognizable voices (Walken, Jeff Goldblum, Louise from Bob's Burgers...) with equivalent results. The cadence isn't always right, but the tone, pronunciation, etc. are great. Instantly recognizable. Now I'm trying to combine voices to create specific sounds. I'm using this to do some preproduction for a film proof of concept, and it's working nicely. Sort of like a digital table read.

xenotropic commented 1 year ago

I found hyperparameters that worked better for me, see #1. Mostly reducing the learning rate for smaller, single-speaker datasets.
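If you want to try the same change, here's a minimal sketch of lowering the learning rate in a DLAS training YAML before launching training. The config path and the exact key layout (`steps -> gpt_train -> optimizer_params -> lr`) are assumptions based on the example configs, so check them against your own file; hand-editing the YAML works just as well.

```python
# Hedged sketch: drop the learning rate in a DLAS fine-tuning config.
# The path and key names below are assumptions; verify them against your YAML.
import yaml

CONFIG_PATH = "experiments/my_finetune.yml"  # hypothetical config file

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

# Assumed layout: steps -> gpt_train -> optimizer_params -> lr
cfg["steps"]["gpt_train"]["optimizer_params"]["lr"] = 1e-5  # smaller lr for a small, single-speaker dataset

# Note: round-tripping with safe_dump drops any comments in the original file.
with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f)
```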

For the repeats, I experimented with running each of length_penalty and repetition_penalty up to 1024, with zero difference (it's super helpful to have those exposed as script parameters in this repo).

It is oddly regular in that it always seems to affect the last element of a list: text of the form "blah blah, X, Y, and Z" gets rendered as "blah blah, X, Y, and Z, and Z". If anyone has thoughts on what to experiment with to try to eliminate that, I'm open to ideas.
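For context, this is roughly how I'm setting those penalties when calling tortoise directly; a minimal sketch assuming the stock tortoise-tts API (the voice name is a placeholder, and the keyword names and defaults are worth double-checking against your installed version):

```python
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()
# "myvoice" is a hypothetical folder under tortoise's voices/ directory.
voice_samples, conditioning_latents = load_voices(["myvoice"])

audio = tts.tts(
    "Blah blah, X, Y, and Z.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    repetition_penalty=8.0,  # I pushed this as high as 1024 with no change in the doubling
    length_penalty=8.0,      # same story for this one
)
```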

demonauthor commented 1 year ago

The best way I have found to fix the clipping is to add a space and then a single throwaway character to the end of the phrase, then edit that final character out of the audio if it ends up being pronounced. As for the doubling of the final line, breaking the text into shorter phrases fixes it (a rough sketch of both workarounds is below). Shorter phrases also yield better "performances" overall.
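Here's how I script both workarounds, purely as an illustration; the pad character and the sentence-splitting rule are just what I happen to use, not anything built into tortoise:

```python
import re

def prep_phrases(text, pad=" ."):
    """Split a prompt into short phrases and pad each with a throwaway character."""
    # Break on sentence-ending punctuation so each phrase stays short.
    phrases = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    # The trailing pad keeps the last real word from being clipped; edit the
    # pad out of the audio afterwards if it ends up voiced.
    return [p + pad for p in phrases]

for phrase in prep_phrases("First sentence of the scene. Second sentence, with a list, X, Y, and Z."):
    print(phrase)  # generate each phrase separately, then join the clips
```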