152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

Expanded latents generation #13

Closed hesz94 closed 1 year ago

hesz94 commented 1 year ago

Originally, latents were generated using only the first 4.2(6) s of each voice sample file. This PR retains that option and adds two alternatives that use almost the entire file, split into chunks (chunk_count = sample_duration // 4.2(6) s, which discards the last, partial chunk):

1) The chunk mel conditioning matrices/vectors are averaged over each voice sample before being passed to latent generation.
2) All chunk mel conditioning matrices/vectors are passed to latent generation.

In effect, method 1 gives equal weight to each voice sample, whereas method 2 takes potentially varied sample durations into account, since a longer sample contributes more chunks.
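To make the difference between the three modes concrete, here is a minimal sketch in plain Python. It is not the PR's code: the function names, the mode strings, and `fake_conditioning` (a stand-in for the real mel conditioning step) are hypothetical, and the chunk size of 102400 samples at 24 kHz is just one assumed pair of values that works out to 4.2(6) s.

```python
# Sketch only: all names and constants are illustrative assumptions,
# not taken from the PR.
CHUNK_SAMPLES = 102_400  # assumed chunk size in samples
SAMPLE_RATE = 24_000     # assumed sample rate; 102400 / 24000 = 4.2(6) s


def split_into_chunks(samples, chunk_size=CHUNK_SAMPLES):
    """Split a sample into full chunks, discarding the trailing partial chunk."""
    chunk_count = len(samples) // chunk_size
    return [samples[i * chunk_size:(i + 1) * chunk_size] for i in range(chunk_count)]


def fake_conditioning(chunk):
    """Hypothetical stand-in for mel conditioning: a 1-element 'vector' (the mean)."""
    return [sum(chunk) / len(chunk)]


def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def conditioning_inputs(voice_samples, mode, chunk_size=CHUNK_SAMPLES):
    """Collect the conditioning vectors that would go to latent generation.

    mode "first":   only the first chunk of each sample (original behaviour)
    mode "average": one averaged vector per sample (method 1)
    mode "all":     every full chunk of every sample (method 2)
    """
    if mode not in ("first", "average", "all"):
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for samples in voice_samples:
        if mode == "first":
            # Simplification: no padding of samples shorter than one chunk.
            result.append(fake_conditioning(samples[:chunk_size]))
            continue
        conds = [fake_conditioning(c) for c in split_into_chunks(samples, chunk_size)]
        if not conds:
            continue  # sample shorter than one full chunk: nothing to use
        if mode == "average":
            result.append(average_vectors(conds))
        else:
            result.extend(conds)
    return result
```

With a toy chunk size of 4, a 10-element sample yields two full chunks and a 4-element sample yields one. Mode `"average"` then returns one vector per sample, while mode `"all"` returns three vectors, two of which come from the longer sample.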

With that in mind, the previously recommended sample length of about 10 seconds (which was confusing anyway, since only the first 4.2 s were used) no longer applies: your samples can be as long as you like and will be used in their entirety. Bear in mind that longer samples mean longer latent generation, but it is still a quick process, and the latents can be saved to avoid regenerating them for every prompt.
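Saving latents so they are not regenerated for every prompt can be sketched as a simple load-or-compute cache. This is a generic illustration rather than the repository's actual mechanism: `load_or_compute_latents`, the cache path, and the `compute_latents` callable are hypothetical, and `pickle` is used here only to keep the example dependency-free.

```python
import pickle
from pathlib import Path


def load_or_compute_latents(cache_path, compute_latents):
    """Return cached latents if present; otherwise compute them and save them.

    cache_path:      where the latents are stored (hypothetical location)
    compute_latents: zero-argument callable that runs the slow latent generation
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    latents = compute_latents()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(latents, f)
    return latents
```

The first call for a given voice pays the generation cost; later calls only read the file. The cache has to be deleted or rebuilt whenever the voice samples or the chosen latent mode change.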