152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

weird input voice sample treatment #7

Closed hesz94 closed 1 year ago

hesz94 commented 1 year ago

1) According to the readme, the voice samples for voice cloning HAVE TO be at 22.05kHz, but the first thing done to them after loading is resampling to 24kHz. Since we're resampling anyway, we should probably accept input at any sampling rate.

2) After the voice samples are read and resampled, only the first 102400 samples (~4.27s at a 24kHz sampling rate) are used to generate the latents. As far as I'm aware, nothing stops us from either taking more samples, or looping the process over the existing samples and averaging the latents for the specific voice. I tried both approaches locally and they seemed to increase voice stability (which makes sense, since such averaged latents should be more representative of a voice profile).
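The "loop over the existing samples and average the latents" idea can be sketched roughly as below. This is an illustration only, not the code from any fork or PR: `split_into_chunks`, `average_latents`, and the `encode` callable are hypothetical names, `encode` stands in for whatever turns one 102400-sample chunk into a latent vector, and zero-padding the last chunk is an assumption (dropping a short tail would be another reasonable choice).

```python
from typing import Callable, List, Sequence

CHUNK_SIZE = 102400   # samples per chunk, the figure quoted in the issue
SAMPLE_RATE = 24000   # Hz, the rate the samples are resampled to

# 102400 / 24000 is roughly 4.27 seconds of audio per chunk.
CHUNK_SECONDS = CHUNK_SIZE / SAMPLE_RATE


def split_into_chunks(samples: Sequence[float],
                      chunk_size: int = CHUNK_SIZE,
                      pad_value: float = 0.0) -> List[List[float]]:
    """Split a clip into fixed-size chunks, zero-padding the final one."""
    chunks = []
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])
        # Pad the tail so every chunk has the length the encoder expects.
        chunk.extend([pad_value] * (chunk_size - len(chunk)))
        chunks.append(chunk)
    return chunks


def average_latents(samples: Sequence[float],
                    encode: Callable[[List[float]], List[float]],
                    chunk_size: int = CHUNK_SIZE) -> List[float]:
    """Encode every chunk of the clip and return the element-wise mean latent."""
    chunks = split_into_chunks(samples, chunk_size)
    if not chunks:
        raise ValueError("no audio samples to encode")
    latents = [encode(chunk) for chunk in chunks]
    dim = len(latents[0])
    return [sum(latent[i] for latent in latents) / len(latents)
            for i in range(dim)]
```

With a real model, `encode` would wrap the per-chunk latent computation, and the averaged vector would replace the latent that is currently computed from the first chunk alone.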

152334H commented 1 year ago

  1. the specific line was created at this commit, and subsequently never changed again. It shouldn't be hard for me to write code to read the sampling rate and adjust accordingly
  2. That's extremely interesting; I have no idea how I'd go about doing that. If you're up for it, I'd gladly accept a PR, otherwise I can try to take a stab at it blindly myself

hesz94 commented 1 year ago

Sure thing, I'll post it tomorrow

152334H commented 1 year ago

Redacted comment, found the problem.

tbh your fork seems to be doing well, i'm not sure if i ought to continue rather than archiving this and letting you do your thing

pepinu commented 1 year ago

as for 1. if tortoise was trained on LJSpeech dataset, it may be an artefact of that: LJSpeech Dataset - all samples within the dataset are prepared as 22.05kHz

nevertheless I've asked the author about this here

pepinu commented 1 year ago

https://github.com/neonbjb/tortoise-tts/issues/296#issuecomment-1420295707

can be closed

hesz94 commented 1 year ago

Added the expanded latent loading/averaging in #13

152334H commented 1 year ago

> neonbjb#296 (comment)
>
> can be closed

What I think @hesz94 meant in his first point was:

  1. input wav files are first converted to 22.05kHz: load_voice being called from load_voices produces voice_samples
  2. voice_samples are then used by tts() in get_conditioning_latents() at two lines: here, where they're required to be 22.05kHz, and here, where every sample is reconverted to 24kHz. This probably degrades the quality of the voice samples passed to the diffuser, because the inputs are going from original -> 22.05 -> 24.

so what i ought to do is accept voice samples of any sampling rate, and convert them directly to the rates get_conditioning_latents needs (22.05kHz and 24kHz), which I have implemented in this commit.
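A rough sketch of that approach, for illustration only (it is not the code in the commit): each target rate is derived straight from the original audio, so nothing goes through the original -> 22.05 -> 24 chain. `resample_linear` and `prepare_conditioning_inputs` are hypothetical names, and plain linear interpolation is used here only to keep the example dependency-free; a real implementation would use a proper band-limited resampler from an audio library.

```python
from typing import Dict, List, Sequence

# The two rates mentioned in the thread for get_conditioning_latents.
TARGET_RATES = (22050, 24000)


def resample_linear(samples: Sequence[float],
                    src_rate: int,
                    dst_rate: int) -> List[float]:
    """Resample by linear interpolation (a stand-in for a real resampler)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def prepare_conditioning_inputs(samples: Sequence[float],
                                src_rate: int) -> Dict[int, List[float]]:
    """Return one copy of the clip per target rate, each resampled
    directly from the original rather than from another resampled copy."""
    return {rate: resample_linear(samples, src_rate, rate)
            for rate in TARGET_RATES}
```

A caller would read the sampling rate from the input file, pass it as `src_rate`, and hand the 22050 and 24000 entries to the parts of the pipeline that expect those rates.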

If any of this sounds wrong, please reopen the issue.