152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

IndexError: too many indices for tensor of dimension 1 #73

Open DaveChini opened 1 year ago

DaveChini commented 1 year ago

I'm getting this error when trying to run the TTS API example:

#tort.py
import sys
sys.path.append('D:/AI/tortoise-tts-fast')
import tortoise.utils as utils
import tortoise.api as api

clips_paths = [
    "C:/Users/Dave/Desktop/sample_1.wav",
    "C:/Users/Dave/Desktop/sample_2.wav",
    "C:/Users/Dave/Desktop/sample_3.wav",
    "C:/Users/Dave/Desktop/sample_4.wav",
    "C:/Users/Dave/Desktop/sample_5.wav",
    "C:/Users/Dave/Desktop/sample_6.wav",
    "C:/Users/Dave/Desktop/sample_7.wav",
    "C:/Users/Dave/Desktop/sample_8.wav",
    "C:/Users/Dave/Desktop/sample_9.wav",
]
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

Error traceback

(tortoise) PS C:\Users\Dave\Desktop> python .\tort.py
D:\AI\tortoise-tts-fast\tortoise\utils\audio.py:19: WavFileWarning: Chunk (non-data) not understood, skipping it.
  sampling_rate, data = read(full_path)
mode 0
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\Dave\Desktop\.\tort.py:30 in <module>                                                   │
│                                                                                                  │
│   27 ]                                                                                           │
│   28 reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]                   │
│   29 tts = api.TextToSpeech()                                                                    │
│ ❱ 30 pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset=    │
│   31                                                                                             │
│                                                                                                  │
│ D:\AI\tortoise-tts-fast\tortoise\api.py:534 in tts_with_preset                                   │
│                                                                                                  │
│   531 │   │   }                                                                                  │
│   532 │   │   settings.update(presets[preset])                                                   │
│   533 │   │   settings.update(kwargs)  # allow overriding of preset settings with kwargs         │
│ ❱ 534 │   │   return self.tts(text, **settings)                                                  │
│   535 │                                                                                          │
│   536 │   def tts(                                                                               │
│   537 │   │   self,                                                                              │
│                                                                                                  │
│ D:\AI\tortoise-tts-fast\tortoise\api.py:631 in tts                                               │
│                                                                                                  │
│   628 │   │   │   │   diffusion_conditioning,                                                    │
│   629 │   │   │   │   auto_conds,                                                                │
│   630 │   │   │   │   _,                                                                         │
│ ❱ 631 │   │   │   ) = self.get_conditioning_latents(                                             │
│   632 │   │   │   │   voice_samples,                                                             │
│   633 │   │   │   │   return_mels=True,                                                          │
│   634 │   │   │   │   latent_averaging_mode=latent_averaging_mode,                               │
│                                                                                                  │
│ D:\AI\tortoise-tts-fast\tortoise\api.py:398 in get_conditioning_latents                          │
│                                                                                                  │
│   395 │   │   │                                                                                  │
│   396 │   │   │   auto_conds = []                                                                │
│   397 │   │   │   for ls in voice_samples:                                                       │
│ ❱ 398 │   │   │   │   auto_conds.append(format_conditioning(ls[0], device=self.device))          │
│   399 │   │   │   auto_conds = torch.stack(auto_conds, dim=1)                                    │
│   400 │   │   │   with self.temporary_cuda(self.autoregressive) as ar:                           │
│   401 │   │   │   │   auto_latent = ar.get_conditioning(auto_conds)                              │
│                                                                                                  │
│ D:\AI\tortoise-tts-fast\tortoise\api.py:78 in format_conditioning                                │
│                                                                                                  │
│    75 │   │   clip = F.pad(clip, pad=(0, abs(gap)))                                              │
│    76 │   elif gap > 0:                                                                          │
│    77 │   │   rand_start = random.randint(0, gap)                                                │
│ ❱  78 │   │   clip = clip[:, rand_start : rand_start + cond_length]                              │
│    79 │   mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0)                         │
│    80 │   return mel_clip.unsqueeze(0).to(device)                                                │
│    81                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: too many indices for tensor of dimension 1

Icemaster-Eric commented 1 year ago

I'm getting the same issue. Any ideas?

Icemaster-Eric commented 1 year ago

I managed to debug it; the issue is with the voice loader. You need to change line 79 of tortoise.utils.audio to return (audio.unsqueeze(0), audio.unsqueeze(0)). However, the results I'm getting are extremely low quality, and the voice cloning even switches to different voices in the middle of the generated output. This is probably the result of an error on my part somewhere, but I'm not sure how to fix it.
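
For anyone else hitting this, here is a minimal sketch of what the traceback suggests is going on. It assumes the loader originally returns a single tensor of shape (1, num_samples). In that case `ls[0]` in get_conditioning_latents strips the leading dimension, and the `clip[:, ...]` slice in format_conditioning fails on the resulting 1-D tensor. The `crop` helper and the sample counts below are made up for illustration and are not the library's code.

```python
import torch

# Stand-in for decoded audio samples (hypothetical length, silence only).
audio = torch.zeros(5000)

def crop(clip, start, length):
    # Same slicing pattern as the failing line in format_conditioning:
    # it needs a 2-D tensor shaped (channels, samples).
    return clip[:, start : start + length]

# ASSUMPTION: the unpatched loader returns one tensor of shape (1, N).
unpatched_sample = audio.unsqueeze(0)

# Mirrors `ls[0]` in get_conditioning_latents: indexing drops the leading
# dimension, leaving a 1-D tensor of shape (N,).
try:
    crop(unpatched_sample[0], 0, 1000)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# The change described above: return a tuple of (1, N) tensors instead,
# so that `[0]` selects a whole 2-D tensor rather than a row of it.
patched_sample = (audio.unsqueeze(0), audio.unsqueeze(0))
cropped = crop(patched_sample[0], 0, 1000)

print(error_message)
print(cropped.shape)
```

This only reproduces the shape mismatch. It does not say anything about the low output quality or the voice switching.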