coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Assertion fails when using 16 wav samples #2588

Closed dennisvink closed 1 year ago

dennisvink commented 1 year ago

### Describe the bug

Using the steps from Tutorial_2_train_your_first_TTS_model.ipynb, I recorded a few .wav files (16 in total) and created the metadata.csv file. Training starts a run, appears to do some preliminary analysis, and then fails on an assertion without a clear error message.

I'm at a loss. I can't find anything wrong with my (small) dataset or its metadata. Any pointers?

training.py looks like this:

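```python
import os

# NOTE: the first lines of the script were cut off in the post. The import and
# the `dataset_config = BaseDatasetConfig(` opener below are an assumed
# reconstruction based on the tutorial notebook.
from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Dataset definition: LJSpeech-style metadata.csv located under ./data
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path="data"
)

# Directory for checkpoints, logs, and the phoneme cache
output_path = "/home/voice/output/"
if not os.path.exists(output_path):
    os.makedirs(output_path)

# GlowTTSConfig: all model related values for training, validating and testing.
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    eval_split_size=0.0625,
    num_loader_workers=2,
    num_eval_loader_workers=2,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=100,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    save_step=1000,
)
```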

Backtrace is as follows:

```
/home/voice/.local/lib/python3.8/site-packages/librosa/core/spectrum.py:256: UserWarning: n_fft=1024 is too large for input signal of length=2
  warnings.warn(
 ! Run is removed from /home/voice/output/run-May-04-2023_08+51PM-0000000
Traceback (most recent call last):
  File "/home/voice/.local/lib/python3.8/site-packages/trainer/trainer.py", line 1591, in fit
    self._fit()
  File "/home/voice/.local/lib/python3.8/site-packages/trainer/trainer.py", line 1544, in _fit
    self.train_epoch()
  File "/home/voice/.local/lib/python3.8/site-packages/trainer/trainer.py", line 1308, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/home/voice/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/voice/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/voice/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/voice/.local/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/voice/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/voice/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/voice/.local/lib/python3.8/site-packages/TTS/tts/datasets/dataset.py", line 464, in collate_fn
    mel = prepare_tensor(mel, self.outputs_per_step)
  File "/home/voice/.local/lib/python3.8/site-packages/TTS/tts/utils/data.py", line 29, in prepare_tensor
    return np.stack([_pad_tensor(x, pad_len) for x in inputs])
  File "/home/voice/.local/lib/python3.8/site-packages/TTS/tts/utils/data.py", line 29, in <listcomp>
    return np.stack([_pad_tensor(x, pad_len) for x in inputs])
  File "/home/voice/.local/lib/python3.8/site-packages/TTS/tts/utils/data.py", line 20, in _pad_tensor
    assert x.ndim == 2
AssertionError
```

### To Reproduce

Record 16 .wavs
Use the config above

Go through the tutorial notebook

Use this metadata file:

```
LJ001-0001|Click the red Star Record button above to start recording.|Click the red Star Record button above to start recording.
LJ001-0002|While recording, you can pause and resume recording by clicking the appropiate button.|While recording, you can pause and resume recording by clicking the appropiate button.
LJ001-0003|When you are finished recording, click the Stop Recording button.|When you are finished recording, click the Stop Recording button.
LJ001-0004|You can save recoridng sound to your computer, or you can choose to cut and edit sound.|You can save recoridng sound to your computer, or you can choose to cut and edit sound.
LJ001-0005|If you choose to edit sound, go to the editing page.|If you choose to edit sound, go to the editing page.
LJ001-0006|After the modification is completed, you can save to the computer.|After the modification is completed, you can save to the computer.
LJ001-0007|The saved format can be MP3, WAV, OGG etcetera.|The saved format can be MP3, WAV, OGG etcetera.
LJ001-0008|Features include start recording, pause recording, resume recording, stop recording and real-time display of recording time, waveform, data size and other information.|Features include start recording, pause recording, resume recording, stop recording and real-time display of recording time, waveform, data size and other information.
LJ001-0009|Based on the standard interface of bootstrap recording can be done in 3 easy steps.|Based on the standard interface of bootstrap recording can be done in 3 easy steps.
LJ001-0010|There are no complicated settings and options so click with the mouse to complete.|There are no complicated settings and options so click with the mouse to complete.
LJ001-0011|The combination of recording and editing integrates the functions of recording and editing.|The combination of recording and editing integrates the functions of recording and editing.
LJ001-0012|Your computer device needs a microphone and sound card.|Your computer device needs a microphone and sound card.
LJ001-0013|This program can be used under any operating system, including Windows, Mac, Linux, etcetera.|This program can be used under any operating system, including Windows, Mac, Linux, etcetera.
LJ001-0014|You need to allow your browser to use microphone device.|You need to allow your browser to use microphone device.
LJ001-0015|HTML is the latest technical standard for web browsers and it supports input, processing, and saving of audio directly in the browser.|HTML is the latest technical standard for web browsers and it supports input, processing, and saving of audio directly in the browser.
LJ001-0016|This program provides complete editing functions that include: cut, fade in, fade out, change volume, and many other things.|This program provides complete editing functions that include: cut, fade in, fade out, change volume, and many other things.
```

Run training.py

### Expected behavior

Completion, no assertion

### Logs

```shell
See bug description
```

### Environment

```shell
{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.0+cu117",
        "TTS": "0.13.3",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.10",
        "version": "#37~20.04.1-Ubuntu SMP Fri Mar 17 11:39:30 UTC 2023"
    }
}
```

### Additional context

No response

erogol commented 1 year ago

You have multi-channel audio in the dataset. Convert the files to mono.
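
For anyone hitting the same assertion: the `length=2` in the librosa warning is consistent with a stereo clip being read as an array of shape `(samples, 2)`, which would then produce a spectrogram with an extra dimension and trip `assert x.ndim == 2`. That reading is an inference from the log, not something confirmed in this thread. Below is a minimal sketch of one way to find and convert such files using only the Python standard library. The function names are hypothetical helpers (not part of TTS), and the sketch assumes uncompressed 16-bit PCM WAV files.

```python
import array
import sys
import wave
from pathlib import Path


def channel_count(path):
    """Return the number of channels stored in a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnchannels()


def find_multichannel(wav_dir):
    """List the .wav files in wav_dir that have more than one channel."""
    return sorted(p for p in Path(wav_dir).glob("*.wav") if channel_count(p) > 1)


def convert_to_mono(src, dst):
    """Write a mono copy of a 16-bit PCM WAV file by averaging its channels."""
    with wave.open(str(src), "rb") as wav:
        n_channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        frame_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Assumption of this sketch: 2 bytes per sample (16-bit PCM).
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved per frame (L, R, L, R, ...); average each frame.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(str(dst), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(frame_rate)
        out.writeframes(mono.tobytes())
```

A possible way to use it, assuming the recordings live in `data/wavs` (the layout the `ljspeech` formatter expects under `path="data"`): call `find_multichannel("data/wavs")`, then run `convert_to_mono(p, some_other_dir / p.name)` for each returned path and point the dataset at the converted copies. Writing to a separate directory keeps the original recordings intact. Averaging the channels is one simple downmix choice; tools such as ffmpeg or sox can do the same job.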