coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Test sentence error when resuming training #2070

Closed str20tbl closed 1 year ago

str20tbl commented 1 year ago

Describe the bug

When using `TrainerArgs(continue_path="")` to resume training, the test sentences fail to generate, raising an unexpected-input error as shown in the logs below.

To be clear: the first run is from scratch. For the second run I add `continue_path`, and training proceeds fine until it tries to generate the test sentences.

I was getting the same error when training from scratch if I used `test_sentences = [[""],[""]]`, hence the ugly list of test sentences below.

To Reproduce

import os

from trainer import Trainer, TrainerArgs
from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.configs.shared_configs import CharactersConfig

voice_name = "benyw-de"
output_path = "/code/data/runs/"
phoneme_cache = os.path.join(output_path, "../" + voice_name, "phoneme_cache")
dataset_config = BaseDatasetConfig(
    formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "../" + voice_name)
)

audio_config = BaseAudioConfig(
    sample_rate=44100,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    preemphasis=0.0,
    ref_level_db=20,
    log_func="np.log",
    do_trim_silence=True,
    trim_db=60,
    mel_fmin=0,
    mel_fmax=None,
    spec_gain=1.0,
    signal_norm=True,
    resample=False,
    do_amp_to_db_linear=False,
    power=2,
)

character_config = CharactersConfig(
    characters="ABCDEFGHIJKLMNOPQRSTUVWXYZÁÀÂÄÉÊËÎÏÔÖÔÛŴŶŸabcdefghijklmnopqrstuvwyxzáàâäéêëîïôöôûŵŷÿ",
    phonemes="θˈetwomçʊplanhiːɡjɔnaðɛvnʌðɨrɨrɔɪblɑːsarɔɡlnəɨlɪuarvuɨdɨupˌɛrlʌʃəɨønxkɬfŋʒz",
    pad="_",
    eos="~",
    bos="^",
    punctuations="!'’(),-.:;? ",
)

config = VitsConfig(
    audio=audio_config,
    run_name=voice_name,
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=0,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="cy",
    phoneme_cache_path=phoneme_cache,
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    characters=character_config,
    test_sentences=[
        "It took me quite a long time to develop a voice and now that I have it, I'm not going to be quiet",
        "Gymrodd dipyn o amser i mi ddatblygu llais a nawr mae gen i, nid wyf yn mynd i fod yn dawel",
        "Bydd y trên nesaf am Aberystwyth yn gadael platfform 3 am gwater wedi dau o'r gloch",
        "The next train for Aberystwyth is leaving platform three at quarter past two o'clock",
        "PA FATH O ADNODDAU?" + "Mae ystod eang o adnoddau technoleg iaith Cymraeg ar gael o'r "
        + "Porth Technolegau Iaith, gan gynnwys adnoddau dadansoddi testunau Cymraeg, gwirio sillafu a "
        + "gramadeg Cymraeg, cyfieithu, testun i leferydd a llawer mwy. Yn ogystal mae cymorth ar ffurf "
        + "projectau enghreifftiol a thiwtorialau ar sut i ddefnyddio'r adnoddau.",
        "WHAT KIND OF RESOURCES? The Language Technologies Portal consists of a wide range of resources "
        + "for the Welsh language, including Welsh language text processing, spelling and grammar "
        + "checking, translation, text to speech and lots more. In addition there is support in the form "
        + "of tutorials and demo projects"
    ]
)

ap = AudioProcessor.init_from_config(config)

tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

model = Vits(config, ap, tokenizer, speaker_manager=None)

trainer = Trainer(
    TrainerArgs(
        continue_path="/code/data/runs/benyw-de-October-04-2022_03+15PM-0000000"
    ),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)

trainer.fit()

Expected behavior

To resume training

Logs

fatal: not a git repository (or any parent up to mount point /code)
 Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
  > Training Environment:
  | > Current device: 0
  | > Num. of GPUs: 1
  | > Num. of CPUs: 20
  | > Num. of Torch Threads: 10
  | > Torch seed: 54321
  | > Torch CUDNN: True
  | > Torch CUDNN deterministic: False
  | > Torch CUDNN benchmark: False
  > Restoring from checkpoint_370000.pth ...
  > Restoring Model...
  > Restoring Optimizer...
  > Restoring Scaler...
  > Model restored from step 370000

  > Model has 83044972 parameters
  > Restoring best loss from best_model_38077.pth ...
  > Starting with loaded last best loss 15.811285

  > EPOCH: 0/1000
  --> /code/data/runs/benyw-de-October-04-2022_03+15PM-0000000

  > TRAINING (2022-10-10 08:56:23) 
 /opt/venv/lib/python3.8/site-packages/torch/functional.py:572: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:659.)
   return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

    --> STEP: 24/377 -- GLOBAL_STEP: 370025
      | > loss_disc: 2.55044  (2.48995)
      | > loss_disc_real_0: 0.14567  (0.12905)
      | > loss_disc_real_1: 0.21341  (0.21678)
      | > loss_disc_real_2: 0.22256  (0.21309)
      | > loss_disc_real_3: 0.25495  (0.22890)
      | > loss_disc_real_4: 0.26215  (0.23993)
      | > loss_disc_real_5: 0.26839  (0.23704)
      | > loss_0: 2.55044  (2.48995)
      | > grad_norm_0: 71.16961  (27.26500)
      | > loss_gen: 2.27486  (2.29334)
      | > loss_kl: 1.51007  (1.61466)
      | > loss_feat: 5.67015  (6.71701)
      | > loss_mel: 16.80119  (17.38435)
      | > loss_duration: 2.46973  (2.48439)
      | > amp_scaler: 128.00000  (1477.33333)
      | > loss_1: 28.72600  (30.49375)
      | > grad_norm_1: 899.00659  (482.22906)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.55960  (0.62170)
      | > loader_time: 0.10720  (0.10090)

    --> STEP: 49/377 -- GLOBAL_STEP: 370050
      | > loss_disc: 2.40120  (2.51074)
      | > loss_disc_real_0: 0.12095  (0.12814)
      | > loss_disc_real_1: 0.23565  (0.21559)
      | > loss_disc_real_2: 0.23974  (0.21866)
      | > loss_disc_real_3: 0.22481  (0.23480)
      | > loss_disc_real_4: 0.22896  (0.24172)
      | > loss_disc_real_5: 0.23540  (0.23981)
      | > loss_0: 2.40120  (2.51074)
      | > grad_norm_0: 58.95755  (36.29088)
      | > loss_gen: 2.45456  (2.27450)
      | > loss_kl: 1.46052  (1.52882)
      | > loss_feat: 7.02408  (6.42669)
      | > loss_mel: 17.61349  (17.25639)
      | > loss_duration: 2.53441  (2.48982)
      | > amp_scaler: 64.00000  (787.59184)
      | > loss_1: 31.08706  (29.97622)
      | > grad_norm_1: 0.00000  (574.39398)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.53990  (0.63267)
      | > loader_time: 0.11400  (0.10760)

    --> STEP: 74/377 -- GLOBAL_STEP: 370075
      | > loss_disc: 2.49662  (2.50949)
      | > loss_disc_real_0: 0.12039  (0.12783)
      | > loss_disc_real_1: 0.21889  (0.21689)
      | > loss_disc_real_2: 0.21446  (0.21747)
      | > loss_disc_real_3: 0.25341  (0.23359)
      | > loss_disc_real_4: 0.24759  (0.24057)
      | > loss_disc_real_5: 0.26163  (0.24075)
      | > loss_0: 2.49662  (2.50949)
      | > grad_norm_0: 36.04248  (47.97209)
      | > loss_gen: 2.51684  (2.27085)
      | > loss_kl: 1.46986  (1.49425)
      | > loss_feat: 6.42071  (6.28306)
      | > loss_mel: 17.80134  (17.11130)
      | > loss_duration: 2.55465  (2.48931)
      | > amp_scaler: 64.00000  (543.13514)
      | > loss_1: 30.76340  (29.64877)
      | > grad_norm_1: 1184.71509  (690.00647)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.58500  (0.62094)
      | > loader_time: 0.11950  (0.11202)

    --> STEP: 99/377 -- GLOBAL_STEP: 370100
      | > loss_disc: 2.42966  (2.50192)
      | > loss_disc_real_0: 0.12394  (0.12583)
      | > loss_disc_real_1: 0.22320  (0.21751)
      | > loss_disc_real_2: 0.21819  (0.21700)
      | > loss_disc_real_3: 0.24807  (0.23354)
      | > loss_disc_real_4: 0.31490  (0.23937)
      | > loss_disc_real_5: 0.25939  (0.24174)
      | > loss_0: 2.42966  (2.50192)
      | > grad_norm_0: 33.07309  (46.75923)
      | > loss_gen: 2.35781  (2.28785)
      | > loss_kl: 1.44811  (1.47072)
      | > loss_feat: 5.79762  (6.20053)
      | > loss_mel: 16.15477  (16.99908)
      | > loss_duration: 2.47667  (2.49471)
      | > amp_scaler: 64.00000  (422.14141)
      | > loss_1: 28.23499  (29.45289)
      | > grad_norm_1: 988.29980  (719.70038)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.58830  (0.61360)
      | > loader_time: 0.12390  (0.11479)

    --> STEP: 124/377 -- GLOBAL_STEP: 370125
      | > loss_disc: 2.49874  (2.49829)
      | > loss_disc_real_0: 0.09822  (0.12284)
      | > loss_disc_real_1: 0.22356  (0.21757)
      | > loss_disc_real_2: 0.23667  (0.21665)
      | > loss_disc_real_3: 0.23617  (0.23491)
      | > loss_disc_real_4: 0.24782  (0.23939)
      | > loss_disc_real_5: 0.22937  (0.24127)
      | > loss_0: 2.49874  (2.49829)
      | > grad_norm_0: 62.37086  (48.48700)
      | > loss_gen: 2.21541  (2.28772)
      | > loss_kl: 1.35740  (1.45254)
      | > loss_feat: 5.52345  (6.11881)
      | > loss_mel: 16.70741  (16.93327)
      | > loss_duration: 2.48901  (2.49822)
      | > amp_scaler: 64.00000  (349.93548)
      | > loss_1: 28.29268  (29.29056)
      | > grad_norm_1: 1352.12561  (788.80066)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.60390  (0.60945)
      | > loader_time: 0.13510  (0.11739)

    --> STEP: 149/377 -- GLOBAL_STEP: 370150
      | > loss_disc: 2.50388  (2.49847)
      | > loss_disc_real_0: 0.08631  (0.12228)
      | > loss_disc_real_1: 0.21722  (0.21705)
      | > loss_disc_real_2: 0.21443  (0.21673)
      | > loss_disc_real_3: 0.23007  (0.23471)
      | > loss_disc_real_4: 0.23048  (0.23858)
      | > loss_disc_real_5: 0.24302  (0.24132)
      | > loss_0: 2.50388  (2.49847)
      | > grad_norm_0: 28.48228  (53.97705)
      | > loss_gen: 2.26191  (2.28681)
      | > loss_kl: 1.47759  (1.44310)
      | > loss_feat: 5.18443  (6.06661)
      | > loss_mel: 15.30055  (16.90019)
      | > loss_duration: 2.53379  (2.50102)
      | > amp_scaler: 64.00000  (301.95973)
      | > loss_1: 26.75827  (29.19773)
      | > grad_norm_1: 949.99237  (813.57202)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.60550  (0.60962)
      | > loader_time: 0.13350  (0.12009)

    --> STEP: 174/377 -- GLOBAL_STEP: 370175
      | > loss_disc: 2.39914  (2.49854)
      | > loss_disc_real_0: 0.07777  (0.12209)
      | > loss_disc_real_1: 0.19062  (0.21694)
      | > loss_disc_real_2: 0.24513  (0.21669)
      | > loss_disc_real_3: 0.24207  (0.23474)
      | > loss_disc_real_4: 0.19594  (0.23828)
      | > loss_disc_real_5: 0.25583  (0.24149)
      | > loss_0: 2.39914  (2.49854)
      | > grad_norm_0: 15.68722  (55.34041)
      | > loss_gen: 2.28351  (2.28238)
      | > loss_kl: 1.31015  (1.43668)
      | > loss_feat: 5.50243  (5.99044)
      | > loss_mel: 15.72582  (16.81423)
      | > loss_duration: 2.54678  (2.50541)
      | > amp_scaler: 64.00000  (267.77011)
      | > loss_1: 27.36870  (29.02916)
      | > grad_norm_1: 310.45511  (798.36810)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.61040  (0.60996)
      | > loader_time: 0.14050  (0.12249)

    --> STEP: 199/377 -- GLOBAL_STEP: 370200
      | > loss_disc: 2.58452  (2.49874)
      | > loss_disc_real_0: 0.14357  (0.12228)
      | > loss_disc_real_1: 0.23378  (0.21645)
      | > loss_disc_real_2: 0.20141  (0.21621)
      | > loss_disc_real_3: 0.25487  (0.23460)
      | > loss_disc_real_4: 0.24570  (0.23820)
      | > loss_disc_real_5: 0.26007  (0.24163)
      | > loss_0: 2.58452  (2.49874)
      | > grad_norm_0: 41.92390  (57.42381)
      | > loss_gen: 2.31018  (2.28013)
      | > loss_kl: 1.45325  (1.43337)
      | > loss_feat: 6.08208  (5.94675)
      | > loss_mel: 16.64436  (16.78786)
      | > loss_duration: 2.49012  (2.50804)
      | > amp_scaler: 64.00000  (242.17085)
      | > loss_1: 28.97999  (28.95618)
      | > grad_norm_1: 1378.86316  (800.98450)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.61740  (0.61082)
      | > loader_time: 0.14300  (0.12491)

    --> STEP: 224/377 -- GLOBAL_STEP: 370225
      | > loss_disc: 2.46040  (2.49742)
      | > loss_disc_real_0: 0.11919  (0.12139)
      | > loss_disc_real_1: 0.21821  (0.21661)
      | > loss_disc_real_2: 0.21805  (0.21626)
      | > loss_disc_real_3: 0.24330  (0.23431)
      | > loss_disc_real_4: 0.21288  (0.23812)
      | > loss_disc_real_5: 0.23401  (0.24184)
      | > loss_0: 2.46040  (2.49742)
      | > grad_norm_0: 62.31737  (58.49862)
      | > loss_gen: 2.30038  (2.27995)
      | > loss_kl: 1.41603  (1.43048)
      | > loss_feat: 6.09071  (5.91580)
      | > loss_mel: 16.84660  (16.75456)
      | > loss_duration: 2.55037  (2.50907)
      | > amp_scaler: 64.00000  (222.28571)
      | > loss_1: 29.20410  (28.88987)
      | > grad_norm_1: 1142.81165  (810.06427)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.64070  (0.61229)
      | > loader_time: 0.14770  (0.12733)

    --> STEP: 249/377 -- GLOBAL_STEP: 370250
      | > loss_disc: 2.41278  (2.49977)
      | > loss_disc_real_0: 0.07805  (0.12119)
      | > loss_disc_real_1: 0.20048  (0.21695)
      | > loss_disc_real_2: 0.16089  (0.21622)
      | > loss_disc_real_3: 0.24116  (0.23509)
      | > loss_disc_real_4: 0.20539  (0.23839)
      | > loss_disc_real_5: 0.22120  (0.24181)
      | > loss_0: 2.41278  (2.49977)
      | > grad_norm_0: 86.82614  (62.35238)
      | > loss_gen: 2.28892  (2.27705)
      | > loss_kl: 1.34396  (1.42752)
      | > loss_feat: 5.24624  (5.87454)
      | > loss_mel: 15.47409  (16.73135)
      | > loss_duration: 2.51742  (2.51081)
      | > amp_scaler: 64.00000  (206.39357)
      | > loss_1: 26.87063  (28.82129)
      | > grad_norm_1: 1102.44116  (831.24805)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.65500  (0.61489)
      | > loader_time: 0.15630  (0.12990)

    --> STEP: 274/377 -- GLOBAL_STEP: 370275
      | > loss_disc: 2.51191  (2.49994)
      | > loss_disc_real_0: 0.06421  (0.12171)
      | > loss_disc_real_1: 0.19463  (0.21714)
      | > loss_disc_real_2: 0.16270  (0.21622)
      | > loss_disc_real_3: 0.24546  (0.23505)
      | > loss_disc_real_4: 0.20212  (0.23802)
      | > loss_disc_real_5: 0.24462  (0.24191)
      | > loss_0: 2.51191  (2.49994)
      | > grad_norm_0: 46.68482  (66.10759)
      | > loss_gen: 2.20772  (2.27897)
      | > loss_kl: 1.40215  (1.42454)
      | > loss_feat: 5.80042  (5.84435)
      | > loss_mel: 16.68928  (16.70909)
      | > loss_duration: 2.48549  (2.51028)
      | > amp_scaler: 64.00000  (193.40146)
      | > loss_1: 28.58506  (28.76727)
      | > grad_norm_1: 1204.98669  (851.48755)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.64200  (0.61749)
      | > loader_time: 0.15860  (0.13237)

    --> STEP: 299/377 -- GLOBAL_STEP: 370300
      | > loss_disc: 2.35909  (2.49897)
      | > loss_disc_real_0: 0.08569  (0.12071)
      | > loss_disc_real_1: 0.21756  (0.21732)
      | > loss_disc_real_2: 0.18440  (0.21666)
      | > loss_disc_real_3: 0.22291  (0.23499)
      | > loss_disc_real_4: 0.18121  (0.23730)
      | > loss_disc_real_5: 0.23695  (0.24172)
      | > loss_0: 2.35909  (2.49897)
      | > grad_norm_0: 29.11191  (67.94752)
      | > loss_gen: 2.36991  (2.28058)
      | > loss_kl: 1.37026  (1.42306)
      | > loss_feat: 5.85688  (5.82307)
      | > loss_mel: 16.07250  (16.68771)
      | > loss_duration: 2.56373  (2.51004)
      | > amp_scaler: 64.00000  (182.58194)
      | > loss_1: 28.23328  (28.72448)
      | > grad_norm_1: 958.85083  (859.88385)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.65280  (0.62109)
      | > loader_time: 0.16590  (0.13497)

    --> STEP: 324/377 -- GLOBAL_STEP: 370325
      | > loss_disc: 2.53370  (2.49874)
      | > loss_disc_real_0: 0.16366  (0.12012)
      | > loss_disc_real_1: 0.25245  (0.21747)
      | > loss_disc_real_2: 0.16801  (0.21675)
      | > loss_disc_real_3: 0.27879  (0.23552)
      | > loss_disc_real_4: 0.26844  (0.23788)
      | > loss_disc_real_5: 0.26195  (0.24183)
      | > loss_0: 2.53370  (2.49874)
      | > grad_norm_0: 121.98225  (68.71590)
      | > loss_gen: 2.19249  (2.28587)
      | > loss_kl: 1.41554  (1.42068)
      | > loss_feat: 4.91315  (5.80104)
      | > loss_mel: 16.33810  (16.65991)
      | > loss_duration: 2.49696  (2.50992)
      | > amp_scaler: 64.00000  (173.43210)
      | > loss_1: 27.35625  (28.67745)
      | > grad_norm_1: 818.79114  (848.46521)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.68870  (0.62467)
      | > loader_time: 0.17230  (0.13780)

    --> STEP: 349/377 -- GLOBAL_STEP: 370350
      | > loss_disc: 2.35437  (2.49853)
      | > loss_disc_real_0: 0.08545  (0.11980)
      | > loss_disc_real_1: 0.21104  (0.21723)
      | > loss_disc_real_2: 0.20319  (0.21665)
      | > loss_disc_real_3: 0.18614  (0.23552)
      | > loss_disc_real_4: 0.16213  (0.23706)
      | > loss_disc_real_5: 0.21345  (0.24158)
      | > loss_0: 2.35437  (2.49853)
      | > grad_norm_0: 29.75313  (69.62205)
      | > loss_gen: 2.48546  (2.28384)
      | > loss_kl: 1.36163  (1.41934)
      | > loss_feat: 5.91416  (5.78107)
      | > loss_mel: 16.40535  (16.64863)
      | > loss_duration: 2.47625  (2.50986)
      | > amp_scaler: 64.00000  (165.59312)
      | > loss_1: 28.64283  (28.64276)
      | > grad_norm_1: 446.78738  (836.56036)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.71450  (0.63001)
      | > loader_time: 0.18870  (0.14188)

    --> STEP: 374/377 -- GLOBAL_STEP: 370375
      | > loss_disc: 2.24464  (2.49672)
      | > loss_disc_real_0: 0.16075  (0.11888)
      | > loss_disc_real_1: 0.16895  (0.21738)
      | > loss_disc_real_2: 0.20525  (0.21702)
      | > loss_disc_real_3: 0.17370  (0.23456)
      | > loss_disc_real_4: 0.05813  (0.23714)
      | > loss_disc_real_5: 0.21070  (0.24214)
      | > loss_0: 2.24464  (2.49672)
      | > grad_norm_0: 79.58676  (68.79191)
      | > loss_gen: 2.22154  (2.29015)
      | > loss_kl: 1.37339  (1.41934)
      | > loss_feat: 5.77240  (5.77355)
      | > loss_mel: 15.62250  (16.63053)
      | > loss_duration: 2.51898  (2.51003)
      | > amp_scaler: 64.00000  (158.80214)
      | > loss_1: 27.50881  (28.62361)
      | > grad_norm_1: 627.33990  (818.97894)
      | > current_lr_0: 0.00018 
      | > current_lr_1: 0.00018 
      | > step_time: 0.75170  (0.63662)
      | > loader_time: 0.22000  (0.14595)

  > EVALUATION 

  > Setting up Audio Processor...
  | > sample_rate:44100
  | > resample:False
  | > num_mels:80
  | > log_func:np.log
  | > min_level_db:-100
  | > frame_shift_ms:None
  | > frame_length_ms:None
  | > ref_level_db:20
  | > fft_size:1024
  | > power:2
  | > preemphasis:0.0
  | > griffin_lim_iters:60
  | > signal_norm:True
  | > symmetric_norm:True
  | > mel_fmin:0
  | > mel_fmax:None
  | > pitch_fmin:1.0
  | > pitch_fmax:640.0
  | > spec_gain:1.0
  | > stft_pad_mode:reflect
  | > max_norm:4.0
  | > clip_norm:True
  | > do_trim_silence:True
  | > trim_db:60
  | > do_sound_norm:False
  | > do_amp_to_db_linear:False
  | > do_amp_to_db_mel:True
  | > do_rms_norm:False
  | > db_level:None
  | > stats_path:None
  | > base:2.718281828459045
  | > hop_length:256
  | > win_length:1024
  | > Found 12185 files in /code/data/benyw-de

 > DataLoader initialization
 | > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: cy
        | > phoneme backend: espeak
 | > Number of instances : 12064
  | > Preprocessing samples
  | > Max text length: 109
  | > Min text length: 35
  | > Avg text length: 58.87151856763926
  | 
  | > Max audio length: 1037992.0
  | > Min audio length: 159297.0
  | > Avg audio length: 375388.7858090186
  | > Num. instances discarded samples: 0
  | > Batch group size: 160.

 > DataLoader initialization
 | > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: cy
        | > phoneme backend: espeak
 | > Number of instances : 121
  | > Preprocessing samples
  | > Max text length: 90
  | > Min text length: 41
  | > Avg text length: 59.72727272727273
  | 
  | > Max audio length: 579685.0
  | > Min audio length: 179748.0
  | > Avg audio length: 376694.19008264464
  | > Num. instances discarded samples: 0
  | > Batch group size: 0.
    --> STEP: 0
      | > loss_disc: 2.72943  (2.72943)
      | > loss_disc_real_0: 0.17089  (0.17089)
      | > loss_disc_real_1: 0.27351  (0.27351)
      | > loss_disc_real_2: 0.23078  (0.23078)
      | > loss_disc_real_3: 0.20429  (0.20429)
      | > loss_disc_real_4: 0.27960  (0.27960)
      | > loss_disc_real_5: 0.21382  (0.21382)
      | > loss_0: 2.72943  (2.72943)
      | > loss_gen: 2.03731  (2.03731)
      | > loss_kl: 2.76852  (2.76852)
      | > loss_feat: 5.03488  (5.03488)
      | > loss_mel: 17.20503  (17.20503)
      | > loss_duration: 3.14819  (3.14819)
      | > loss_1: 30.19394  (30.19394)

    --> STEP: 1
      | > loss_disc: 2.62401  (2.62401)
      | > loss_disc_real_0: 0.15260  (0.15260)
      | > loss_disc_real_1: 0.28940  (0.28940)
      | > loss_disc_real_2: 0.21560  (0.21560)
      | > loss_disc_real_3: 0.23761  (0.23761)
      | > loss_disc_real_4: 0.25827  (0.25827)
      | > loss_disc_real_5: 0.22583  (0.22583)
      | > loss_0: 2.62401  (2.62401)
      | > loss_gen: 2.21936  (2.21936)
      | > loss_kl: 3.39704  (3.39704)
      | > loss_feat: 6.03638  (6.03638)
      | > loss_mel: 17.22553  (17.22553)
      | > loss_duration: 2.82314  (2.82314)
      | > loss_1: 31.70146  (31.70146)

    --> STEP: 2
      | > loss_disc: 2.74400  (2.68401)
      | > loss_disc_real_0: 0.09604  (0.12432)
      | > loss_disc_real_1: 0.31334  (0.30137)
      | > loss_disc_real_2: 0.19451  (0.20506)
      | > loss_disc_real_3: 0.24281  (0.24021)
      | > loss_disc_real_4: 0.30612  (0.28220)
      | > loss_disc_real_5: 0.22022  (0.22303)
      | > loss_0: 2.74400  (2.68401)
      | > loss_gen: 1.96078  (2.09007)
      | > loss_kl: 3.07660  (3.23682)
      | > loss_feat: 4.84895  (5.44267)
      | > loss_mel: 16.17185  (16.69869)
      | > loss_duration: 2.77400  (2.79857)
      | > loss_1: 28.83219  (30.26682)

    --> STEP: 3
      | > loss_disc: 2.78879  (2.71893)
      | > loss_disc_real_0: 0.10786  (0.11883)
      | > loss_disc_real_1: 0.29844  (0.30039)
      | > loss_disc_real_2: 0.22643  (0.21218)
      | > loss_disc_real_3: 0.21334  (0.23125)
      | > loss_disc_real_4: 0.30571  (0.29004)
      | > loss_disc_real_5: 0.21500  (0.22035)
      | > loss_0: 2.78879  (2.71893)
      | > loss_gen: 1.90015  (2.02676)
      | > loss_kl: 2.75416  (3.07593)
      | > loss_feat: 4.14758  (5.01097)
      | > loss_mel: 15.27858  (16.22532)
      | > loss_duration: 2.76940  (2.78885)
      | > loss_1: 26.84986  (29.12784)

    --> STEP: 4
      | > loss_disc: 2.87645  (2.75831)
      | > loss_disc_real_0: 0.06458  (0.10527)
      | > loss_disc_real_1: 0.37779  (0.31974)
      | > loss_disc_real_2: 0.26273  (0.22482)
      | > loss_disc_real_3: 0.22814  (0.23047)
      | > loss_disc_real_4: 0.31299  (0.29577)
      | > loss_disc_real_5: 0.19780  (0.21471)
      | > loss_0: 2.87645  (2.75831)
      | > loss_gen: 1.98272  (2.01575)
      | > loss_kl: 2.19848  (2.85657)
      | > loss_feat: 4.78457  (4.95437)
      | > loss_mel: 16.13405  (16.20250)
      | > loss_duration: 2.74825  (2.77870)
      | > loss_1: 27.84805  (28.80789)

    --> STEP: 5
      | > loss_disc: 2.56909  (2.72047)
      | > loss_disc_real_0: 0.10309  (0.10484)
      | > loss_disc_real_1: 0.23537  (0.30287)
      | > loss_disc_real_2: 0.24038  (0.22793)
      | > loss_disc_real_3: 0.23468  (0.23132)
      | > loss_disc_real_4: 0.26872  (0.29036)
      | > loss_disc_real_5: 0.21961  (0.21569)
      | > loss_0: 2.56909  (2.72047)
      | > loss_gen: 2.11476  (2.03555)
      | > loss_kl: 2.37192  (2.75964)
      | > loss_feat: 5.95435  (5.15436)
      | > loss_mel: 17.48834  (16.45967)
      | > loss_duration: 2.71729  (2.76642)
      | > loss_1: 30.64666  (29.17564)

    --> STEP: 6
      | > loss_disc: 2.66211  (2.71074)
      | > loss_disc_real_0: 0.12694  (0.10852)
      | > loss_disc_real_1: 0.27280  (0.29786)
      | > loss_disc_real_2: 0.19381  (0.22224)
      | > loss_disc_real_3: 0.22851  (0.23085)
      | > loss_disc_real_4: 0.26605  (0.28631)
      | > loss_disc_real_5: 0.20644  (0.21415)
      | > loss_0: 2.66211  (2.71074)
      | > loss_gen: 2.05942  (2.03953)
      | > loss_kl: 3.06170  (2.80998)
      | > loss_feat: 5.76304  (5.25581)
      | > loss_mel: 17.84100  (16.68989)
      | > loss_duration: 2.89074  (2.78714)
      | > loss_1: 31.61590  (29.58236)

    --> STEP: 7
      | > loss_disc: 2.73430  (2.71411)
      | > loss_disc_real_0: 0.08186  (0.10471)
      | > loss_disc_real_1: 0.29985  (0.29814)
      | > loss_disc_real_2: 0.19431  (0.21825)
      | > loss_disc_real_3: 0.17608  (0.22302)
      | > loss_disc_real_4: 0.32253  (0.29148)
      | > loss_disc_real_5: 0.21374  (0.21409)
      | > loss_0: 2.73430  (2.71411)
      | > loss_gen: 1.90644  (2.02052)
      | > loss_kl: 1.98326  (2.69188)
      | > loss_feat: 4.38120  (5.13087)
      | > loss_mel: 16.73102  (16.69577)
      | > loss_duration: 2.77189  (2.78496)
      | > loss_1: 27.77382  (29.32400)

  ! Run is kept in /code/data/runs/benyw-de-October-04-2022_03+15PM-0000000
  | > Synthesizing test sentences.
 Traceback (most recent call last):
   File "/opt/venv/lib/python3.8/site-packages/trainer/trainer.py", line 1533, in fit
     self._fit()
   File "/opt/venv/lib/python3.8/site-packages/trainer/trainer.py", line 1521, in _fit
     self.test_run()
   File "/opt/venv/lib/python3.8/site-packages/trainer/trainer.py", line 1439, in test_run
     test_outputs = self.model.test_run(self.training_assets)
   File "/opt/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
     return func(*args, **kwargs)
   File "/tmp/TTS/TTS/tts/models/vits.py", line 1437, in test_run
     wav, alignment, _, _ = synthesis(
   File "/tmp/TTS/TTS/tts/utils/synthesis.py", line 180, in synthesis
     model.tokenizer.text_to_ids(text, language=language_id),
   File "/tmp/TTS/TTS/tts/utils/text/tokenizer.py", line 107, in text_to_ids
     text = self.text_cleaner(text)
   File "/tmp/TTS/TTS/tts/utils/text/cleaners.py", line 105, in phoneme_cleaners
     text = en_normalize_numbers(text)
   File "/tmp/TTS/TTS/tts/utils/text/english/number_norm.py", line 92, in normalize_numbers
     text = re.sub(_comma_number_re, _remove_commas, text)
   File "/usr/lib/python3.8/re.py", line 210, in sub
     return _compile(pattern, flags).sub(repl, string, count)
 TypeError: expected string or bytes-like object
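The traceback ends in `re.sub`, which only accepts a string (or bytes). A minimal sketch, independent of the Coqui code, of why the cleaner fails when a test sentence comes back from the saved config as a list of characters instead of a string:

```python
import re

sentence = "Hello, world"
broken = list(sentence)  # ['H', 'e', 'l', 'l', 'o', ',', ' ', ...]

# Works on a plain string.
print(re.sub(r",", "", sentence))  # Hello world

# The same call on the character list raises the error from the traceback.
try:
    re.sub(r",", "", broken)
except TypeError as exc:
    print(exc)  # the "expected string or bytes-like object" error seen above
```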

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.0+cu113",
        "TTS": "0.8.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.8.10",
        "version": "#142-Ubuntu SMP Fri Aug 26 12:12:57 UTC 2022"
    }
}

Additional context

No response

RobinE89 commented 1 year ago

I have the same problem. An attempt is made here to synthesize a None object. This occurs both with fine-tuning and when continuing training!

Edit: the problem arises when loading the test sentences. Inside vits.py, `test_run` (decorated with `@torch.no_grad()`) reads `test_sentences = self.config.test_sentences`, but instead of the expected list of strings it returns a list of lists holding the individual letters of each sentence (like `[['H', 'e', 'l', 'l', 'o'], ...]`). The sentences are already written incorrectly in the new config.json inside the trained-model directory, but just correcting them there doesn't seem to change anything.

mobassir94 commented 1 year ago

Facing the same issue. Any update on this problem? How can it be fixed?

RobinE89 commented 1 year ago

As a workaround you can go to the function I mentioned and turn the character arrays back into strings (that works for me). Otherwise we can only hope that the devs read this as soon as possible, or start fine-tuning / continuing training themselves =)
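A minimal sketch of that workaround: before the sentences reach the text cleaner, turn any character lists back into plain strings. The function name and placement are illustrative, not the exact Coqui internals:

```python
def normalize_test_sentences(test_sentences):
    """Join any broken character-list entries back into strings."""
    fixed = []
    for item in test_sentences:
        if isinstance(item, (list, tuple)):
            # A broken entry like ['H', 'e', 'l', 'l', 'o'] becomes "Hello".
            fixed.append("".join(item))
        else:
            fixed.append(item)
    return fixed

print(normalize_test_sentences([["H", "i"], "Already a string"]))
# ['Hi', 'Already a string']
```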

mobassir94 commented 1 year ago

@RobinE89 after fixing this error as you described, I faced another warning during training:

WARNING:tensorboardX.x2num:NaN or Inf found in input tensor. (repeated many times)

followed by NaN loss

--> STEP: 29/104 -- GLOBAL_STEP: 217050
  | > loss_disc: nan (nan)
  | > loss_disc_real_0: nan (nan)
  | > loss_disc_real_1: nan (nan)
  | > loss_disc_real_2: nan (nan)
  | > loss_disc_real_3: nan (nan)
  | > loss_disc_real_4: nan (nan)
  | > loss_disc_real_5: nan (nan)
  | > loss_0: nan (nan)
  | > grad_norm_0: 0.00000 (0.00000)
  | > loss_gen: nan (nan)
  | > loss_kl: nan (nan)
  | > loss_feat: nan (nan)
  | > loss_mel: 15.34929 (16.04639)
  | > loss_duration: nan (nan)
  | > amp_scaler: 0.00000 (0.00000)
  | > loss_1: nan (nan)
  | > grad_norm_1: 0.00000 (0.00000)
  | > current_lr_0: 0.00015
  | > current_lr_1: 0.00015
  | > step_time: 1.05040 (0.99310)
  | > loader_time: 0.01950 (0.02812)

Note that my dataset contains no NaN. I used the same settings to train from scratch and didn't face this issue, but after fine-tuning/continuing training I get NaN losses. Another sad part is that best_model.pth got overwritten with the NaN loss, so when I try to retrain it starts from a NaN loss again. Any help please? @erogol
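A hedged diagnostic sketch for that situation: before resuming, scan the checkpoint's tensors for NaN/Inf so a corrupted best_model.pth is caught early. With a real Coqui checkpoint the state dict would come from `torch.load(...)["model"]`; NumPy arrays stand in for the tensors here, and the function name is illustrative:

```python
import numpy as np

def find_bad_tensors(state_dict):
    """Return the names of tensors containing NaN or Inf values."""
    return [name for name, value in state_dict.items()
            if not np.isfinite(np.asarray(value, dtype=np.float64)).all()]

# Stand-in for a loaded checkpoint state dict.
fake_state = {
    "enc.weight": np.ones((2, 2)),
    "dec.weight": np.array([1.0, np.nan]),
}
print(find_bad_tensors(fake_state))  # ['dec.weight']
```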

erogol commented 1 year ago

I don't think you should `+` strings. Try keeping each test sentence as one whole string.
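For what it's worth, Python also joins adjacent string literals at compile time, so the long test sentences in the repro script could be written without `+` and still end up as single strings in the config. A minimal sketch:

```python
# Explicit concatenation with `+`, as in the repro script.
with_plus = ("WHAT KIND OF RESOURCES? " + "The Language Technologies Portal "
             + "consists of a wide range of resources")

# Implicit concatenation of adjacent literals: no `+` needed.
without_plus = ("WHAT KIND OF RESOURCES? " "The Language Technologies Portal "
                "consists of a wide range of resources")

assert with_plus == without_plus
assert isinstance(without_plus, str)
```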

mobassir94 commented 1 year ago

@erogol @RobinE89 I was able to train VITS, but due to some technical difficulties my training got stopped. Now, when I try to retrain the model from the last saved best checkpoint, I get this:


> Restoring best loss from best_model_11648.pth ...
 > Starting with loaded last best loss 16.256176

 > EPOCH: 0/2000
 --> /home/ansary/Shabab/vits_20_october

 > TRAINING (2022-10-25 23:40:32) 

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: False
| > Number of instances : 6126
 | > Preprocessing samples
 | > Max text length: 114
 | > Min text length: 16
 | > Avg text length: 64.72233104799217
 | 
 | > Max audio length: 276757.0
 | > Min audio length: 49474.0
 | > Avg audio length: 129107.80362389814
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
['<BLNK>', 'ত', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'এ', '<BLNK>', 'ক', '<BLNK>', 'ট', '<BLNK>', 'া', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', 'ণ', '<BLNK>', 'ও', '<BLNK>', ' ', '<BLNK>', 'ছ', '<BLNK>', 'ি', '<BLNK>', 'ল', '<BLNK>', '\n', '<BLNK>']['<BLNK>', 'এ', '<BLNK>', 'র', '<BLNK>', 'ি', '<BLNK>', 'ক', '<BLNK>', 'া', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', 'ো', '<BLNK>', 'হ', '<BLNK>', 'ু', '<BLNK>', 'ট', '<BLNK>', ' ', '<BLNK>', 'য', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'ত', '<BLNK>', 'ু', '<BLNK>', 'ম', '<BLNK>', 'ি', '<BLNK>', ',', '<BLNK>', ' ', '<BLNK>', 'স', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'আ', '<BLNK>', 'ম', '<BLNK>', 'ি', '<BLNK>', ' ', '<BLNK>', 'ট', '<BLNK>', 'ে', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'প', '<BLNK>', 'া', '<BLNK>', 'ই', '<BLNK>', '\n', '<BLNK>']['<BLNK>', 'দ', '<BLNK>', 'ে', '<BLNK>', 'খ', '<BLNK>', 'ে', '<BLNK>', ',', '<BLNK>', ' ', '<BLNK>', 'ও', '<BLNK>', 'ঁ', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', 'া', '<BLNK>', 'ছ', '<BLNK>', ' ', '<BLNK>', 'থ', '<BLNK>', 'ে', '<BLNK>', 'ক', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'ফ', '<BLNK>', '্', '<BLNK>', 'র', '<BLNK>', 'া', '<BLNK>', 'ঞ', '<BLNK>', '্', '<BLNK>', 'চ', '<BLNK>', 'া', '<BLNK>', 'ই', '<BLNK>', 'জ', '<BLNK>', 'ি', '<BLNK>', ' ', '<BLNK>', 'ন', '<BLNK>', 'ি', '<BLNK>', 'য়', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', 'ব', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'শ', '<BLNK>', 'ু', '<BLNK>', 'র', '<BLNK>', 'ু', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', 'র', '<BLNK>', 'ি', '<BLNK>', '\n', '<BLNK>']

 [!] Character '\n' not found in the vocabulary. Discarding it. [!] Character '\n' not found in the vocabulary. Discarding it. [!] Character '\n' not found in the vocabulary. Discarding it.

['<BLNK>', 'ক', '<BLNK>', 'ি', '<BLNK>', 'ন', '<BLNK>', '্', '<BLNK>', 'ত', '<BLNK>', '্', '<BLNK>', 'ত', '<BLNK>', ' ', '<BLNK>', 'ক', '<BLNK>', '্', '<BLNK>', 'ষ', '<BLNK>', 'ম', '<BLNK>', 'ত', '<BLNK>', 'া', '<BLNK>', 'য়', '<BLNK>', ',', '<BLNK>', 'আ', '<BLNK>', 'স', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'প', '<BLNK>', 'র', '<BLNK>', 'ে', '<BLNK>', ',', '<BLNK>', 'ত', '<BLNK>', 'া', '<BLNK>', 'ঁ', '<BLNK>', 'ক', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'উ', '<BLNK>', 'দ', '<BLNK>', '্', '<BLNK>', 'ভ', '<BLNK>', '্', '<BLNK>', 'র', '<BLNK>', 'া', '<BLNK>', 'ন', '<BLNK>', '্', '<BLNK>', 'ত', '<BLNK>', ' ', '<BLNK>', 'ম', '<BLNK>', 'ন', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'হ', '<BLNK>', 'চ', '<BLNK>', '্', '<BLNK>', 'ছ', '<BLNK>', 'ে', '<BLNK>', '\n', '<BLNK>']
 [!] Character '\n' not found in the vocabulary. Discarding it.
/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/torch/functional.py:606: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/SpectralOps.cpp:800.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

   --> STEP: 49/64 -- GLOBAL_STEP: 288050
     | > loss_disc: 2.38776  (2.39514)
     | > loss_disc_real_0: 0.12978  (0.13257)
     | > loss_disc_real_1: 0.19328  (0.19674)
     | > loss_disc_real_2: 0.23138  (0.22002)
     | > loss_disc_real_3: 0.20934  (0.22139)
     | > loss_disc_real_4: 0.21563  (0.21840)
     | > loss_disc_real_5: 0.22416  (0.22545)
     | > loss_0: 2.38776  (2.39514)
     | > grad_norm_0: 18.32785  (16.97663)
     | > loss_gen: 2.49931  (2.45184)
     | > loss_kl: 1.21337  (1.24123)
     | > loss_feat: 8.46184  (8.30167)
     | > loss_mel: 15.62020  (15.57008)
     | > loss_duration: 1.48671  (1.45953)
     | > amp_scaler: 512.00000  (1107.59184)
     | > loss_1: 29.28143  (29.02435)
     | > grad_norm_1: 349.82025  (237.90034)
     | > current_lr_0: 0.00011 
     | > current_lr_1: 0.00011 
     | > step_time: 1.67390  (2.26524)
     | > loader_time: 0.02980  (0.02500)

 > EVALUATION 

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: False
| > Number of instances : 61
 | > Preprocessing samples
 | > Max text length: 92
 | > Min text length: 40
 | > Avg text length: 64.52459016393442
 | 
 | > Max audio length: 171374.0
 | > Min audio length: 68524.0
 | > Avg audio length: 126232.5081967213
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
['<BLNK>', 'ত', '<BLNK>', 'া', '<BLNK>', 'ঁ', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'জ', '<BLNK>', 'ন', '<BLNK>', '্', '<BLNK>', 'য', '<BLNK>', ' ', '<BLNK>', 'ম', '<BLNK>', 'ে', '<BLNK>', 'ক', '<BLNK>', ' ', '<BLNK>', 'আ', '<BLNK>', 'প', '<BLNK>', 'ে', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'ল', '<BLNK>', 'ো', '<BLNK>', 'ক', '<BLNK>', ' ', '<BLNK>', 'এ', '<BLNK>', 'স', '<BLNK>', 'ে', '<BLNK>', 'ছ', '<BLNK>', 'ে', '<BLNK>', 'ন', '<BLNK>', ',', '<BLNK>', 'ই', '<BLNK>', 'ং', '<BLNK>', 'ল', '<BLNK>', 'ণ', '<BLNK>', '্', '<BLNK>', 'ড', '<BLNK>', ' ', '<BLNK>', 'থ', '<BLNK>', 'ে', '<BLNK>', 'ক', '<BLNK>', 'ে', '<BLNK>', '\n', '<BLNK>']['<BLNK>', 'স', '<BLNK>', '্', '<BLNK>', 'ক', '<BLNK>', 'ু', '<BLNK>', 'ল', '<BLNK>', 'ে', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'প', '<BLNK>', '্', '<BLNK>', 'র', '<BLNK>', 'ধ', '<BLNK>', 'া', '<BLNK>', 'ন', '<BLNK>', ' ', '<BLNK>', 'শ', '<BLNK>', 'ি', '<BLNK>', 'ক', '<BLNK>', '্', '<BLNK>', 'ষ', '<BLNK>', 'ক', '<BLNK>', ',', '<BLNK>', ' ', '<BLNK>', 'প', '<BLNK>', '্', '<BLNK>', 'র', '<BLNK>', 'ণ', '<BLNK>', 'য়', '<BLNK>', 'চ', '<BLNK>', 'ন', '<BLNK>', '্', '<BLNK>', 'দ', '<BLNK>', '্', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'ভ', '<BLNK>', 'ট', '<BLNK>', '্', '<BLNK>', 'ট', '<BLNK>', 'া', '<BLNK>', 'চ', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', '্', '<BLNK>', 'য', '<BLNK>', ',', '<BLNK>', ' ', '<BLNK>', 'ম', '<BLNK>', 'া', '<BLNK>', 'র', '<BLNK>', 'ধ', '<BLNK>', 'র', '<BLNK>', 'ে', '<BLNK>', 'র', '<BLNK>', ' ', '<BLNK>', 'অ', '<BLNK>', 'ভ', '<BLNK>', 'ি', '<BLNK>', 'য', '<BLNK>', 'ো', '<BLNK>', 'গ', '<BLNK>', ' ', '<BLNK>', 'ম', '<BLNK>', 'ে', '<BLNK>', 'ন', '<BLNK>', 'ে', '<BLNK>', ' ', '<BLNK>', 'ন', '<BLNK>', 'ি', '<BLNK>', 'য়', '<BLNK>', 'ে', '<BLNK>', 'ছ', '<BLNK>', 'ে', '<BLNK>', 'ন', '<BLNK>', '\n', '<BLNK>']
 [!] Character '\n' not found in the vocabulary. Discarding it.

 [!] Character '\n' not found in the vocabulary. Discarding it.
 ! Run is kept in /home/ansary/Shabab/vits_20_october
 | > Synthesizing test sentences.
Traceback (most recent call last):
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py", line 1521, in _fit
    self.test_run()
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py", line 1439, in test_run
    test_outputs = self.model.test_run(self.training_assets)
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/models/vits.py", line 1435, in test_run
    wav, alignment, _, _ = synthesis(
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/synthesis.py", line 180, in synthesis
    model.tokenizer.text_to_ids(text, language=language_id),
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/text/tokenizer.py", line 111, in text_to_ids
    text = self.intersperse_blank_char(text, True)
  File "/home/ansary/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/text/tokenizer.py", line 130, in intersperse_blank_char
    result = [char_to_use] * (len(char_sequence) * 2 + 1)
TypeError: object of type 'NoneType' has no len()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py:1533, in Trainer.fit(self)
   1532 try:
-> 1533     self._fit()
   1534     if self.args.rank == 0:

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py:1521, in Trainer._fit(self)
   1520 if epoch >= self.config.test_delay_epochs and self.args.rank <= 0:
-> 1521     self.test_run()
   1522 self.c_logger.print_epoch_end(
   1523     epoch,
   1524     self.keep_avg_eval.avg_values if self.config.run_eval else self.keep_avg_train.avg_values,
   1525 )

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py:1439, in Trainer.test_run(self)
   1438     else:
-> 1439         test_outputs = self.model.test_run(self.training_assets)
   1440 elif hasattr(self.model, "test") or (self.num_gpus > 1 and hasattr(self.model.module, "test")):

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     26 with self.clone():
---> 27     return func(*args, **kwargs)

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/models/vits.py:1435, in Vits.test_run(self, assets)
   1434 aux_inputs = self.get_aux_input_from_test_sentences(s_info)
-> 1435 wav, alignment, _, _ = synthesis(
   1436     self,
   1437     aux_inputs["text"],
   1438     self.config,
   1439     "cuda" in str(next(self.parameters()).device),
   1440     speaker_id=aux_inputs["speaker_id"],
   1441     d_vector=aux_inputs["d_vector"],
   1442     style_wav=aux_inputs["style_wav"],
   1443     language_id=aux_inputs["language_id"],
   1444     use_griffin_lim=True,
   1445     do_trim_silence=False,
   1446 ).values()
   1447 test_audios["{}-audio".format(idx)] = wav

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/synthesis.py:180, in synthesis(model, text, CONFIG, use_cuda, speaker_id, style_wav, style_text, use_griffin_lim, do_trim_silence, d_vector, language_id)
    178 # convert text to sequence of token IDs
    179 text_inputs = np.asarray(
--> 180     model.tokenizer.text_to_ids(text, language=language_id),
    181     dtype=np.int32,
    182 )
    183 # pass tensors to backend

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/text/tokenizer.py:111, in TTSTokenizer.text_to_ids(self, text, language)
    110 if self.add_blank:
--> 111     text = self.intersperse_blank_char(text, True)
    112 if self.use_eos_bos:

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/TTS/tts/utils/text/tokenizer.py:130, in TTSTokenizer.intersperse_blank_char(self, char_sequence, use_blank_char)
    129 char_to_use = self.characters.blank if use_blank_char else self.characters.pad
--> 130 result = [char_to_use] * (len(char_sequence) * 2 + 1)
    131 result[1::2] = char_sequence

TypeError: object of type 'NoneType' has no len()

During handling of the above exception, another exception occurred:

SystemExit                                Traceback (most recent call last)
File <timed eval>:1

File ~/anaconda3/envs/mobassir/lib/python3.8/site-packages/trainer/trainer.py:1554, in Trainer.fit(self)
   1552 remove_experiment_folder(self.output_path)
   1553 traceback.print_exc()
-> 1554 sys.exit(1)

SystemExit: 1

any help please?
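The traceback above ends with `len()` being called on `None` inside `intersperse_blank_char`, which suggests the tokenizer received `None` instead of a sentence string. A defensive sketch (an assumption about the cause, not the trainer's actual code) that filters `config.test_sentences` after restoring a run:

```python
# Defensive sketch (assumption: the NoneType error comes from empty or null
# entries in config.test_sentences after a run is restored).
def sanitize_test_sentences(sentences):
    """Keep only non-empty strings; unwrap one-element lists like ["text"]."""
    clean = []
    for item in sentences or []:
        if isinstance(item, (list, tuple)):
            item = item[0] if item else None
        if isinstance(item, str) and item.strip():
            clean.append(item.strip())
    return clean

print(sanitize_test_sentences([["hello world"], [""], None, "second one "]))
# → ['hello world', 'second one']
```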

do-web commented 1 year ago

It seems the devs are not using `--continue_path` with the VITS model, because it's not working; there is always an error:

    text = re.sub(_comma_number_re, _remove_commas, text)
  File "/opt/conda/lib/python3.7/re.py", line 194, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
gnmarten commented 1 year ago

I can confirm that "TypeError: expected string or bytes-like object" occurs when resuming training via --continue_path while finetuning a VITS model. Any solution?

HunterKai commented 1 year ago

@RobinE89 after solving this error as you mentioned, I faced another warning during training:

WARNING:tensorboardX.x2num:NaN or Inf found in input tensor. (repeated many times)

followed by NaN loss

--> STEP: 29/104 -- GLOBAL_STEP: 217050 | > loss_disc: nan (nan) | > loss_disc_real_0: nan (nan) | > loss_disc_real_1: nan (nan) | > loss_disc_real_2: nan (nan) | > loss_disc_real_3: nan (nan) | > loss_disc_real_4: nan (nan) | > loss_disc_real_5: nan (nan) | > loss_0: nan (nan) | > grad_norm_0: 0.00000 (0.00000) | > loss_gen: nan (nan) | > loss_kl: nan (nan) | > loss_feat: nan (nan) | > loss_mel: 15.34929 (16.04639) | > loss_duration: nan (nan) | > amp_scaler: 0.00000 (0.00000) | > loss_1: nan (nan) | > grad_norm_1: 0.00000 (0.00000) | > current_lr_0: 0.00015 | > current_lr_1: 0.00015 | > step_time: 1.05040 (0.99310) | > loader_time: 0.01950 (0.02812)

Note that my dataset contains no NaNs. I used the same settings to train from scratch and didn't face this issue, but after finetuning/continuing training I am getting NaNs. Another sad story: best_model.pth got overwritten with a NaN loss, so when I try to retrain it now starts from a NaN loss. Any help please? @erogol

Has this problem been solved?

ziyaad30 commented 1 year ago

I added a `test_sentences_file` entry to train_vits.py that points to a text file containing my test sentences. It works. I noticed that an empty `test_sentences_file` had been copied into config.json, so I struggled a bit, but managed:

import os
from multiprocessing import Process, freeze_support

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor


def main():
    output_path = os.path.dirname(os.path.abspath(__file__))
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech", meta_file_train="metadata.csv", path=os.path.join(output_path, "./davis/")
    )
    audio_config = VitsAudioConfig(
        sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
    )

    config = VitsConfig(
        audio=audio_config,
        run_name="vits_ljspeech",
        batch_size=2,
        eval_batch_size=1,
        batch_group_size=5,
        num_loader_workers=8,
        num_eval_loader_workers=4,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=1000,
        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="en-us",
        phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
        compute_input_seq_cache=True,
        print_step=10,
        print_eval=True,
        mixed_precision=True,
        test_sentences_file="./test.txt",
        output_path=output_path,
        datasets=[dataset_config],
    )

    # INITIALIZE THE AUDIO PROCESSOR
    # Audio processor is used for feature extraction and audio I/O.
    # It mainly serves to the dataloader and the training loggers.
    ap = AudioProcessor.init_from_config(config)

    # INITIALIZE THE TOKENIZER
    # Tokenizer is used to convert text to sequences of token IDs.
    # config is updated with the default characters if not defined in the config.
    tokenizer, config = TTSTokenizer.init_from_config(config)

    # LOAD DATA SAMPLES
    # Each sample is a list of [text, audio_file_path, speaker_name]
    # You can define your custom sample loader returning the list of samples.
    # Or define your custom formatter and pass it to the `load_tts_samples`.
    # Check `TTS.tts.datasets.load_tts_samples` for more details.
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        eval_split_max_size=config.eval_split_max_size,
        eval_split_size=config.eval_split_size,
    )

    # init model
    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # init the trainer and 🚀
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()


if __name__ == "__main__":
    freeze_support()  # needed for Windows
    main()
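A hypothetical layout for the `test.txt` file used above: one sentence per line. Stray `\n` characters are exactly what triggers the `[!] Character '\n' not found in the vocabulary` warnings earlier in this thread, so a loader should strip them (this is a sketch of the assumed format, not Coqui's actual reader).

```python
# Hypothetical test.txt contents: one test sentence per line.
raw = "First test sentence.\nSecond test sentence.\n\n"

# Strip line breaks and skip blanks so the tokenizer never sees '\n' or "".
sentences = [line.strip() for line in raw.splitlines() if line.strip()]
print(sentences)  # → ['First test sentence.', 'Second test sentence.']
```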

skshadan commented 1 year ago

  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\trainer\trainer.py", line 1591, in fit
    self._fit()
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\trainer\trainer.py", line 1548, in _fit
    self.test_run()
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\trainer\trainer.py", line 1466, in test_run
    test_outputs = self.model.test_run(self.training_assets)
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\TTS\tts\models\vits.py", line 1442, in test_run
    wav, alignment, _, _ = synthesis(
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\TTS\tts\utils\synthesis.py", line 186, in synthesis
    model.tokenizer.text_to_ids(text, language=language_name),
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\TTS\tts\utils\text\tokenizer.py", line 108, in text_to_ids
    text = self.text_cleaner(text)
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\TTS\tts\utils\text\cleaners.py", line 125, in phoneme_cleaners
    text = en_normalize_numbers(text)
  File "E:\KL2.0\CODEZ\Coqui\tts-coqui\TTS\venv\lib\site-packages\TTS\tts\utils\text\english\number_norm.py", line 92, in normalize_numbers
    text = re.sub(_comma_number_re, _remove_commas, text)
  File "C:\Users\shada\AppData\Local\Programs\Python\Python38\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

How do I solve this error?
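The `expected string or bytes-like object` error in the tracebacks above comes from `re.sub` inside the number normalizer, which rejects anything that isn't a string. A minimal reproduction, assuming that when resuming, a test sentence arrives as a one-element list like `["text"]` instead of a plain string (the regex and helper below are illustrative, not Coqui's exact code):

```python
import re

# Illustrative stand-in for the "1,000"-style number normalization step.
_comma_number_re = re.compile(r"([0-9][0-9,]+[0-9])")

def remove_commas(text):
    return re.sub(_comma_number_re, lambda m: m.group(1).replace(",", ""), text)

print(remove_commas("1,000 steps"))  # a plain str works fine

try:
    remove_commas(["1,000 steps"])   # a one-element list, as restored from config
except TypeError:
    print("TypeError: re.sub needs a string, not a list")
```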