DanRuta / xva-trainer

UI app for training TTS/VC machine learning models for xVASynth, with several audio pre-processing tools, and dataset creation/management.

Error While Training #9

Closed · royaltongue closed this 1 year ago

royaltongue commented 1 year ago

Settings: (screenshot not included)

Output:

```
18:41:18 | New Session
18:41:18 | No graphs.json file found. Starting anew.
18:41:18 | Dataset: C:/Program Files (x86)/Steam/steamapps/common/xVATrainer/resources/app/datasets//rdfvd_paimon
18:41:18 | Language: English
18:41:18 | Checkpoint: ./resources/app/python/xvapitch/pretrained_models/xVAPitch_5820651.pt
18:41:18 | CUDA device IDs: 0
18:41:18 | FP16: Disabled
18:41:18 | Batch size: 6 (Base: 6, GPUs mult: 1) | GAM: 67 -> (402) | Target: 400
18:41:18 | Outputting model backups every 3 checkpoints
18:41:19 | Loading model and optimizer state from ./resources/app/python/xvapitch/pretrained_models/xVAPitch_5820651.pt
18:41:20 | New voice
18:41:20 | Workers: 3
18:41:38 | Fine-tune dataset files: 7
18:45:00 | Priors datasets files: 179007 | Number of datasets: 28
```

Error:

```
Traceback (most recent call last):
  File "server.py", line 227, in handleTrainingLoop
  File "python\xvapitch\xva_train.py", line 137, in handleTrainer
  File "python\xvapitch\xva_train.py", line 557, in start
  File "python\xvapitch\xva_train.py", line 604, in iteration
  File "python\xvapitch\xva_train.py", line 391, in init
  File "C:\Program Files (x86)\Steam\steamapps\common\xVATrainer\.\resources\app\python\xvapitch\get_dataset_emb.py", line 18, in get_emb
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embs)
  File "sklearn\cluster\_kmeans.py", line 1376, in fit
    self._check_params(X)
  File "sklearn\cluster\_kmeans.py", line 1307, in _check_params
    super()._check_params(X)
  File "sklearn\cluster\_kmeans.py", line 828, in _check_params
    raise ValueError(
ValueError: n_samples=7 should be >= n_clusters=10.
```
DanRuta commented 1 year ago

The issue here lies with the clustering pre-processing step at the beginning of training. Because the speech model is embedding-based, a speech style embedding must be provided. By default, this should be the most "normal" sounding speaking style. I approximate it by running k-means clustering (k=10) over all the audio clips' embeddings and selecting the centroid of the largest cluster.
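That pre-processing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the random `embs` array stands in for the per-clip speaker/style embeddings that the real voice encoder would produce, and variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the per-clip style embeddings (shape: [n_clips, emb_dim]);
# in xva-trainer these would come from the dataset's audio files.
rng = np.random.default_rng(0)
embs = rng.normal(size=(200, 16))

# Cluster the embeddings into k=10 style groups.
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10).fit(embs)

# Find the largest cluster and take its centroid as the
# "most typical" speaking-style embedding.
labels, counts = np.unique(kmeans.labels_, return_counts=True)
largest_cluster = labels[np.argmax(counts)]
default_style_emb = kmeans.cluster_centers_[largest_cluster]
```

With only 7 embeddings, the `fit` call above is exactly where the `ValueError` in the traceback is raised, since sklearn requires `n_samples >= n_clusters`.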

The problem is that you only have 7 audio files, whereas the clustering is hard-coded to use 10 clusters. This breaks the clustering, as there are fewer files to cluster than there are clusters to form.

7 files is an extremely low number of audio files to use. Generally you'd expect at least 100-200 audio files, or more. An explicit error message would indeed be a good thing to add here to say as much, but the real solution is to gather more training data, because 7 files will probably not yield you much value.
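For reference, a defensive version of that step could clamp the cluster count to the number of samples so tiny datasets fail gracefully instead of raising. This is a hypothetical sketch, not the repository's actual fix; the function name is invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_style_clusters(embs, n_clusters=10):
    """Cluster clip embeddings, clamping k to the sample count so
    datasets with fewer clips than clusters don't raise ValueError."""
    k = min(n_clusters, len(embs))
    return KMeans(n_clusters=k, random_state=0, n_init=10).fit(embs)

# With only 7 clips, k is clamped to 7 instead of crashing.
small_embs = np.random.default_rng(0).normal(size=(7, 16))
km = fit_style_clusters(small_embs)
```

Even with such a guard, clustering 7 points tells you little about the "typical" style, which is why more data remains the actual answer here.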