The issue here lies with the clustering pre-processing step at the beginning of training. Because the speech model is embedding-based, a speech style embedding must be provided, and by default it should represent the most "normal"-sounding speaking style. I approximate that by running k-means clustering (k=10) on the embeddings of all the audio clips and selecting the centroid of the largest cluster.
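For reference, here is a minimal sketch of that selection step, assuming the clip embeddings are already stacked into an `(n_clips, dim)` NumPy array (the function and variable names are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans

def default_style_embedding(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Pick the centroid of the largest k-means cluster as the 'normal' style."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    # Count how many clips landed in each cluster and take the biggest one.
    labels, counts = np.unique(kmeans.labels_, return_counts=True)
    largest = labels[np.argmax(counts)]
    return kmeans.cluster_centers_[largest]
```

With only 7 clips and k=10, a k-means implementation like scikit-learn's will raise an error here, since it cannot form more clusters than there are samples.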
The problem is that you only have 7 audio files, while the clustering is hard-coded to use 10 clusters. This breaks the clustering: there are fewer files to cluster than there are clusters to form.
7 files is also an extremely small training set; generally you'd want at least 100-200 audio files, and ideally more. An explicit error message for this case would indeed be a good addition, but the real solution is to gather more training data, because 7 files will probably not yield much value.
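One way such an explicit check could look (a sketch of the idea, not the project's actual fix) is a guard before the clustering runs:

```python
def default_style_embedding_checked(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Validate the dataset size before clustering, then delegate."""
    n = embeddings.shape[0]
    if n < k:
        raise ValueError(
            f"Style clustering needs at least {k} audio clips, got {n}. "
            "Add more training data (ideally 100+ clips)."
        )
    return default_style_embedding(embeddings, k)
```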