b04901014 / UUVC

Official implementation for the paper: A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units.
MIT License

Question about inference #5

Closed. dillfrescott closed this issue 1 year ago

dillfrescott commented 1 year ago

How would I switch to the HuBERT extra-large model during inference? I tried replacing the model name in inference.py, but I don't think I got the name right...

b04901014 commented 1 year ago

The textlesslib we used only supports a limited number of pretrained models. To use a pretrained speech unit model that it does not support, you need to manually load the model checkpoint from HuggingFace and then apply k-means clustering over the whole dataset to get the speech units.

Then you need to change s2u.py and inference.py to match the new model, and you will also need to retrain the model. Unfortunately, I don't currently have time to add that support to the repo. However, I just pushed an example, s2u_manual_kmeans.py, for reference on how to run manual k-means on pretrained models from HuggingFace, in case you need to do it yourself.
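For readers who want a feel for the clustering step itself, here is a minimal sketch of turning frame-level features into discrete units with k-means. It is not the code in s2u_manual_kmeans.py: it uses plain NumPy instead of a GPU k-means library, the function names `fit_kmeans` and `assign_units` are made up for illustration, and the random array only stands in for features that would really come from the pretrained model.

```python
import numpy as np


def assign_units(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete unit)."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(features, n_units, n_iters=20):
    """Plain Lloyd's k-means over features of shape (n_frames, emb_dim)."""
    # Evenly spaced frames as a simple deterministic initialisation
    # (an assumption for this sketch; k-means++ is the more usual choice).
    init_idx = np.linspace(0, len(features) - 1, n_units).astype(int)
    centroids = features[init_idx].copy()
    for _ in range(n_iters):
        units = assign_units(features, centroids)
        for k in range(n_units):
            members = features[units == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


# Stand-in for model features pooled over a dataset: two well-separated blobs.
rng = np.random.default_rng(0)
features = np.concatenate(
    [rng.normal(0.0, 0.1, (50, 4)), rng.normal(5.0, 0.1, (50, 4))]
)
centroids = fit_kmeans(features, n_units=2)
units = assign_units(features, centroids)  # one integer unit per frame
```

The centroids have to be fitted once on the whole dataset and then reused at inference time, which is why swapping the feature model also means retraining the downstream model on the new units.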

dillfrescott commented 1 year ago

Thank you!

dillfrescott commented 1 year ago

@b04901014 OK, now it's giving me this error:

Traceback (most recent call last):
  File "s2u_manual_kmeans.py", line 42, in <module>
    labels = kmeans.fit_predict(reps)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/fast_pytorch_kmeans/kmeans.py", line 156, in fit_predict
    batch_size, emb_dim = X.shape
ValueError: too many values to unpack (expected 2)

dillfrescott commented 1 year ago

My bad, it was because I accidentally had the new batch in stereo.
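
For anyone hitting the same ValueError: the traceback shows that fit_predict unpacks `X.shape` into exactly two values, so the features must be 2-D, (n_frames, emb_dim). A stereo clip presumably adds a channel axis and makes them 3-D. The sketch below shows one way to guard against that, assuming the audio is available as a NumPy array of shape (channels, samples). The helper names `to_mono` and `as_frame_matrix` are hypothetical and not part of the repo.

```python
import numpy as np


def to_mono(waveform):
    """Average the channels of a (channels, samples) waveform into (samples,)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 1:  # already mono
        return waveform
    return waveform.mean(axis=0)


def as_frame_matrix(reps):
    """Collapse any leading axes so that reps has shape (n_frames, emb_dim)."""
    reps = np.asarray(reps)
    return reps.reshape(-1, reps.shape[-1])


# Hypothetical 2-channel clip: left channel all ones, right channel all zeros.
stereo = np.stack([np.ones(8), np.zeros(8)])
mono = to_mono(stereo)

# Hypothetical 3-D features, e.g. (channels, frames, emb_dim) from stereo input.
reps_3d = np.zeros((2, 5, 16))
n_frames, emb_dim = as_frame_matrix(reps_3d).shape  # unpacks cleanly once 2-D
```

Down-mixing before feature extraction is the safer fix. Flattening 3-D features only silences the error and treats each channel's frames as separate data points.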