Closed. 0417keito closed this issue 4 months ago.
Yes, our HuBERT units are extracted by running k-means on the features from HuBERT layer 9. This is the standard way to obtain HuBERT units, and it is also the implementation used in fairseq. You can address the issue you mentioned by padding the input waveform with zeros.
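As a minimal sketch of the zero-padding workaround: assuming a HuBERT-style front end with a 20 ms hop (320 samples at 16 kHz), right-padding the waveform restores the expected frame count. The exact padding amount depends on your model's convolutional front end, so treat the numbers here as illustrative.

```python
import numpy as np

def pad_for_hubert(wav: np.ndarray, hop: int = 320) -> np.ndarray:
    """Right-pad a (1, T) waveform with zeros so a HuBERT-style front end
    (hop = 320 samples = 20 ms at 16 kHz) yields one frame per hop.

    HuBERT's strided convolutions consume some context at the edges, so an
    input whose length is an exact multiple of the hop can still come out
    one frame short (e.g. 99 frames instead of 100 for 32000 samples);
    zero-padding supplies the missing context.
    """
    t = wav.shape[-1]
    # round up to a multiple of the hop, then add half a hop of extra context
    target = ((t + hop - 1) // hop) * hop + hop // 2
    return np.pad(wav, ((0, 0), (0, target - t)))

wav = np.zeros((1, 32000))       # 2 s of 16 kHz audio
padded = pad_for_hubert(wav)
print(padded.shape)              # (1, 32160)
```

With this padding, the HuBERT feature sequence and the RVQ output should both have 100 frames for a 2-second input, so the pseudolabels and the quantized features line up one-to-one.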
Thank you for telling us! This Tokenizer and the paper have helped me a lot!
Sorry for the repeated questions about distillation with HuBERT units as pseudolabels. The paper says the cross-entropy is taken over the quantised output q_t of the first RVQ layer, using a projection matrix A. How do I obtain the quantised output of the first RVQ layer? Also, the HuBERT unit published in speech-resynthesis has at most 200 clusters; did you train your own HuBERT unit model separately from that one?
During training, the RVQ outputs both the quantized features summed across all layers (q) and the quantized output of the first layer (q_1) at the same time (see the code). For pseudolabel prediction, we feed q to the decoder for speech resynthesis and feed q_1 to a softmax classifier that predicts the pseudolabels. The projection matrix A is in fact the weight matrix of that softmax classifier.
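A minimal sketch of this setup, not the authors' actual code: each RVQ layer quantizes the residual left by the previous layers, the layer outputs are summed into q, the first layer's output q_1 is kept separately, and a linear projection A (the softmax classifier's weight matrix) maps q_1 to HuBERT-unit logits. All shapes and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class RVQSketch:
    """Toy residual vector quantizer: each layer quantizes the residual of
    the previous layers and returns both the summed output q and the
    first-layer output q_1."""

    def __init__(self, n_layers: int = 8, codebook_size: int = 1024, dim: int = 128):
        self.codebooks = [rng.standard_normal((codebook_size, dim))
                          for _ in range(n_layers)]

    def __call__(self, x: np.ndarray):
        # x: (T, dim) frame-level encoder features
        residual, q, q_1 = x, np.zeros_like(x), None
        for i, cb in enumerate(self.codebooks):
            # nearest codeword per frame (squared Euclidean distance)
            idx = np.argmin(((residual[:, None] - cb[None]) ** 2).sum(-1), axis=1)
            quantized = cb[idx]
            if i == 0:
                q_1 = quantized          # first-layer output, used for pseudolabels
            q = q + quantized
            residual = residual - quantized
        return q, q_1

# Softmax classifier: projection matrix A maps q_1 to HuBERT-unit logits.
n_units = 200                            # k-means clusters of the HuBERT unit
A = rng.standard_normal((128, n_units)) * 0.01

x = rng.standard_normal((99, 128))       # 99 frames of encoder features
q, q_1 = RVQSketch()(x)
logits = q_1 @ A                         # (99, 200); cross-entropy vs. unit labels
```

During training, q would go through the decoder for the resynthesis loss while the cross-entropy between `logits` and the frame-level HuBERT unit labels provides the distillation loss.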
Thank you for answering my questions so patiently, even though I asked them many times. I have learned a lot from you.
Is the HuBERT unit obtained by running k-means on HuBERT output features, as implemented in speech2unit at https://github.com/facebookresearch/speech-resynthesis?
Also, when I use that method with a speech waveform of shape (1, 32000), the output of the first RVQ layer has shape (1, 100, 128), while the HuBERT features have shape (1, 99, 768) and the HuBERT units have shape (1, 99). Is this the right way to get the HuBERT units?