Closed. 0417keito closed this issue 4 months ago.
Yes, our HuBERT units are extracted by running k-means on the features from HuBERT layer 9. This is the standard way to obtain HuBERT units, and it is also the implementation used in fairseq. You can address the issue you mentioned by padding the input waveform with zeros.
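As a minimal sketch of the zero-padding workaround: assuming a HuBERT-style front end with a 20 ms hop (320 samples at 16 kHz), right-padding the waveform restores the expected frame count. The exact padding amount depends on your model's convolutional front end, so treat the numbers here as illustrative.

```python
import numpy as np

def pad_for_hubert(wav: np.ndarray, hop: int = 320) -> np.ndarray:
    """Right-pad a (1, T) waveform with zeros so a HuBERT-style front end
    (hop = 320 samples = 20 ms at 16 kHz) yields one frame per hop.

    HuBERT's strided convolutions consume some context at the edges, so an
    input whose length is an exact multiple of the hop can still come out
    one frame short (e.g. 99 frames instead of 100 for 32000 samples);
    zero-padding supplies the missing context.
    """
    t = wav.shape[-1]
    # round up to a multiple of the hop, then add half a hop of extra context
    target = ((t + hop - 1) // hop) * hop + hop // 2
    return np.pad(wav, ((0, 0), (0, target - t)))

wav = np.zeros((1, 32000))       # 2 s of 16 kHz audio
padded = pad_for_hubert(wav)
print(padded.shape)              # (1, 32160)
```

With this padding, the HuBERT feature sequence and the RVQ output should both have 100 frames for a 2-second input, so the pseudolabels and the quantized features line up one-to-one.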
Thank you for telling us! This Tokenizer and the paper have helped me a lot!
Sorry for the repeated questions about distillation with HuBERT units as pseudolabels. The paper says the cross-entropy is taken over the quantised output q_t of the first RVQ layer, using a projection matrix A. How do I obtain the quantised output of the first RVQ layer? Also, the HuBERT unit published in speech-resynthesis has at most 200 clusters; did you train your own HuBERT unit model separately from that one?
During training, the RVQ outputs both the quantized features summed across all layers (q) and the quantized output of the first layer (q_1) at the same time (see the code). For pseudolabel prediction, we feed q to the decoder for speech resynthesis and feed q_1 to a softmax classifier that predicts the pseudolabels. The projection matrix A is in fact the weight matrix of that softmax classifier.
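A minimal sketch of this setup, not the authors' actual code: each RVQ layer quantizes the residual left by the previous layers, the layer outputs are summed into q, the first layer's output q_1 is kept separately, and a linear projection A (the softmax classifier's weight matrix) maps q_1 to HuBERT-unit logits. All shapes and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class RVQSketch:
    """Toy residual vector quantizer: each layer quantizes the residual of
    the previous layers and returns both the summed output q and the
    first-layer output q_1."""

    def __init__(self, n_layers: int = 8, codebook_size: int = 1024, dim: int = 128):
        self.codebooks = [rng.standard_normal((codebook_size, dim))
                          for _ in range(n_layers)]

    def __call__(self, x: np.ndarray):
        # x: (T, dim) frame-level encoder features
        residual, q, q_1 = x, np.zeros_like(x), None
        for i, cb in enumerate(self.codebooks):
            # nearest codeword per frame (squared Euclidean distance)
            idx = np.argmin(((residual[:, None] - cb[None]) ** 2).sum(-1), axis=1)
            quantized = cb[idx]
            if i == 0:
                q_1 = quantized          # first-layer output, used for pseudolabels
            q = q + quantized
            residual = residual - quantized
        return q, q_1

# Softmax classifier: projection matrix A maps q_1 to HuBERT-unit logits.
n_units = 200                            # k-means clusters of the HuBERT unit
A = rng.standard_normal((128, n_units)) * 0.01

x = rng.standard_normal((99, 128))       # 99 frames of encoder features
q, q_1 = RVQSketch()(x)
logits = q_1 @ A                         # (99, 200); cross-entropy vs. unit labels
```

During training, q would go through the decoder for the resynthesis loss while the cross-entropy between `logits` and the frame-level HuBERT unit labels provides the distillation loss.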
Thank you for answering my questions so patiently, even though I asked them many times. I have learned a lot from you.
Is the HuBERT unit obtained by running k-means on HuBERT output features, as implemented in speech2unit at https://github.com/facebookresearch/speech-resynthesis?
Also, when I use that method with a speech waveform of shape (1, 32000), the output of the first RVQ layer has shape (1, 100, 128), while the HuBERT features have shape (1, 99, 768) and the HuBERT units have shape (1, 99). Is this the right way to get the HuBERT units?