Closed · huutuongtu closed this issue 4 months ago
Hi @huutuongtu, yep, that might work quite well! It should be fairly easy to experiment with without any retraining, since all you need to do is pass new arguments to the `match(...)` function, whose signature is `match(y_seq: Tensor, matching_set: Tensor, synth_set: Tensor)`. Namely, you can keep the `synth_set` as the default WavLM SSL features, but for the `y_seq` (query sequence) and `matching_set` (matching set) you can use the features from the ASR fine-tuned WavLM.
If you try it out, I'd be interested to hear how it goes!
Hmm, I've tried the pretrained ASR models wav2vec2-base-100h from Facebook and the WavLM patrickvonplaten/wavlm-libri-clean-100h-base-plus, but the results aren't satisfactory (too noisy) :D. Perhaps retraining the vocoder on the semantic features could improve the results. Here is a link to some samples: https://drive.google.com/drive/folders/19lVYEi20iOzhxDXv1fiXA0PXtlvV7aHx?usp=sharing
Have you tried using a WavLM that has been fine-tuned on an ASR dataset to extract semantic features for the kNN query, instead of using the SSL features directly? That is, use kNN only to obtain the timestamps, then take the reference's WavLM SSL features at those timestamps to generate the output.
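A rough sketch of that timestamps-only variant, assuming frame-aligned ASR and SSL features for the same reference audio (all names and dimensions here are hypothetical, not the project's API): kNN runs entirely in the ASR feature space, and its output is just the neighbour indices, which are then used to index the plain SSL features.

```python
import torch

def knn_timestamps(query_seq, matching_set, topk=4):
    # kNN in the ASR-feature space, but return only the nearest-neighbour
    # frame indices ("timestamps"), not the features themselves.
    q = torch.nn.functional.normalize(query_seq, dim=-1)
    m = torch.nn.functional.normalize(matching_set, dim=-1)
    dists = 1 - q @ m.T                                # (T_query, T_ref)
    return dists.topk(k=topk, largest=False).indices   # (T_query, topk)

# Hypothetical frame-aligned features for one reference speaker:
# ref_asr from the ASR fine-tuned WavLM, ref_ssl from the vanilla WavLM
# layer the vocoder was trained on.
ref_asr = torch.randn(200, 768)
ref_ssl = torch.randn(200, 1024)
query_asr = torch.randn(80, 768)

idx = knn_timestamps(query_asr, ref_asr, topk=4)
# Select the reference's SSL features at those timestamps, so the vocoder
# still sees the feature distribution it was trained on.
converted = ref_ssl[idx].mean(dim=1)                   # (80, 1024)
```

The point of this variant is that the vocoder's input never leaves the original SSL feature space; only the matching happens in the ASR space.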