Ashraf-Ali-aa opened 1 year ago
Yes, it can make the model more robust to noise and reverberation.
I don't see this doing much. If applied to the WavLM training then sure, but since the content model is locked and can't be trained, I don't think this would have much of an effect on the VC model.
The content is obtained by WavLM plus the bottleneck extractor, so I think that if it is trained with noisy speech, the bottleneck extractor will learn to extract clean content from the noisy WavLM features: noisy wav -> WavLM -> noisy SSL feature -> bottleneck extractor -> clean content. (Currently: SR-augmented wav -> WavLM -> SSL feature -> bottleneck extractor -> content.)
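To make that concrete, here is a minimal sketch of what the noise-augmented forward pass could look like, assuming a frozen WavLM wrapped as `wavlm` and a trainable `bottleneck_extractor`; the function name, the fixed SNR, and the simple power-based mixing are my own placeholders, not this repo's actual code:

```python
import torch

def forward_with_noise_aug(clean_wav, noise_wav, wavlm, bottleneck_extractor, snr_db=10.0):
    """Mix background noise into the input, then run the frozen WavLM + trainable extractor."""
    # Assumes the noise clip is at least as long as the speech clip.
    noise_wav = noise_wav[..., : clean_wav.shape[-1]]

    # Scale the noise so the mixture sits at roughly `snr_db` dB SNR.
    speech_power = clean_wav.pow(2).mean()
    noise_power = noise_wav.pow(2).mean() + 1e-8
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy_wav = clean_wav + scale * noise_wav

    with torch.no_grad():             # WavLM stays frozen
        noisy_ssl = wavlm(noisy_wav)  # "noisy" SSL features

    # Only the bottleneck extractor learns; the hope is that it learns to
    # discard the noise and keep the clean linguistic content.
    return bottleneck_extractor(noisy_ssl)
```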
I think a better way of extracting speech content might be to use WHAMR! to add artificial reverberation along with background noise to the training dataset used for WavLM content extraction. This would be useful since it could extract content from audio that is not studio grade. This is the same technique Whisper AI uses for ASR:
https://wham.whisper.ai/
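For illustration only, here is a rough sketch of that kind of WHAMR!-style augmentation (reverberate, then add noise at a random SNR); the function name, SNR range, and the assumption that the room impulse response and noise clips are already loaded as arrays at the same sample rate are mine, and this is not the WHAMR! scripts or this repo's actual pipeline:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverb_and_noise(clean, rir, noise, snr_db_range=(5.0, 20.0), rng=None):
    """Convolve clean speech with a room impulse response, then mix in noise."""
    rng = rng or np.random.default_rng()

    # 1) Artificial reverberation: convolve with a normalized room impulse response.
    rir = rir / (np.abs(rir).max() + 1e-8)
    reverbed = fftconvolve(clean, rir, mode="full")[: len(clean)]

    # 2) Background noise at a random SNR, WHAM!-style.
    snr_db = rng.uniform(*snr_db_range)
    noise = noise[: len(reverbed)]
    speech_power = np.mean(reverbed ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverbed + scale * noise
```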