Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

We provide a PyTorch implementation of the paper Real Time Speech Enhancement in the Waveform Domain, in which we present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities.
I am wondering what the best approach would be to adapt the denoiser for multi-channel audio. I have a four-microphone array that I would like to apply the denoiser to as a pre-processing step.

Can the model.chin and model.chout parameters be changed when performing inference with a network that has been trained on only one channel? Will the inference/forward step adapt if the input tensor contains multiple channels of audio (all of the same frame size)? I have modified the live.py example to perform sequential forward passes (one for each channel; see the sketch below), but obviously this tanks the real-time performance.

Any advice on applying the denoiser to multi-channel audio would be appreciated.
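For reference, here is a minimal sketch of the sequential per-channel loop I mean (simplified from my live.py modification; it assumes the repository's pretrained dns64 loader and uses a dummy 4-channel tensor in place of my actual microphone stream):

```python
import torch
from denoiser import pretrained

# Pretrained mono model (chin=1, chout=1) shipped with the repo.
model = pretrained.dns64().eval()

# Dummy stand-in for one block from the 4-mic array: (channels, time).
multichannel = torch.randn(4, 16_000 * 2)  # e.g. 2 seconds at 16 kHz

enhanced_channels = []
with torch.no_grad():
    for channel in multichannel:              # one forward pass per channel
        out = model(channel.view(1, 1, -1))   # shape (batch=1, chin=1, time)
        enhanced_channels.append(out.squeeze(0).squeeze(0))

enhanced = torch.stack(enhanced_channels)     # back to (channels, time)
```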
Hi @TankyFranky,
You can definitely reconfigure the model to take more than one channel as input and output. However, if you go that way you will need to train a new model from scratch.
If you want to use the pre-trained models, then what you did (processing each channel independently) is the best/easiest way. Regarding the real-time constraints, maybe you can process the channels in parallel?
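For example (just a sketch, again assuming the pretrained dns64 model and a dummy 4-channel tensor), you can fold the channels into the batch dimension so all of them go through a single forward pass:

```python
import torch
from denoiser import pretrained

model = pretrained.dns64().eval()  # mono model: chin=1, chout=1

# Dummy 4-channel input, shaped (channels, time).
multichannel = torch.randn(4, 16_000 * 2)

# (channels, time) -> (channels, 1, time): treat each mic channel
# as one item of a batch of mono signals.
batch = multichannel.unsqueeze(1)

with torch.no_grad():
    enhanced = model(batch)        # one forward pass for all 4 channels

enhanced = enhanced.squeeze(1)     # back to (channels, time)
```

Whether this fits your real-time budget depends on your CPU, but it at least avoids the per-channel Python loop.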