Hi,
The model structure I implemented is designed for non-real-time denoising. If you want to perform real-time denoising, you need to do two things: 1. change the CNN and attention layers to meet the real-time requirement, and 2. instead of feeding a 32ms segment to the model on its own, feed a 2s segment in which the last 32ms is the audio you want to denoise and the first 1968ms serves as context.
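Point 2 can be sketched as a rolling buffer. This is only a minimal illustration, not code from this repository: the 16 kHz sample rate is an assumption (it is not stated in this thread), and `StreamingDenoiser` and `denoise_fn` are hypothetical names standing in for the trained model.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumption: the sample rate is not stated in this thread
CONTEXT_MS = 2000      # total segment length fed to the model
CHUNK_MS = 32          # newest audio that should be denoised

CONTEXT_LEN = SAMPLE_RATE * CONTEXT_MS // 1000  # 32000 samples at 16 kHz
CHUNK_LEN = SAMPLE_RATE * CHUNK_MS // 1000      # 512 samples at 16 kHz


class StreamingDenoiser:
    """Keeps the most recent 2s of audio and returns only the newest 32ms, denoised."""

    def __init__(self, denoise_fn):
        # denoise_fn is a hypothetical callable: it maps a (CONTEXT_LEN,) array
        # to a denoised array of the same shape (a stand-in for the model).
        self.denoise_fn = denoise_fn
        # Start with silence as context until real audio has filled the buffer.
        self.buffer = np.zeros(CONTEXT_LEN, dtype=np.float32)

    def process(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        if chunk.shape != (CHUNK_LEN,):
            raise ValueError(f"expected {CHUNK_LEN} samples, got shape {chunk.shape}")
        # Drop the oldest 32ms and append the new chunk at the end.
        self.buffer = np.concatenate([self.buffer[CHUNK_LEN:], chunk])
        denoised = self.denoise_fn(self.buffer)
        # The first 1968ms only served as context; return the last 32ms.
        return denoised[-CHUNK_LEN:]
```

Note that this re-runs the model on the full 2s for every new 32ms chunk, so it addresses the context problem only. Whether it is fast enough for real time still depends on point 1.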
Hi,
Thank you for such a quick response.
I appreciate the information. Could you guide me a bit more on what kind of changes I should make to the network (CNN and attention layers)? Perhaps you could point me to a good resource or a similar project where I can find more information?
Hi,
The CNN and attention layers I used in this model are non-causal, so you need to change them to causal layers. The paper "A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement" introduces the causal CNN layer.
You can also google 'causal CNN' and 'causal-attention' for more information.
For a similar project, you can look at MTFAA-Net.
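To illustrate what "causal" means here, below is a rough NumPy sketch, not the layers from this model or from MTFAA-Net. The helper names `conv1d`, `causal_mask` and `masked_softmax` are made up for the example. A causal convolution pads only on the past side of the time axis, and causal attention masks out keys that lie in the future of each query frame.

```python
import numpy as np


def conv1d(x, kernel, causal):
    """1-D correlation along time, output has the same length as x.

    causal=True pads only on the left, so y[t] depends on x[t-k+1 .. t].
    causal=False pads on both sides, so y[t] also depends on future samples.
    """
    k = len(kernel)
    pad = (k - 1, 0) if causal else ((k - 1) // 2, k // 2)
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[t:t + k], kernel) for t in range(len(x))])


def causal_mask(n_frames):
    """Boolean mask, True where query frame i may attend to key frame j (j <= i)."""
    return np.tril(np.ones((n_frames, n_frames), dtype=bool))


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, with masked positions set to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With these, changing future input samples leaves the earlier outputs of the causal convolution unchanged, and every attention row puts zero weight on later frames. In a real network the same idea applies per layer, usually along the time axis only.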
Hi,
Thanks for sharing the code.
I trained the model with audio lengths of 2s (I changed the n_frames parameter in the asat function accordingly). Also, the STFT is computed with a window of 32ms and 8ms of overlap.
I would like to perform real-time denoising on single frames of 32ms in length. However, at inference time the network only denoises properly with 2s segments and does a poor job with 32ms segments.
Do you know why I am experiencing this behaviour and how I could achieve my goal?