Azatiussss closed this issue 2 years ago
You are right, the model is non-causal. By "real-time" in the paper, we mean that the real-time factor (RTF) is less than 1. If you want to implement a causal model, you could try a dynamic average pooling (averaging only the previous frame and the current frame).
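To make the suggestion concrete, here is a minimal, framework-free sketch of the pooling variants. It is not code from the repository: the function names and the `frames[t][c]` layout are assumptions made for illustration. It contrasts the global mean over all frames (non-causal) with two causal alternatives: a running mean over frames `0..t`, and the "previous frame plus current frame" average mentioned above.

```python
from typing import List

# Assumed layout (hypothetical): frames[t][c] is feature c at time step t.
Frames = List[List[float]]


def global_mean_pool(frames: Frames) -> List[float]:
    """Non-causal: one mean per feature over ALL frames, future ones included."""
    if not frames:
        return []
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def causal_mean_pool(frames: Frames) -> Frames:
    """Causal: output[t] is the mean of frames[0..t] only (running mean)."""
    if not frames:
        return []
    out: Frames = []
    running = [0.0] * len(frames[0])
    for count, frame in enumerate(frames, start=1):
        # Accumulate the sum of everything seen so far, then normalise.
        running = [r + x for r, x in zip(running, frame)]
        out.append([r / count for r in running])
    return out


def prev_current_mean_pool(frames: Frames) -> Frames:
    """Causal: output[t] is the mean of frames[t-1] and frames[t].

    The first frame has no predecessor, so it is passed through unchanged.
    """
    out: Frames = []
    prev = None
    for frame in frames:
        if prev is None:
            out.append(list(frame))
        else:
            out.append([(p + x) / 2 for p, x in zip(prev, frame)])
        prev = frame
    return out


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(global_mean_pool(frames))
    print(causal_mean_pool(frames))
    print(prev_current_mean_pool(frames))
```

Note that the causal variants return one pooled vector per frame instead of a single vector for the whole utterance, so the attention weights computed from them would also vary over time. In PyTorch, the running-mean version could presumably be expressed with `torch.cumsum` along the frame axis divided by the frame count, but a model trained with global pooling would most likely need retraining after such a change.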
OK, thank you very much for the fast and detailed response!
Is the model causal? It seems that ChannelTimeSenseSELayer is used during both training and inference, and it takes the average pooling along the frames axis. Or am I supposed to process the audio chunk by chunk to get an honest result that uses only a limited amount of look-ahead?
https://github.com/hit-thusz-RookieCJ/FullSubNet-plus/blob/81e84b43d4f716cda1cd065d608f6c7b6758e791/speech_enhance/audio_zen/model/module/attention_model.py#L57-L71