RookieJunChen / FullSubNet-plus

The official PyTorch implementation of "FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement".
Apache License 2.0

Query regarding Causality of Model #11

Closed stonelazy closed 2 years ago

stonelazy commented 2 years ago

First of all thank you so much for making your implementation public. I have a query regarding causality of the model published.

The paper states that the proposed architecture is real-time, and I can even see the inferencer code dealing with chunks of audio. Yet I came across a comment saying that the model published in the paper / the implementation available here on GitHub is non-causal.
If it is indeed non-causal, would it be possible to list the changes needed to make it causal? Thanks.

RookieJunChen commented 2 years ago

By 'real-time' in the paper, we mean that the RTF (real-time factor) of the model is less than 1. The non-causality of the model is mainly caused by the avg-pooling step in the MulCA module, which uses information from the entire time period. So, if you want to implement a causal model, try replacing the original avg-pooling in MulCA with a dynamic avg-pooling that averages only the current frame and the frames before it?
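The suggestion above can be sketched as follows. This is a hedged illustration in NumPy, not the actual MulCA code; the shapes, function names, and toy input are assumptions for demonstration only:

```python
import numpy as np

# Hypothetical stand-in for the pooling step in MulCA; shapes and
# names are illustrative assumptions, not the repo's code.

def global_avg_pool(x):
    """Average over ALL time frames: every output depends on future
    frames, so this step is non-causal."""
    return x.mean(axis=1)                      # shape: (channels,)

def causal_avg_pool(x):
    """Running average over frames 0..t only: frame t never sees the
    future, so this step is causal."""
    csum = np.cumsum(x, axis=1)                # cumulative sum over time
    counts = np.arange(1, x.shape[1] + 1)      # frames seen so far
    return csum / counts                       # shape: (channels, time)

x = np.array([[1.0, 2.0, 3.0, 4.0]])           # one channel, four frames
print(global_avg_pool(x))                      # [2.5]
print(causal_avg_pool(x))                      # [[1.  1.5 2.  2.5]]
```

Note that the causal pool's value at the last frame equals the global pool, since by then all frames have been seen.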

stonelazy commented 2 years ago

Thanks for the lightning-fast reply!
I didn't realize until now that real-time and causality mean two different things.

So, if you want to implement a causal model.

My doubt was: if the current model isn't causal, how does denoising on a chunk of audio actually work properly?

RookieJunChen commented 2 years ago

First, chunk processing is a different concept from causality. Causality means that only the current frame and previous frames can be seen; a chunk is a time window containing several frames. My model can do streaming on a per-chunk basis because, within a chunk, I do not need to guarantee causality: the model can see the future frames within that chunk.
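As a hedged sketch of that distinction: the loop below processes audio chunk by chunk in arrival order (streaming), while the model is free to use every frame inside the current chunk (non-causal within the chunk). Here `enhance` is a toy placeholder, not the actual FullSubNet+ model:

```python
import numpy as np

def enhance(chunk):
    # Placeholder "model": subtracts the chunk mean. It reads the whole
    # chunk, i.e. future samples *within* the chunk, so it is non-causal
    # inside the chunk; but it never touches chunks that haven't arrived.
    return chunk - chunk.mean()

def stream_denoise(audio, chunk_len):
    # Chunks are processed strictly in arrival order, so latency is
    # bounded by one chunk length plus the model's processing time.
    out = [enhance(audio[i:i + chunk_len])
           for i in range(0, len(audio), chunk_len)]
    return np.concatenate(out)

audio = np.arange(8, dtype=float)              # toy "waveform"
print(stream_denoise(audio, chunk_len=4))
# [-1.5 -0.5  0.5  1.5 -1.5 -0.5  0.5  1.5]
```

A fully causal model would instead emit each output frame using only the frames received so far, with no per-chunk lookahead at all.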

stonelazy commented 2 years ago

1.) Agreed with the definitions. My understanding is that a non-causal model will not be able to do real-time streaming. By real-time streaming, I mean inferring on a 20 ms input. Can you comment on this?

My model can do the streaming operation of subchunk because within the chunk

2.) So, does this mean FullSubNet+ can be used for streaming inference, i.e. on a 20 ms input (or a relatively small input audio length for real-time inference)?
3.) If yes to 2), what exactly would you be unable to do given that the model is non-causal?

Thanks.

RookieJunChen commented 2 years ago

20 ms is too short, even less than one frame here. So this situation degenerates into streaming just one frame at a time, which would require a causal model? I have not dealt with this case; maybe you can try it yourself?
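To make the "20 ms is less than one frame" point concrete: with a 16 kHz sampling rate and a 512-sample STFT window (typical values for this kind of speech-enhancement model; the exact settings are in the paper), one analysis frame spans 32 ms, so a 20 ms buffer does not even cover a single frame. The numbers below are assumptions for illustration only:

```python
SAMPLE_RATE = 16_000   # Hz; assumed, see the paper for the actual value
WIN_LENGTH = 512       # STFT window in samples; assumed

frame_ms = WIN_LENGTH / SAMPLE_RATE * 1000
print(frame_ms)        # 32.0 -> one frame spans 32 ms
print(20 < frame_ms)   # True -> a 20 ms chunk is shorter than one frame
```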

stonelazy commented 2 years ago

So this situation degenerates into streaming just one frame at a time, which would require a causal model?

Based on my experience, in typical real-time/streaming use cases the model is expected to serve a small chunk, say from 20 ms (min) to 100 ms (max) of input audio. Anything longer than that and the jitter buffer in the client will add empty frames and distort the audio.

20 ms is too short, even less than one frame here.

a.) Can you mention the minimum number of frames / length of audio with which your model should be inferred?

b.) Can you please answer the 3rd question in my previous comment: if yes to 2), what is the limitation of the model being non-causal?

RookieJunChen commented 2 years ago

Regarding a), it is described in my paper. Regarding b), I have no hands-on experience with real-world deployment. If you are interested, you can explore it yourself.