By 'real-time' in the paper, we mean that the RTF (real-time factor) of the model is less than 1. The non-causality of the model is mainly caused by the avg-pooling step in the MulCA module, which uses information from the entire time period. So, if you want to implement a causal model, try replacing the original avg-pooling in MulCA with a dynamic avg-pooling that averages only the previous frames and the current frame.
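To make that suggestion concrete, here is a minimal sketch of such a dynamic (causal) average pooling in PyTorch. The `[batch, channels, time]` layout and the function name are my own assumptions for illustration, not taken from the MulCA code:

```python
import torch

def causal_avg_pool(x: torch.Tensor) -> torch.Tensor:
    """Running (causal) average over the time axis.

    x: [batch, channels, time]. Each output frame t is the mean of
    frames 0..t only, so no future frames are used.
    """
    cumsum = x.cumsum(dim=-1)
    counts = torch.arange(1, x.shape[-1] + 1, device=x.device, dtype=x.dtype)
    return cumsum / counts  # counts broadcasts over [batch, channels, time]
```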
Thanks for the lightning-fast reply!
I didn't realize until now that real-time and causality mean two different things.
So, if you want to implement a causal model.
My doubt was: if the current model isn't causal, then how does denoising on a chunk of audio actually work properly?
First, chunk processing should not be confused with the definition of causality. Causality means that only the information of the current frame and previous frames can be seen; a chunk is a time window containing several frames. My model can do streaming over sub-chunks because within a chunk I do not need to guarantee causality: the model is allowed to see the future frames inside that chunk.
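To illustrate the difference, here is a rough sketch of what chunk-wise streaming looks like (this is not the repository's Inferencer; the waveform-in/waveform-out `model(chunk)` call and the `chunk_samples` parameter are simplifying assumptions):

```python
import torch

def enhance_in_chunks(model, noisy: torch.Tensor, chunk_samples: int) -> torch.Tensor:
    """Chunk-wise streaming: the model may look at every frame inside a
    chunk (so it is not frame-level causal), but never sees later chunks.

    noisy: waveform of shape [1, num_samples].
    chunk_samples: chunk length in samples (e.g. a few hundred ms of audio).
    """
    outputs = []
    for start in range(0, noisy.shape[-1], chunk_samples):
        chunk = noisy[..., start:start + chunk_samples]
        with torch.no_grad():
            outputs.append(model(chunk))  # hypothetical enhance-one-chunk call
    return torch.cat(outputs, dim=-1)
```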
1.) Agreed with the definitions. My understanding is that a non-causal model will not be able to do real-time streaming. By real-time streaming, I mean inferring on a 20 ms input. Can you comment on this?
My model can do streaming over sub-chunks because within a chunk
2.) So, does this mean FullSubNet-Plus can be used for streaming inference, i.e. on a 20 ms input (or a relatively small input audio length) for real-time inference?
3.) If yes to 2, then what exactly can you not do when the model is non-causal?
Thanks.
20 ms is too short; it is even less than one frame here. So this situation is equivalent to degrading the model to stream just one frame at a time, i.e. into a form that requires causality? I have not dealt with this case; maybe you can try it yourself?
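For intuition on why 20 ms can be shorter than a single frame, here is a rough calculation. The 512-sample window, 256-sample hop and 16 kHz sample rate are assumptions based on a common FullSubNet-style STFT setup, not values read from the repository config:

```python
# Why a 20 ms request can be shorter than one STFT analysis frame.
sample_rate = 16000      # Hz (assumed)
win_length = 512         # samples per frame (assumed)
hop_length = 256         # samples between frames (assumed)

frame_ms = 1000 * win_length / sample_rate   # 32.0 ms per frame
hop_ms = 1000 * hop_length / sample_rate     # 16.0 ms between frames
request_ms = 20

print(f"one frame spans {frame_ms:.1f} ms, so a {request_ms} ms chunk "
      f"does not even cover one analysis window")
```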
So this situation is equivalent to degrading the model to stream just one frame at a time, i.e. into a form that requires causality?
Based on my exposure, in typical real-time/streaming use cases the model is expected to serve small chunks, say from a minimum of 20 ms to a maximum of 100 ms of input audio. For anything longer than that, the jitter buffer in the client will add empty frames and cause audio distortion.
20 ms is too short; it is even less than one frame here.
a.) Can you mention the minimum number of frames / length of audio with which your model should be inferred?
b.) Can you please answer the 3rd question in my previous comment: if yes to 2, what is the limitation of the model being non-causal?
About a), it is described in my paper. Regarding b), I have no specific experience with deploying it in production; if you are interested, you can explore it yourself.
First of all, thank you so much for making your implementation public. I have a query regarding the causality of the published model.
The paper proposes that the architecture is real-time, and I could even see the Inferencer code dealing with chunks of audio. Yet I came across a comment saying that the model published in the paper / the implementation available here on GitHub is non-causal.
In case it is indeed non-causal, would it be possible to list the changes that need to be made to make it causal? Thanks.