SongJinXue opened this issue 3 years ago
Hi Jinxue, thanks for your attention and feedback.

I guess the main reason for this difference is the lack of the hidden states and cell states of the LSTM. If you pursue frame-wise processing, there are two things to do:

- Besides feeding the features frame by frame, changing the `torch.nn.LSTM` class to the `torch.nn.LSTMCell` class is an essential step. Then, using a for-loop, feed the hidden states and cell states of the previous step into the current step (see the sketch after this comment).
- In addition, you need to modify the normalization method to support the frame-wise mode. To be specific, first calculate a mean value for each frame. Then, referring to `cumulative_laplace_norm`, use the previous mean values to smooth the current mean value and normalize the current frame's features.

Note that changing from the `torch.nn.LSTM` class to the `torch.nn.LSTMCell` class does not cause a performance reduction, as the former is just an encapsulation of the latter. In addition, I've tested the different normalization methods; at least, the performance of `cumulative_laplace_norm` and `offline_laplace_norm` (currently used) is nearly equal. There is another normalization method, named `forgetting_norm`, which updates the mean value of the current frame using only the feature context within a fixed-size window. So it may be more suitable for real scenarios, but the performance will be slightly worse.
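For reference, a minimal sketch of the frame-wise LSTM loop described above might look like the following. The class name `StreamingLSTM` and the layer/size arguments are placeholders for illustration, not the actual FullSubNet `sequence_model.py` code:

```python
import torch
import torch.nn as nn


class StreamingLSTM(nn.Module):
    """A minimal sketch: a stack of LSTMCells that carries states across frames.

    This is NOT the FullSubNet sequence model, just an illustration of
    replacing nn.LSTM with nn.LSTMCell plus an explicit for-loop.
    """

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 2):
        super().__init__()
        self.cells = nn.ModuleList(
            [nn.LSTMCell(input_size if i == 0 else hidden_size, hidden_size)
             for i in range(num_layers)]
        )

    def forward(self, x, states=None):
        # x: [batch, num_frames, input_size]
        batch, num_frames, _ = x.shape
        if states is None:
            # One (h, c) pair per layer, initialized to zeros.
            states = [(torch.zeros(batch, cell.hidden_size, device=x.device),
                       torch.zeros(batch, cell.hidden_size, device=x.device))
                      for cell in self.cells]

        outputs = []
        for t in range(num_frames):
            frame = x[:, t, :]
            for i, cell in enumerate(self.cells):
                h, c = cell(frame, states[i])  # previous (h, c) feeds the current step
                states[i] = (h, c)
                frame = h
            outputs.append(frame)

        # Return the states so the caller can reuse them for the next frame/chunk.
        return torch.stack(outputs, dim=1), states
```

At inference time, the caller keeps passing the returned `states` back in for the next frame or chunk, so the model sees the same history it would see with `nn.LSTM` over the full utterance.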
After these changes, should we retrain the model? Because I'm using the pretrained model. Thanks for the help.
Hi, this week I will release a pre-trained model with cumulative normalization.
Thanks. Can you push the code snippet for real-time frame-wise processing?
Hi, @Spelchure.
Q: After these changes, should we retrain the model? Because I'm using the pretrained model.
A: Here is a pre-trained FullSubNet using cumulative normalization. Its performance is rather close to that of the FullSubNet using offline normalization.
Q: Can you push the code snippet for real-time frame-wise processing?
A: Sorry, I don't have enough time recently, but I will release this frame-wise processing code in the next month. Before that, you could try to write it yourself. After downloading the FullSubNet that uses cumulative normalization, the two things you need to do are changing the `torch.nn.LSTM` to `torch.nn.LSTMCell` and adding a for-loop.
Thanks for the model and the advice.
I can't use the pretrained cumulative model after changing LSTM to LSTMCell for frame-wise processing. It gives an error: missing arguments and unexpected arguments in the model. Is it possible to use the cumulative model for inference only, without training? If it is possible, where am I going wrong? (I'm changing LSTM to LSTMCell in sequence_model.py)
I tested LSTM and LSTMCell; it did not help. Then I tried feeding the hidden states and cell states of the previous step into the current step, which works well.
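For what it's worth, the "missing/unexpected" errors are expected if only the module class is swapped: `nn.LSTM` stores its parameters as `weight_ih_l{k}`, `weight_hh_l{k}`, `bias_ih_l{k}`, `bias_hh_l{k}` (one set per layer), while `nn.LSTMCell` uses `weight_ih`, `weight_hh`, `bias_ih`, `bias_hh`. Since the shapes and gate ordering match, one option is to copy the pretrained weights over explicitly instead of retraining. A rough sketch, assuming a stack of unidirectional layers where `cells[k]` corresponds to layer `k` of the original LSTM (the function name and attribute layout below are placeholders, not the repository's code):

```python
import torch.nn as nn


def lstm_to_lstmcell_state(lstm: nn.LSTM, cells: nn.ModuleList) -> None:
    """Copy pretrained nn.LSTM parameters into a stack of nn.LSTMCell modules.

    Assumes cells[k] corresponds to layer k of a unidirectional LSTM.
    The parameter shapes and the i/f/g/o gate order are identical,
    so a plain copy is enough for inference -- no retraining needed.
    """
    lstm_state = lstm.state_dict()
    for k, cell in enumerate(cells):
        cell.load_state_dict({
            "weight_ih": lstm_state[f"weight_ih_l{k}"],
            "weight_hh": lstm_state[f"weight_hh_l{k}"],
            "bias_ih": lstm_state[f"bias_ih_l{k}"],
            "bias_hh": lstm_state[f"bias_hh_l{k}"],
        })
```

If the checkpoint is loaded directly into the modified model instead, the checkpoint's key names can be remapped in the same way before calling `load_state_dict`.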
Thanks for your advice. I fed the hidden states and cell states of the previous step into the current step and modified the normalization method referring to `cumulative_laplace_norm`. The real-time speech enhancement works as expected, but the performance is slightly worse. Looking forward to your frame-wise processing code.
Hi, Jinxue
Generally speaking, the change from `torch.nn.LSTM` to `torch.nn.LSTMCell` should not cause any performance degradation. Some trivial things that should be paid attention to:

- Make sure you are using the cumulative pre-trained model `cum_fullsubnet_best_model_218epochs.tar` on the release page.
- The cumulative norm that I released is written in a compact style, i.e., it computes the statistical mean value over all frames of an utterance in advance. You should separate this function into a frame-wise style. The point is basically to ensure that the current frame is normalized using the statistical mean value of all previous frames (a rough frame-wise sketch follows below).

You could try to confirm these trivial things, and if you have any further questions, please contact me. Of course, if the problem still exists, directly contributing your frame-wise code to this project on GitHub is very welcome.
Hi SongJinXue, can you share the real-time code? I would really appreciate it. Many thanks for considering my request.
> I tested LSTM and LSTMCell; it did not help. Then I tried feeding the hidden states and cell states of the previous step into the current step, which works well.

I have also tried this part, and there is no difference between LSTM and LSTMCell here, but the result of the frame-by-frame processing is unsatisfactory. Could you please provide the implementation of this part? Thank you.
Hello author, could you please share the revised code for this part of the streaming inference? Thank you very much.
Real-time speech enhancement with a block length of 32 ms and a block shift of 8 ms sounds poor, but enhancing a single full audio file works well. What causes this? How can I improve it?
Noisy: (audio attachment)
Block length 32 ms, block shift 8 ms, real-time enhancement: (audio attachment)
Single full-audio enhancement: (audio attachment)