Rikorose / DeepFilterNet

Noise suppression using deep filtering
https://huggingface.co/spaces/hshr/DeepFilterNet2

Question about deepfilter2 code #119

Closed Liu-tj closed 1 year ago

Liu-tj commented 2 years ago

Thanks for your awesome work! I am confused about pad_feat/pad_spec and the df_op function, so I am opening this issue to check my understanding. First, I tried testing your trained model; in the class DfNet() in deepfilternet2.py:

self.pad_feat = nn.ConstantPad2d((0, 0, -p.conv_lookahead, p.conv_lookahead), 0.0)
self.pad_spec = nn.ConstantPad3d((0, 0, 0, 0, -p.df_lookahead, p.df_lookahead), 0.0)
self.pad_out = nn.Identity()

and for lines 430-432 and 444-445:

feat_erb = self.pad_feat(feat_erb)
feat_spec = self.pad_feat(feat_spec)
e0, e1, e2, e3, emb, c0, lsnr = self.enc(feat_erb, feat_spec)

spec_f = self.pad_spec(spec)
spec_f = self.df_op(spec_f, df_coefs)

My questions are: a. In nn.ConstantPad2d/3d, the negative padding -p.df_lookahead = -2 removes data, so are 2 frames of information missing during training? b. Is self.df_op a causal or non-causal operation? For example, is the first frame computed using 0,0,0,0 and 3 frames?

Thanks!

Rikorose commented 2 years ago

Hi, thanks for your interest in DeepFilterNet.

a. Yes, it removes data at the end of the signal, depending on the lookahead, essentially rotating the input data. If you use a lookahead of e.g. 2 frames, it will zero-pad 2 frames on the right side and truncate 2 frames on the left side. So apart from the border, the whole signal is delayed by the specified lookahead. b. The whole model is causal and introduces an algorithmic delay of max(conv_lookahead, df_lookahead) frames.
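The shifting behavior can be seen in a minimal sketch (assuming the (batch, channels, time, freq) tensor layout; the frame values are purely illustrative):

```python
import torch
import torch.nn as nn

lookahead = 2
# ConstantPad2d takes (left, right, top, bottom); negative padding truncates.
# With the layout (batch, channels, time, freq), this truncates `lookahead`
# frames at the start of the time axis and zero-pads `lookahead` at the end.
pad_feat = nn.ConstantPad2d((0, 0, -lookahead, lookahead), 0.0)

x = torch.arange(6.0).reshape(1, 1, 6, 1)  # frames 0..5
y = pad_feat(x)
print(y.flatten().tolist())  # [2.0, 3.0, 4.0, 5.0, 0.0, 0.0]
```

The shape is unchanged, but frame t of the output now holds input frame t + lookahead, which is what aligns the features with a delayed target.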

andyweiqiu commented 2 years ago

@Rikorose Hello, I previously implemented real one-frame-in, one-frame-out (real-time) inference, but I had to set conv_lookahead/df_lookahead = 0. Could you change these two parameters and train a pre-trained model?

Rikorose commented 2 years ago

I might eventually, but I will not give an ETA. If you need it sooner rather than later, please train a model yourself.

andyweiqiu commented 2 years ago

OK, thank you! Looking forward to your new pre-trained model.

stonelazy commented 2 years ago

@Rikorose Hello, I previously implemented real one-frame-in, one-frame-out (real-time) inference, but I had to set conv_lookahead/df_lookahead = 0. Could you change these two parameters and train a pre-trained model?

@andyweiqiu Is this proprietary code? If not, is it by any chance possible to publish your code for this?

andyweiqiu commented 2 years ago

@Rikorose Hello, I previously implemented real one-frame-in, one-frame-out (real-time) inference, but I had to set conv_lookahead/df_lookahead = 0. Could you change these two parameters and train a pre-trained model?

@andyweiqiu Is this proprietary code? If not, is it by any chance possible to publish your code for this?

Sorry, it is proprietary. I trained the model myself with conv_lookahead/df_lookahead = 0 and then implemented inference in pure C++ on iOS. You can implement a streaming conv and GRU based on the Python code. conv_lookahead/df_lookahead = 0 is required for the model to run in real time; otherwise it is difficult to implement.
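For reference, a streaming time-axis convolution of the kind described can be sketched as follows. This is a hypothetical minimal implementation, not DeepFilterNet's actual code; the class name, (batch, channels, time, freq) layout, and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class StreamingConv2d(nn.Module):
    """Causal time-axis conv that processes one frame at a time by
    caching the previous (kernel_t - 1) input frames."""
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_f=3):
        super().__init__()
        self.kernel_t = kernel_t
        # no time padding here; causality comes from the frame cache
        self.conv = nn.Conv2d(in_ch, out_ch, (kernel_t, kernel_f),
                              padding=(0, kernel_f // 2))
        self.cache = None

    def forward(self, frame):  # frame: (batch, ch, 1, freq)
        if self.cache is None:  # zero history before the first frame
            b, c, _, f = frame.shape
            self.cache = frame.new_zeros(b, c, self.kernel_t - 1, f)
        x = torch.cat([self.cache, frame], dim=2)  # (b, c, kernel_t, f)
        self.cache = x[:, :, 1:]                   # keep last kernel_t-1 frames
        return self.conv(x)                        # (b, out_ch, 1, f)
```

Feeding frames one by one through this module reproduces an offline convolution that is zero-padded only on the past side, which is exactly the lookahead = 0 setting. A GRU streams naturally by carrying its hidden state between calls.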

ZengBinky commented 2 years ago

Hi, thanks for your interest in DeepFilterNet.

a. Yes it removes data at the end of the signal, depending on the lookahead, essentially rotating the input data. If you are using a lookahead of e.g. 2 frames, it will zero pad 2 frames on the right side and truncate 2 frames on the left side. So apart from the boarder, the whole signal will be delayed by the specified lookahead. b. The whole model is causal and will introduce an algorithmic delay of max(conv_lookahead, df_lookaead) frames.

Hi @Rikorose, I think the algorithmic delay should be conv_lookahead (2 frames) + df_lookahead (2 frames) = 4 frames, because df_op is applied to the result of the first stage. Is that right? For the final output of a frequency bin in the current frame, the df_coefs (1x5) need a lookahead of 2 frames, and the stage-1 enhanced spectrogram consists of the past two frames, the current frame, and the next two frames. Note that the last of these frames (the second frame after the current one) in turn needs its own next two frames for the 3x3 convolution in the first convolution layer.

Rikorose commented 2 years ago

Hi ZengBinky, you are right, the model mistakenly needs two more frames of lookahead. I have some new models coming up though:

  1. A model without any conv_lookahead and df_lookahead resulting in a total algorithmic latency including STFT of 20 ms.
  2. A slightly modified architecture where DF is applied on the noisy spectrum and does not depend on stage 1.
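The latency arithmetic can be sketched as follows, assuming DeepFilterNet's default STFT settings of a 960-sample window and 480-sample hop at 48 kHz (these values and the chaining of the two lookaheads are assumptions based on the discussion above; overlap-add latency equals the window length):

```python
sr = 48_000
fft_size, hop_size = 960, 480        # assumed defaults: 20 ms window, 10 ms hop
conv_lookahead, df_lookahead = 2, 2

stft_latency_ms = fft_size * 1000 / sr   # overlap-add latency = window: 20.0 ms
frame_ms = hop_size * 1000 / sr          # 10.0 ms per frame
# per ZengBinky's point, the two stages are chained, so the lookaheads add
lookahead_ms = (conv_lookahead + df_lookahead) * frame_ms
total_ms = stft_latency_ms + lookahead_ms
print(total_ms)  # 60.0 ms; with both lookaheads at 0 this drops to 20.0 ms
```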
ZengBinky commented 2 years ago

Hi ZengBinky, you are right, the model mistakenly needs two more frames of lookahead. I have some new models coming up though:

  1. A model without any conv_lookahead and df_lookahead resulting in a total algorithmic latency including STFT of 20 ms.
  2. A slightly modified architecture where DF is applied on the noisy spectrum and does not depend on stage 1.
Wouldn't that modification theoretically increase the learning difficulty of the DF coefficients? Does two-stage speech enhancement make more sense? An alternative solution is that DF still depends on stage 1 and only uses the past 4 frames and the current frame; the performance is similar.

I'm also curious: if the df_coefs network does not depend on stage 1, as you said, how does the model choose between the enhanced spectra of stage 1 and stage 2 for the lower 96 frequency bins?

Rikorose commented 2 years ago

In this case, it does not use the enhanced spectrum from stage 1, but applies DF to the noisy spectrum.
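As a sketch of what applying DF directly to the noisy spectrum means (the function name, tensor layout, and zero-padding convention here are assumptions for illustration, not the repository's exact code): each output time-frequency bin is a complex linear combination of df_order neighboring frames.

```python
import torch

def df_op(spec, coefs, df_order=5, lookahead=2):
    """Deep filtering sketch.
    spec:  (batch, time, freq), complex input spectrum
    coefs: (batch, time, df_order, freq), complex filter coefficients
    """
    b, t, f = spec.shape
    past = df_order - 1 - lookahead  # history frames per output frame
    padded = torch.cat([spec.new_zeros(b, past, f),
                        spec,
                        spec.new_zeros(b, lookahead, f)], dim=1)
    out = torch.zeros_like(spec)
    for i in range(df_order):  # tap i == past is the current frame
        out = out + coefs[:, :, i] * padded[:, i:i + t]
    return out
```

With df_order = 5 and lookahead = 2, each output frame mixes 2 past frames, the current frame, and 2 future frames; predicting the coefficients from the noisy spectrum removes the dependency on stage 1.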

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.