eagomez2 / upf-smc-speech-enhancement-thesis

Deep Noise Suppression for Real Time Speech Enhancement in a Single Channel Wide Band Scenario
Creative Commons Attribution 4.0 International

Remove the 10 s sample duration limit when running inference with pretrained models #4

Closed zuowanbushiwo closed 2 years ago

zuowanbushiwo commented 2 years ago

Hi @eagomez2, thanks for your open source work; it has been very helpful to me. When I use the following command to run inference on my own data, whose files have varying lengths:

 python predict.py <input_dir>  <output_dir> pretrained_models/DTLN_BiLSTM_500h.tar -m dtln_bilstm

only the first 10 s of each file are processed.

When I try to modify this code, it always gives an error: https://github.com/eagomez2/upf-smc-speech-enhancement-thesis/blob/f03395fecef5e8834247499f4dc5820200d727f4/src/predict.py#L100-L101

    input, _pair(output_size), _pair(kernel_size), _pair(dilation), _pair(padding), _pair(stride)
RuntimeError: Given output_size=(1, 160000), kernel_size=(1, 512), dilation=(1, 1), padding=(0, 0), stride=(1, 128), expected size of input's dimension 2 to match the calculated number of sliding blocks 1 * 1247 = 1247, but got input.size(2)=3747.

Is there any way to fix it? Do I need to retrain?

Looking forward to your reply. All the best!

eagomez2 commented 2 years ago

Hi @zuowanbushiwo ,

This can be changed indeed. It was set up like that because those are the default values for the DNS Challenge dataset that I used to train the model. To change this behavior, you need two things:

1. Change line 101 to match your desired length: `frames = sample_rate * desired_duration`.
2. Tell the model to expect audio of a different length. Both DTLN and CRUSE have a `sample_duration` parameter to control this.

If you change this value (either to a constant or dynamically, to adjust to your audio length on the fly), you should not need to retrain the model.
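To see why the original command failed, it helps to work out the sliding-block arithmetic from the `RuntimeError` above. The helper names below are illustrative only (they are not part of the repository's code); the kernel size, stride, and sample rate are taken from the traceback and the DNS Challenge defaults:

```python
def num_blocks(frames, kernel_size=512, stride=128):
    """Number of sliding blocks torch.nn.functional.unfold/fold
    produce for a 1-D signal of `frames` samples."""
    return (frames - kernel_size) // stride + 1

def frames_for_blocks(blocks, kernel_size=512, stride=128):
    """Inverse: the signal length that yields `blocks` sliding blocks."""
    return (blocks - 1) * stride + kernel_size

sample_rate = 16000

# The hardcoded 10 s limit expects 1 * 1247 sliding blocks ...
assert num_blocks(10 * sample_rate) == 1247

# ... but the input actually had 3747 blocks, i.e. a 30 s file:
assert frames_for_blocks(3747) == 30 * sample_rate

# The fix: derive `frames` from the real duration instead of a constant.
desired_duration = 30  # seconds, read from the input file
frames = sample_rate * desired_duration
```

In other words, the `3747` in the error message corresponds to a 30 s input being folded against a fixed 10 s output size, which is exactly the mismatch that deriving `frames` dynamically removes.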

zuowanbushiwo commented 2 years ago

Hi @eagomez2, thanks a lot for the guidance; I now know how to modify it. By the way, is it possible to add a chunk-by-chunk (chunk size equal to hop_size) real-time inference feature? Thanks! Best wishes

eagomez2 commented 2 years ago

Hi @zuowanbushiwo, it is possible. You could do it, for example, using sounddevice to receive the audio in real time frame by frame. The model as-is cannot be plugged in directly to process audio that way, but with some changes you should be able to reuse the trained weights.

The repo of the original DTLN has this implemented for TensorFlow; you can check it out in more detail here. A similar procedure can be done for CRUSE, although the necessary FFT/iFFT configuration may have to be slightly different.
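A minimal sketch of such chunk-by-chunk streaming in plain Python, assuming the hop and window sizes from the traceback above. `process_frame` is a stand-in for the model call, not the repository's actual API:

```python
HOP_SIZE = 128   # one new chunk per inference call (the model's hop size)
WIN_SIZE = 512   # analysis window the model sees on each call

def stream_chunks(signal, process_frame):
    """Feed `signal` to `process_frame` one hop at a time, keeping a
    sliding window of the last WIN_SIZE samples as model context."""
    window = [0.0] * WIN_SIZE  # context starts as silence
    out = []
    for start in range(0, len(signal) - HOP_SIZE + 1, HOP_SIZE):
        chunk = signal[start:start + HOP_SIZE]
        # slide the context window forward by one hop
        window = window[HOP_SIZE:] + list(chunk)
        # the model sees the whole window but emits only one new hop
        out.extend(process_frame(window)[-HOP_SIZE:])
    return out

# With an identity "model", streaming reproduces the input chunk by chunk:
x = [float(i) for i in range(4 * HOP_SIZE)]
assert stream_chunks(x, lambda w: w) == x
```

In a real-time setup, a sounddevice input stream callback would deliver each `chunk` from the microphone instead of slicing it from a pre-recorded signal.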

zuowanbushiwo commented 2 years ago

Hi @eagomez2, that's very kind of you; you really did me a great favor! Thanks!