facebookresearch / denoiser

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

We provide a PyTorch implementation of the paper Real Time Speech Enhancement in the Waveform Domain, in which we present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noise, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve the model's performance and its generalization abilities.

Question about the frame length and frame shift #44

Open XIEchoAH opened 3 years ago

XIEchoAH commented 3 years ago

Hello, sorry to disturb you. I read the paper and the code, but I am still confused about the frame length and frame shift of the audio.

In the Training paragraph, it says: "With this setup, the causal DEMUCS processes audio with a frame size of 37 ms and a stride of 16 ms."

Why are the frame length and frame shift 37 ms and 16 ms? How are they calculated?

Hope to hear from you.

adefossez commented 3 years ago

Frame length is how much audio you need to process to output the first bit of audio. The formula is obtained from calculating the receptive field of 5 stacked convolutions with kernel size 8 and stride 4. You need this receptive field to be full to compute the first LSTM step. Then the stride is given by the overall stride of 5 stacked convolutions with a stride of 4, so 4^5.

Note that with 37 ms of input audio, the model will only output 16 ms, due to the overlap in the output of the transposed convolutions. Then every time it receives another 16 ms of input, it will output another 16 ms of audio.
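
For concreteness, here is a minimal sketch of that arithmetic, assuming the defaults discussed in this thread and in the paper (5 encoder layers, kernel size 8, stride 4, 4x internal upsampling, 16 kHz audio):

```python
# Frame size and stride of the causal model, assuming: 5 encoder layers,
# kernel size 8, stride 4, 4x internal upsampling, 16 kHz input audio.
kernel, stride, depth = 8, 4, 5
resample, sample_rate = 4, 16_000

# Overall stride (jump) of the stacked convolutions: 4**5 upsampled samples.
overall_stride = stride ** depth                                            # 1024

# Receptive field of the stack, via the geometric series
# 1 + (kernel - 1) * (stride**depth - 1) / (stride - 1).
receptive_field = 1 + (kernel - 1) * (stride ** depth - 1) // (stride - 1)  # 2388

# Convert back to the original sample rate by undoing the 4x upsampling.
frame_ms = receptive_field / resample / sample_rate * 1000   # ~37.3 ms frame size
stride_ms = overall_stride / resample / sample_rate * 1000   # 16.0 ms stride
print(frame_ms, stride_ms)
```

The 597 samples mentioned below (from issue #82) are simply receptive_field / resample = 2388 / 4.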

Hasko1415 commented 2 years ago

Hi, sorry to ask the same question here again.

I followed your explanation in issue #82, which derived the 37 ms latency (597 / 16k) from the U-Net structure.

But I am still confused about the 16 ms length of the output. The formula given above is 4 (stride) ^ 5 (depth). Could you explain this formula in more detail? How does it actually work?

Thank you in advance for your reply.

stonelazy commented 2 years ago

According to this doc that I read on receptive field arithmetic, 4**5 does show up in the calculation below, as the jump (overall stride) of the last conv layer:

input_layer: n = 1; j = 1; r = 1; start = 0.5
conv1: n = -1; j = 4; r = 8; start = 5.0
conv2: n = -2; j = 16; r = 36; start = 27.0
conv3: n = -2; j = 64; r = 148; start = 99.0
conv4: n = -2; j = 256; r = 596; start = 387.0
conv5: n = -2; j = 1024; r = 2388; start = 1539.0
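
For reference, a small loop that reproduces the j (jump / overall stride) and r (receptive field) columns above, assuming kernel size 8 and stride 4 for every layer (n and start also depend on the input length and padding, so they are left out):

```python
# Per-layer jump (output stride) and receptive field for 5 stacked convolutions
# with kernel size 8 and stride 4, matching the j and r values listed above.
kernel, stride = 8, 4
j, r = 1, 1                        # input layer: j = 1, r = 1
for layer in range(1, 6):
    r = r + (kernel - 1) * j       # receptive field grows by (kernel - 1) * current jump
    j = j * stride                 # jump is multiplied by the stride at each layer
    print(f"conv{layer}: j = {j}; r = {r}")
# conv1: j = 4; r = 8
# conv2: j = 16; r = 36
# conv3: j = 64; r = 148
# conv4: j = 256; r = 596
# conv5: j = 1024; r = 2388
```
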
adefossez commented 2 years ago

The model upsamples the audio by a factor of 4 before processing it. Thus, at the original sample rate, the overall stride is 4^5 / 4 = 256 samples, which gives 256 / 16000 = 16 ms.
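
As a quick sanity check of that arithmetic (assuming 16 kHz audio and the 4x internal upsampling):

```python
upsample, sample_rate = 4, 16_000
overall_stride = 4 ** 5                      # stride of the 5 stacked convs, in upsampled samples
stride_samples = overall_stride // upsample  # 256 samples at the original 16 kHz rate
print(stride_samples / sample_rate * 1000)   # 16.0 -> 16 ms of new audio per step
```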