facebookresearch / denoiser

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

We provide a PyTorch implementation of the paper Real Time Speech Enhancement in the Waveform Domain, in which we present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noise, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities.

Resample kernel #58

Open zyy341 opened 3 years ago

zyy341 commented 3 years ago

Hi, in the resample kernel part, when generating the Hann window to truncate the sinc function, I am confused about why you create a 4 × zeros Hann window and then select only the odd part. Is there any difference from directly creating a 2 × zeros Hann window? https://github.com/facebookresearch/denoiser/blob/e4d61a156456cbe88f238d7338936c00fd654c3f/denoiser/resample.py#L52-L53

adefossez commented 3 years ago

I don't think it is equivalent. I'm using the formula from this method: https://ccrma.stanford.edu/~jos/resample/. It requires explicitly creating a window of a given size. Then the sinc formula gives that the contribution for the even time steps is zero, so I just skip those entirely by taking only the odd time steps, and the odd terms are the ones given by the sinc. It is not really the same as taking a smaller Hann window.
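To make the difference concrete, here is a minimal pure-Python sketch (not the repository's code) that compares the odd samples of a symmetric Hann window of length 4 * zeros + 1 with a symmetric Hann window of length 2 * zeros built directly. The helper `hann` and the small value of `zeros` are assumptions chosen for illustration; the formula is the usual symmetric Hann definition.

```python
import math

def hann(n):
    """Symmetric Hann window of length n: 0.5 * (1 - cos(2*pi*k / (n - 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / (n - 1))) for k in range(n)]

zeros = 4  # hypothetical small value, chosen only to keep the printout short

win_odd = hann(4 * zeros + 1)[1::2]  # odd samples of the long window
win_small = hann(2 * zeros)          # shorter window built directly

# Both lists have 2 * zeros entries, but the values differ:
# the directly built window starts at exactly 0, the odd samples do not.
for a, b in zip(win_odd, win_small):
    print(f"{a:.4f}  {b:.4f}")
```

Both windows have the same number of points, but the odd samples of the long window sit half a step off the grid of the short one, so the two sets of weights do not coincide.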

stolpa4 commented 3 years ago

@adefossez Sorry for the stupid question, but it's not quite obvious what you mean when you say "the sinc formula gives that the contribution for the even time steps is zero". We know that the sinc(pi * x) function is zero when x is an integer (except for 0, though), and if we use x to denote a sample time, that is x == Fs * t, then for any time step t == n / Fs we get a zero (n != 0 and is assumed to be an integer, of course). From such a perspective, it is really not obvious why you computed the window in such a way.

I mean, it's clearly not the equivalent of a smaller hann window, but still

adefossez commented 3 years ago

There are two sample rates here, the original one and the target one. Here the original sample rate is F, and the target one is F / 2. Following https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html , we need to perform sinc interpolation with lowpassing at F/2 (the smaller of the two sample rates). Given y[i], the original waveform time series, we know that the ideal filtered continuous signal y'(t) is equal to

y'(t) = 1/2 * sum_i y[i] sinc(pi F/2 * (t - i / F ))

Now, we just have to sample y'(t) every 2 / F seconds:

y'[j] = y'(2 * j / F) = 1/2 sum_i y[i] sinc(pi F/2 * (2 * j / F - i / F))
                      = 1/2 sum_i y[i] sinc(pi (j - i/2))

So you see that the terms at even positions i will always be zero, because i / 2 is an integer (apart from the center term i = 2j, where sinc(0) = 1), while the terms at odd positions will not. We skip the multiplication for the even positions and only evaluate the kernel at the odd positions, but nonetheless you have to use the original window size, because that is the formula for a finite impulse response approximation of the filter.
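The argument can be checked numerically with a small pure-Python sketch. It is an illustration under stated assumptions, not the code in `resample.py`: the function names, the zero padding at the edges, the test signal, and `zeros = 4` are all hypothetical. One version evaluates every tap of the Hann-windowed kernel of length 4 * zeros + 1; the other keeps only the center tap and the odd offsets.

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x) / x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / (n - 1))) for k in range(n)]

def downsample2_full(y, zeros):
    """Evaluate every tap; d = i - 2j runs over [-2*zeros, 2*zeros]."""
    win = hann(4 * zeros + 1)
    out = []
    for j in range(len(y) // 2):
        acc = 0.0
        for d in range(-2 * zeros, 2 * zeros + 1):
            i = 2 * j + d
            if 0 <= i < len(y):  # samples outside the signal count as zero
                # sinc is even, so sinc(pi * (j - i/2)) == sinc(pi * d / 2)
                acc += y[i] * sinc(math.pi * d / 2) * win[d + 2 * zeros]
        out.append(0.5 * acc)
    return out

def downsample2_odd_only(y, zeros):
    """Same filter, skipping even offsets whose sinc value is zero."""
    win = hann(4 * zeros + 1)
    out = []
    for j in range(len(y) // 2):
        acc = y[2 * j]  # center tap: sinc(0) = 1 and the window peak is 1
        for d in range(-2 * zeros + 1, 2 * zeros, 2):  # odd offsets only
            i = 2 * j + d
            if 0 <= i < len(y):
                acc += y[i] * sinc(math.pi * d / 2) * win[d + 2 * zeros]
        out.append(0.5 * acc)
    return out

# Hypothetical test signal and kernel size.
y = [math.sin(0.3 * n) for n in range(64)]
full = downsample2_full(y, zeros=4)
odd_only = downsample2_odd_only(y, zeros=4)
print(max(abs(a - b) for a, b in zip(full, odd_only)))  # tiny, floating-point noise only
```

The odd offsets index the long window at positions 1, 3, 5, ..., which is why the window still has to be built at its full size before the odd samples are selected.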

stolpa4 commented 3 years ago

Oh finally, it took me a day to understand such an obvious thing! @adefossez, thank you for that explanation!

adefossez commented 3 years ago

It is far from obvious, it took me quite a bit of time to get it right :)