facebookresearch / denoiser

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)

We provide a PyTorch implementation of the paper Real Time Speech Enhancement in the Waveform Domain, in which we present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities.

Couple of suggestions #81

Closed stolpa4 closed 3 years ago

stolpa4 commented 3 years ago

Hello, I have two suggestions for your denoiser.

The first one: you explicitly assume that a signal processed by your system has no DC component (which is why you normalize it only by its standard deviation). I suggest (maybe not in this project, but in future work in this area) also considering a demeaning step: while the assumption may hold for the underlying process, it usually does not hold for a concrete signal, and this small DC part may affect the neural net's performance.

By simply introducing demeaning into your normalization procedure, I was able to increase PESQ by ~0.1 and STOI by ~0.04.
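To make the suggestion concrete, here is a minimal sketch of the normalization I have in mind; the function name and the floor value are my own, not taken from the repository:

    import torch

    def normalize(wav: torch.Tensor, floor: float = 1e-3) -> torch.Tensor:
        # Zero-mean / unit-std normalization of a [channels, time] waveform.
        # The current pipeline only divides by the STD; the extra "wav - mean"
        # term removes any residual DC offset before scaling.
        mono = wav.mean(dim=0)
        mean = mono.mean()
        std = mono.std()
        return (wav - mean) / (floor + std)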

You don't have to take my word for it: just dedicate a small fraction of your resources to running the experiment yourself and see whether it works.

For me it does (I fixed all the random seeds (python-random, numpy, torch, torch.cuda) and enabled CUDA deterministic algorithms, so both experiments were exactly the same at every stage except for the demeaning step).
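For reference, the determinism setup was roughly the following (the seed value itself is arbitrary):

    import random
    import numpy as np
    import torch

    def set_deterministic(seed: int = 0) -> None:
        # Fix every source of randomness so two runs differ only in the code change.
        random.seed(seed)                     # python-random
        np.random.seed(seed)                  # numpy
        torch.manual_seed(seed)               # torch (CPU)
        torch.cuda.manual_seed_all(seed)      # torch.cuda, all devices
        torch.backends.cudnn.deterministic = True   # CUDA deterministic algorithms
        torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning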

The second one is a little bit more subtle, and you can check and apply it without changing anything in your neural net.

Look at your code from the DemucsStreamer class, feed method (demucs.py lines 305 - 308):

                mono = frame.mean(0)                         # average channels to mono
                variance = (mono**2).mean()                  # variance of this chunk, assuming zero mean
                self.variance = variance / self.frames + (1 - 1 / self.frames) * self.variance  # running average over frames
                frame = frame / (demucs.floor + math.sqrt(self.variance))  # normalize by the running STD

So what happens here? You take (frame_length + resample_lookahead) samples, update a running variance from them (assuming the mean of the underlying process is zero), and then normalize the frame by the updated running STD.

This means the computation uses some overlapping samples (the resample_lookahead part), which skews the variance estimate and affects overall system performance.

I suggest you compute the variance without overlapping:

                mono = frame.mean(0)[:-self.resample_lookahead]  # drop the lookahead tail before estimating the variance
                variance = (mono**2).mean()
                self.variance = variance / self.frames + (1 - 1 / self.frames) * self.variance
                frame = frame / (demucs.floor + math.sqrt(self.variance))

That way you get rid of this overlap (not entirely, though, since you also use the stride property and process the signal with overlapping windows, but it seems to be a good compromise) and make the variance estimation more accurate.
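As a toy illustration of why the tail matters (the frame length here is made up, nothing is taken from the streamer itself):

    import torch

    torch.manual_seed(0)
    resample_lookahead = 64
    frame = torch.randn(1, 661 + resample_lookahead)  # hypothetical frame plus lookahead tail

    mono = frame.mean(0)
    var_with_tail = (mono ** 2).mean()                           # current estimate (includes the tail)
    var_without_tail = (mono[:-resample_lookahead] ** 2).mean()  # proposed estimate (frame only)
    print(var_with_tail.item(), var_without_tail.item())         # the two estimates differ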

This minor change affects your system's overall performance, and you can verify it immediately by applying the patch and fixing the random seed in your test code.

adefossez commented 3 years ago

Thanks for the suggestions :) Indeed, accounting for the mean in the normalization could be beneficial, although a difference of 0.1 PESQ and 0.04 STOI is limited and probably does not lead to any audible difference. In future work we would probably include such a term.

For the second point, I would have to look into the overlapping buffer, but I don't expect it to change the output in any audible way, the resample buffer being quite limited.

stolpa4 commented 3 years ago

> Thanks for the suggestions :) Indeed, accounting for the mean in the normalization could be beneficial, although a difference of 0.1 PESQ and 0.04 STOI is limited and probably does not lead to any audible difference. In future work we would probably include such a term.
>
> For the second point, I would have to look into the overlapping buffer, but I don't expect it to change the output in any audible way, the resample buffer being quite limited.

Yes, I agree that nothing changes drastically. However, the difference between the results is still quite audible.