ChristianBergler / ORCA-CLEAN

ORCA-CLEAN: A Deep Denoising Toolkit for Killer Whale Communication
GNU General Public License v3.0

N2N Data Preparation Question #1

Closed: tibuch closed this issue 3 years ago

tibuch commented 3 years ago

Hi,

I saw your manuscript and was intrigued by the title. I find this cross-domain work very interesting. After reading over the manuscript I have a question regarding the data preparation for the Noise2Noise training.

In your manuscript you write the following:

> We have transferred the idea of Noise2Noise [11] to the field of bioacoustics, by corrupting noisy original spectrograms via different additive noise variants

My interpretation of this would be that you have a noisy spectrogram s_n1 = s + n1, where s is the unknown ground truth signal corrupted by additive noise n1. Then you created a second noisy observation s_n2 = s_n1 + n2 with n2 != n1. Based on my understanding of Noise2Noise, training on these pairs s_n1 and s_n2 would not converge to s but to s_n1, yielding no denoising effect.

My question is: why do the result images look denoised? As far as I can tell, you train a network which performs a denoising and a segmentation task simultaneously. What would happen if you trained for denoising only?

Best wishes!

ChristianBergler commented 3 years ago

Hi tibuch,

thanks for reading the paper and asking the question.

Your understanding of Noise2Noise is almost correct, but it misses a small detail. First of all: the idea is to train a denoiser without having any ground truth. We take the original "noisy" spectrogram, distort it even further by adding additional noise variants, and try to reconstruct the original "noisy" spectrogram from these distorted versions. In your words: we use s_n1 (our original noisy input) and s_n2 (a noise-corrupted version of s_n1) to train the model, so the input is s_n2 and the target is s_n1. You are right, we have no chance to recover the clean and unknown ground truth s in that scenario.
But here is what you are missing: once the model is trained, we feed it the original noisy spectrogram as it is. The network now tries to remove noise, because during training it learned to remove noise from a corrupted input in order to restore the original input, which is still noisy. By feeding the model the original, unseen noisy input, we get back an output with less noise (= denoising). Noise characteristics similar to those seen during training can thus be removed from unseen data. That is why the output looks, and is, denoised.
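To make the asymmetry between training and inference concrete, here is a minimal PyTorch-style sketch; `model`, `optimizer`, and the Gaussian `sample_noise` are illustrative assumptions on my side, not the actual ORCA-CLEAN code:

```python
import torch
import torch.nn.functional as F

def sample_noise(shape):
    # Stand-in for the paper's pool of noise variants (real underwater
    # noise, intensity changes, etc.); Gaussian here purely for illustration.
    return 0.1 * torch.randn(shape)

def train_step(model, optimizer, s_n1):
    """One Noise2Noise-style step: input s_n2 = s_n1 + n2, target s_n1."""
    s_n2 = s_n1 + sample_noise(s_n1.shape)  # corrupt the noisy original further
    optimizer.zero_grad()
    loss = F.mse_loss(model(s_n2), s_n1)    # reconstruct the *original noisy* spec
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def denoise(model, s_n1):
    # At inference the original noisy spectrogram itself goes in, so the
    # learned noise removal now acts on n1 and the output moves towards s.
    return model(s_n1)
```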

Now coming to your second question: Noise2Noise was originally designed for image restoration (see the reference in my paper). When you transfer it to the audio domain (as I describe in the paper), it will remove noise from all areas of a spectrogram, just as it would for an image. However, a spectrogram consists of regions which are not all of equal interest and importance. While pure noise can be removed entirely, noisy orca vocalizations need to be cleaned in a way that eliminates as much noise as possible, but as few parts of the orca voicings as necessary.

That is why we came up with binary masks acting as an additional attention mechanism, in order to teach the network what is important and what is not. The binary masks were created before training. And yes, these masks can be considered a binary segmentation between "valuable spectral parts" and "noise".
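Purely as an illustration of what such a mask could look like (the per-frequency median noise floor and the 6 dB threshold are my assumptions here, not the exact procedure from the paper):

```python
import numpy as np

def binary_mask(spec_db, threshold_db=6.0):
    """Hypothetical mask: mark time-frequency bins whose energy lies at
    least threshold_db above the per-frequency median noise floor."""
    noise_floor = np.median(spec_db, axis=1, keepdims=True)  # per frequency bin
    return (spec_db - noise_floor > threshold_db).astype(np.float32)
```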

So to summarize: we use different alternatives (noise-corrupted spectrograms + binary mask variants) to teach our model how to denoise. In both cases, we provide the network with an input that is "more noisy" than the output (noise-corrupted input vs. original noisy input, and original noisy input vs. binary mask).

I hope that helps for your understanding.

Best wishes!

tibuch commented 3 years ago

I understand the second part and think it is a nice idea to enhance certain regions.

Regarding the Noise2Noise modification I am still not entirely convinced. The reason N2N works is that s_n1 = s + n1 and s_n2 = s + n2, with n1 and n2 being two independent samples of a zero-mean distribution. Training a network with an MSE loss on such pairs converges to s.

The approach you describe could also be seen as training with pairs of simulated noisy images and ground truth images, where the noisy spectrogram is the ground truth and your training inputs are the noisy simulations. This should converge to something in between the ground truth noisy spectrogram and your simulated noisy-noisy spectrogram. I think it makes sense that the result is smoothed/less noisy, because the network will have a hard time predicting 'random' pixels. But the predicted signal will not be part of the true signal distribution.
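(The zero-mean argument is easy to check numerically; a minimal numpy sketch with purely illustrative values:)

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.sin(np.linspace(0, 2 * np.pi, 64))          # toy clean signal

# Many independent noisy targets s + n, n drawn from a zero-mean distribution:
targets = s + rng.normal(0.0, 0.5, size=(10_000, 64))

# The MSE-optimal prediction is the mean over the targets, which approaches
# s as the number of training pairs grows:
print(np.abs(targets.mean(axis=0) - s).max())      # small residual, ~0.01
```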

> While pure noise can be removed entirely, noisy orca vocalizations need to be cleaned in a way that eliminates as much noise as possible, but as few parts of the orca voicings as necessary.

If the orca voicings are your desired signal s and the noise on top has zero mean, I would expect that N2N denoising learned from two copies of the identical signal with independent noise would result in clear orca voicings. In such a setting the masking could still help during training to speed up convergence, but I would expect the same performance with and without masking once training has converged.

You are also citing Noise2Void, a self-supervised approach which can be trained without paired training data. I am curious: have you tried N2V on these data?

ChristianBergler commented 3 years ago

Hi,

I have to correct a few things regarding your s_n1 = s + n1 and s_n2 = s + n2, and regarding "that it results in something in between the ground truth noisy spectrogram and your simulated noisy-noisy spectrogram":

First of all, what we have in our approach is the following: s_n1 (= original noisy spectrogram) = s + n1, and s_n2 (= corrupted noisy spectrogram) = s_n1 + n2. s_n2 is the network input, s_n1 is the network target. During training the network will converge to something between the ground truth noisy spectrogram and the simulated noisy-noisy spectrogram, because it tries to reconstruct s_n1 out of s_n2. So, as you said, it will converge not to s but to s_n1 (MSE between the prediction from s_n2 and s_n1 as loss).

However, once training is finished the model is evaluated on real-world signals, so the real-world input to the network is s_n1 and not, as during training, s_n2. Consequently, the final network output will be something between s_n1 and s, because during training/validation/testing the network learned to reconstruct s_n1 out of an even more corrupted signal s_n2, i.e. s_n2 - "something" = s_n1 (in training). Applying the finished network to s_n1 signals (e.g. any kind of noisy orca sounds) then gives s_n1 - "something" = something close to s.

N2V is future work; we are currently planning to also try N2V, but we haven't yet.

If you still have questions, we can also offer a Zoom/Skype call from our side.

Best Wishes!

tibuch commented 3 years ago

I understand. My only concern would be that s_n1 and s_n2 have different noise distributions: during training the network is conditioned on the s_n2 distribution, while during prediction it is applied to the s_n1 distribution. I guess it would be interesting to see how the network performance depends on different noise intensities of n2.

> N2V is future work; we are currently planning to also try N2V, but we haven't yet.

Looking forward to these results! This might be interesting for you as well: we combined denoising and segmentation into one loss/network to leverage large noisy datasets with only a few ground truth annotations.

Best wishes!

ChristianBergler commented 3 years ago

Great to hear that! Yes, of course you are right, they have different noise distributions, because they see different sets of synthetic as well as real-world noise, drawn randomly from a pool of various noise distributions (described in the paper), also with different noise intensities. So during training we also simulate different intensity levels of n2, so that the network is hopefully able to handle as much variety as possible later on during prediction.
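A minimal sketch of what such intensity randomization might look like (the pool entries and the scaling range are assumptions on my side, not the paper's exact values):

```python
import random
import torch

# Illustrative pool of noise generators; the paper draws from synthetic
# as well as real-world underwater noise sources.
noise_pool = [torch.randn, lambda shape: 0.5 * torch.rand(shape)]

def corrupt(s_n1, min_scale=0.25, max_scale=1.5):
    """Pick a random noise variant and apply it with a random intensity,
    so the model sees many different n2 distributions during training."""
    n2 = random.choice(noise_pool)(s_n1.shape)
    return s_n1 + random.uniform(min_scale, max_scale) * n2
```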

Thanks a lot for all your feedback, really appreciate it! Regarding your work: it looks very interesting/promising, I will definitely have a deeper look at it, awesome! Moreover, once N2V is ready, I will also let you know.

Best wishes!