facebookresearch / demucs

Code for the paper Hybrid Spectrogram and Waveform Source Separation
MIT License

Specific noise in output #303

Open unemployed-denizen opened 2 years ago

unemployed-denizen commented 2 years ago

I've recently been training Demucs (not the Hybrid one) on 5-second slices sampled at 8000 Hz. I make sure that the input is of valid_length (5 x 8000 samples) and that the references are center-trimmed to the output size. The dataset is the MUSDBHQ training set. For the model, I use all the original settings except for the 2x resampling. For the loss, I'm only using MSE. However, there seems to be a particular kind of noise (probably at a specific frequency) throughout the output. Here is a visualization; the blue line is the reference and the orange one is the noise. (image: noise) I'm sure the model is working, since it is trying to distinguish between vocals and accompaniment. (image: model) Has anyone come across this issue? I would really appreciate it if someone could help point out where the problem is.
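For readers following the setup, here is a minimal sketch of center-trimming a reference to the model's output length and computing an MSE loss on it. The helper names `center_trim` and `mse`, the output length of 39000 samples, and the random data are assumptions for illustration; they are not taken from the poster's training code.

```python
import numpy as np


def center_trim(reference: np.ndarray, target_length: int) -> np.ndarray:
    """Trim the last axis symmetrically so it has `target_length` samples."""
    delta = reference.shape[-1] - target_length
    if delta < 0:
        raise ValueError("reference is shorter than the target length")
    start = delta // 2
    return reference[..., start:start + target_length]


def mse(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Mean squared error over all channels and samples."""
    return float(np.mean((estimate - reference) ** 2))


sample_rate = 8000
segment = 5 * sample_rate  # 5-second slice, 40000 samples

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, segment))  # stand-in stereo reference

# Hypothetical output length: a model without padding returns fewer
# samples than it receives, so the reference has to be trimmed to match.
output_length = 39000
estimate = np.zeros((2, output_length))  # stand-in for the model output

trimmed = center_trim(reference, output_length)
loss = mse(estimate, trimmed)
print(trimmed.shape, loss)
```

If the trim were off by even one sample, the reference and the estimate would be misaligned in time, so this step is worth checking when debugging unexpected noise.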

adefossez commented 2 years ago

The first plot you give seems really low in volume, so noise at that level is plausible. For the second one, would it be possible to have some samples? Is blue the ground-truth vocal or the mixture?

dingjibang commented 2 years ago

image

Yes, it is more noticeable on the spectrogram. I'm training a guitar track with hdemucs. Initially I thought it was a problem with the noise floor in the training data (instruments recorded with a microphone usually have this kind of noise), but when I used a 100% noise-free guitar (made with a synthesizer) as the training data, I still found this "zizzy" noise.

I noticed some patterns:

  1. The noise decreases as the number of epochs increases, but it does not disappear.
  2. The noise appears where there should be no sound. For example, a normal song may be silent for a few seconds at the beginning, and the noise is very obvious there. When the volume of the song increases, the noise disappears (or is masked by the song, making it impossible to hear).
  3. This is also the case with kuielab/mdx-net.
  4. The noise does not belong to the song itself; it is made out of nothing.
  5. The noise is not continuous, but a pulse-like sound that can occur dozens of times a second.

How to reproduce

  1. Train for 1 epoch on only 1 training example. (This is just for quick reproduction; I actually used 300 pairs of data for training.)
  2. Listen to the result; you will hear an obvious sizzling noise.

What the noise sounds like https://user-images.githubusercontent.com/8450073/160237576-f698854a-240d-4474-801c-bffd907cfce4.mp4

  1. Where the song (mixture.wav) is silent, you can hear noise when playing the guitar.wav or no_guitar.wav generated by demucs.
  2. When guitar.wav and no_guitar.wav are played together, there is a sound that does not exist in the original song and is concentrated at 11025 Hz (you can see it on the spectrogram), but this 11025 Hz is not necessarily fixed. With the same training data, kuielab/mdx-net even has two lines (19000 Hz and 7373 Hz), so I don't think the frequency is meaningful.
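One way to check the second observation numerically is to subtract the mixture from the sum of the two separated stems and look for the strongest frequency in what remains. The sketch below uses synthetic stand-in signals rather than the real files; the 44100 Hz sample rate, the 440 Hz test tone, the amplitudes, and the helper `dominant_frequency` are all assumptions for illustration.

```python
import numpy as np


def dominant_frequency(signal: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) of the largest non-DC spectral peak."""
    window = np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[0] = 0.0  # ignore the DC component
    return float(freqs[np.argmax(spectrum)])


sample_rate = 44100  # assumed; not stated in the thread
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Synthetic stand-ins for mixture.wav, guitar.wav and no_guitar.wav.
mixture = 0.5 * np.sin(2 * np.pi * 440 * t)
artifact = 0.01 * np.sin(2 * np.pi * 11025 * t)  # simulated "zizzy" line
guitar = 0.6 * mixture + artifact
no_guitar = 0.4 * mixture

# Whatever the stems contain that the mixture does not.
residual = guitar + no_guitar - mixture
peak = dominant_frequency(residual, sample_rate)
print(f"strongest residual component near {peak:.0f} Hz")
```

With real files, the same residual computed over a silent passage of the mixture should isolate the noise, assuming the stems and the mixture are sample-aligned.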

So I would like to know how these strange "zizzy" noises come about.

unemployed-denizen commented 2 years ago

Sorry for the confusion. In the second plot, orange is the estimate and blue is the mixture. It is unrelated to the noise; I uploaded it only to show that the model is working.

unemployed-denizen commented 2 years ago

In my case the noise appears not only in the silent parts but all across the song. Maybe my low sampling rate makes it more prominent, or I have missed some important settings.

Anyway, after some investigation, I think the noise might be caused by transposed convolution. Here are some relevant things that I have found:

dingjibang commented 2 years ago

image

Yes, you are right: the noise is present all across the song, not only at the beginning.

image

Also, the example image in arXiv:2010.14356 is very interesting: there is indeed a line at roughly (not necessarily exactly) 11025 Hz.
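A minimal sketch of how a strided transposed convolution on its own can produce a spectral line at a quarter of the sample rate, which would be 11025 Hz for 44100 Hz audio. It assumes a single-channel layer with kernel size 8 and stride 4 and a constant input; these values, the random kernel, and the helper `conv_transpose_1d` are assumptions for illustration and are not measurements from any trained model in this thread.

```python
import numpy as np


def conv_transpose_1d(x: np.ndarray, kernel: np.ndarray, stride: int) -> np.ndarray:
    """Single-channel transposed convolution without padding or bias."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        # Each input sample stamps a scaled copy of the kernel into the output.
        out[i * stride:i * stride + len(kernel)] += value * kernel
    return out


sample_rate = 44100  # assumed output sample rate
stride = 4           # assumed upsampling factor of the last layer
rng = np.random.default_rng(0)
kernel = rng.standard_normal(8)  # stand-in for learned weights

# A constant input should ideally give a constant output.
x = np.ones(1024)
y = conv_transpose_1d(x, kernel, stride)

# Drop the edges, where fewer kernel copies overlap.
interior = y[stride:-stride]

# Sample n receives kernel[n % stride] + kernel[n % stride + stride], so the
# output repeats every `stride` samples unless those sums happen to be equal.
spectrum = np.abs(np.fft.rfft(interior))
freqs = np.fft.rfftfreq(len(interior), d=1.0 / sample_rate)
spectrum[0] = 0.0  # ignore DC
mask = spectrum > 1e-6 * spectrum.max()
line_freqs = sorted({int(round(f)) for f in freqs[mask]})
print("spectral lines (Hz):", line_freqs)
```

In a real network the input is not constant and the kernels are learned, so the ripple can shrink during training, but a small residual at the sample rate divided by the stride would be consistent with this mechanism. At an 8000 Hz sample rate the same reasoning would put the line at 2000 Hz.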

I trained for another whole day yesterday. The bad news is that although the SDR is increasing, the noise is not decreasing with it... I will try to train for a few more days.