facebookresearch / demucs

Code for the paper Hybrid Spectrogram and Waveform Source Separation
MIT License

MUSDB18 Mixture Source: MUSDB18HQ Target Stems TRAIN #75

Open RadioAngurem opened 4 years ago

RadioAngurem commented 4 years ago

Hi Alexandre, Nicolas and co. First of all, congratulations: after trying the OpenUnmix, Spleeter and ConvTasnet code, I find that Demucs produces more musical tracks than the other models. The MUSDB18 dataset specs say that the original mixture track is the sum of the WAV stems at 44100 Hz. The WAV stems and the mixture are then each compressed to AAC at 256 kbps, so the sum of the compressed stems is not exactly equal to the compressed mixture track.
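The mismatch described above comes from lossy coding being a nonlinear operation. Here is a minimal sketch of that effect; `toy_lossy` is a hypothetical stand-in that only rounds samples to a coarse grid (real AAC is far more elaborate), and the two sine "stems" are made up for illustration:

```python
import math

def toy_lossy(signal, step=0.25):
    """Hypothetical stand-in for a lossy codec: round each sample to a coarse grid.

    This is NOT AAC; it only shares the property that matters here,
    namely that the operation is nonlinear.
    """
    return [round(x / step) * step for x in signal]

def make_sine(freq_hz, amplitude, n=64, sample_rate=44100):
    """Generate n samples of a sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# Two made-up "stems" and their exact sum, the "mixture".
vocals = make_sine(440.0, 0.30)
bass = make_sine(110.0, 0.30)
mixture = [v + b for v, b in zip(vocals, bass)]

# Path A: code each stem separately, then add the coded stems.
sum_of_coded_stems = [v + b for v, b in zip(toy_lossy(vocals), toy_lossy(bass))]

# Path B: code the mixture directly.
coded_mixture = toy_lossy(mixture)

# For a linear operation both paths would agree sample for sample.
max_mismatch = max(abs(a - b) for a, b in zip(sum_of_coded_stems, coded_mixture))
print("largest per-sample mismatch:", max_mismatch)
```

The printed mismatch is nonzero, which is the same kind of discrepancy as between the summed AAC stems and the AAC mixture, only exaggerated by the coarse grid.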

So, instead of training the model on the AAC or HQ versions of the dataset, why not use the AAC-compressed mixture track as the source and the original WAV stems at 44100 Hz as the target? The model could learn to recover the audio signal lost in the compression process.

james34602 commented 4 years ago

I'm not a Demucs author, so this is just my opinion: the idea you describe is audio super-resolution, and I'm not sure Demucs would fit this kind of problem nicely.

As far as I know, a paper like this one [https://github.com/kuleshov/audio-super-res] is essentially just a model-based harmonic exciter. Once information is lost in compression, there is no way to recover it.

If any model can do anything here, the high frequencies of the human voice are probably the only meaningful signal that could be recovered. For the lost high-frequency part of music, not much can be recovered, since music is too complex and can be very unpredictable.

RadioAngurem commented 4 years ago

After reading the super-resolution paper you recommended, I have been looking for more recent papers in that field, and there are several groups researching the use of GANs to denoise and enhance audio:

- https://arxiv.org/abs/2001.05532 Improving GANs for Speech Enhancement - Huy Phan & others
- https://arxiv.org/abs/1910.12620 Perceptual Speech Enhancement via Generative Adversarial Networks - Sherif Abdulatif & others
- https://arxiv.org/abs/1903.09027 Bandwidth Extension on Raw Audio via Generative Adversarial Networks - Sung Kim & others
- https://arxiv.org/abs/1911.03952 Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model - Seyyed Saeed Sarfjoo & others

The tracks produced by Demucs suffer from some kind of noise. Could a SEGAN network be applied to each of the four Demucs outputs to produce better tracks?

james34602 commented 4 years ago

Any time-domain processing could potentially introduce this kind of noise. The noise coming from Demucs is complicated, perhaps a combination of noise generated by both linear and nonlinear functions. However, you should still expect some noise even if all the internal convolution filters were linear.

> The tracks produced by Demucs suffer from some kind of noise. Could a SEGAN network be applied to each of the four Demucs outputs to produce better tracks?

The answer is no, I'm sure you can't. All the papers you found are in some way doing bandwidth extension, which cannot remove such noise. The Demucs noise may be caused by the nonlinear activation functions and by fast changes over time. The way to remove it is to train Demucs on a much wider dataset.

Returning to your question about whether Demucs could be trained to output high-frequency components: the answer is yes. Most neural networks contain nonlinear activation functions, and nonlinearity is the main way to generate harmonics. With the right model, you can generate HQ output that is similar to the training dataset.
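The point about nonlinearity and harmonics can be checked with a small experiment. This is only an illustrative sketch, not anything taken from the Demucs code: the window length, the 3-tap moving average and the choice of ReLU are all assumptions made for the demo. A pure sine is passed through a linear filter and through a ReLU, and the energy at twice the input frequency is measured in each case.

```python
import cmath
import math

N = 1024     # samples in the analysis window (arbitrary choice)
CYCLES = 8   # the input sine completes exactly 8 cycles in the window

def magnitude_at(signal, k):
    """Magnitude of DFT bin k, normalised by the window length."""
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / len(signal))
              for n, x in enumerate(signal))
    return abs(acc) / len(signal)

# Input: a pure tone, so bin 2 * CYCLES is empty to begin with.
sine = [math.sin(2 * math.pi * CYCLES * n / N) for n in range(N)]

# Linear processing: a circular 3-tap moving average (negative indices wrap
# around, which is exact here because the sine is periodic in the window).
linear_out = [(sine[n] + sine[n - 1] + sine[n - 2]) / 3 for n in range(N)]

# Nonlinear processing: ReLU, a common activation function.
relu_out = [max(0.0, x) for x in sine]

second_harmonic_linear = magnitude_at(linear_out, 2 * CYCLES)
second_harmonic_relu = magnitude_at(relu_out, 2 * CYCLES)

print("2nd harmonic after linear filter:", second_harmonic_linear)
print("2nd harmonic after ReLU:         ", second_harmonic_relu)
```

The linear filter only rescales and delays the existing tone, so the second-harmonic bin stays at numerical zero, while the ReLU output has clear energy there. Whether such generated harmonics match the true lost content is a separate question.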

Finally, I think any "learn from LQ input and produce HQ output" approach is nonsense. Before you try to train such a thing, or "design" your own architecture to "superres" your signal, please think about:

  1. Can a neural network generate 24-bit samples from 16-bit samples to recover dynamic range?
  2. Can the model learn the relationship even if you have a large dataset?
  3. Does the "low quality input" have a meaningful relationship to the "high quality ground truth"?
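To make question 1 concrete, here is a rough sketch of why the 16-bit to 24-bit direction is ambiguous. It assumes plain truncation of the 8 least significant bits with no dither, and the helper names are hypothetical:

```python
import math

def dynamic_range_db(bits):
    """Theoretical ratio between full scale and one quantisation step, in dB."""
    return 20 * math.log10(2 ** bits)

def to_16_bit(sample_24):
    """Reduce a signed 24-bit integer sample to 16 bits by dropping the
    8 least significant bits (truncation, no dither: an assumption)."""
    return sample_24 >> 8

# Every 16-bit code is shared by many 24-bit values. Count how many
# 24-bit samples near one arbitrary code (1000) collapse onto it.
target_code = 1000
centre = target_code * 256
preimages = [s for s in range(centre - 600, centre + 600)
             if to_16_bit(s) == target_code]

print("16-bit dynamic range (dB):", round(dynamic_range_db(16), 2))
print("24-bit dynamic range (dB):", round(dynamic_range_db(24), 2))
print("24-bit values per 16-bit code:", len(preimages))
```

Each 16-bit code corresponds to 256 possible 24-bit samples, so a model can only guess a plausible value among them; the extra bits are not present in the input.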