f90 / Wave-U-Net

Implementation of the Wave-U-Net for audio source separation
MIT License

Data Augmentation #32

Closed shoegazerstella closed 5 years ago

shoegazerstella commented 5 years ago

Hi, I am following this paper for performing data augmentation on the MUSDB dataset. I am using librosa's time_stretch and pitch_shift on each sample of the dataset, and I then use stempeg to build a new stem file. Unfortunately, the Wave-U-Net preprocessing reports these statistics, which do not seem good enough for re-training the network properly:

stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_bass.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_drums.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_other.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_vocals.wav
Maximum absolute deviation from source additivity constraint: 1.015533447265625
Mean absolute deviation from source additivity constraint:    0.09679516069867423

On the musdb website it is also stated that:

Since the mixture is separately encoded as AAC, there is a small difference between the sum of all sources and the mixture. This difference has no impact on the bsseval evaluation performance.

Some of my code:

import os

import librosa
import numpy as np
import stempeg

SR = 44100
R = 0.1

def timeStretch(y, rate=1.0):
    # y has shape (samples x channels); stretch each channel separately
    # (default rate=1.0 means "no stretch"; librosa fails on rate=0)
    y_right = y[:, 0]
    y_left = y[:, 1]

    y_stretched_R = librosa.effects.time_stretch(y_right, rate=rate)
    y_stretched_L = librosa.effects.time_stretch(y_left, rate=rate)

    y_stretched = np.array([y_stretched_R, y_stretched_L])

    return y_stretched

# open stem file and retrieve all stems (n x samples x channels);
# ORIGINAL_STEMS_DIR, f and output_mp4 are defined elsewhere in my script
stem_path = os.path.join(ORIGINAL_STEMS_DIR, f)
info = stempeg.Info(stem_path)
S, _ = stempeg.read_stems(stem_path, info=info)

# stem order: [mix, drums, bass, other, vocals]
process_list = [S[0], S[1], S[2], S[3], S[4]]
stretched_list = []
for audio_to_process in process_list:
    y_stretched = timeStretch(audio_to_process, rate=R)
    stretched_list.append(y_stretched)

# create and save stem
S = np.array(stretched_list)
S = np.swapaxes(S, 1, 2)  # n x samples x channels
stempeg.write_stems(S, output_mp4, rate=SR)
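
The pitch_shift part uses the same per-channel pattern (sketching it here; the n_steps default is just a placeholder):

def pitchShift(y, n_steps=0):
    # shift each channel separately, mirroring timeStretch above
    y_shifted_R = librosa.effects.pitch_shift(y[:, 0], sr=SR, n_steps=n_steps)
    y_shifted_L = librosa.effects.pitch_shift(y[:, 1], sr=SR, n_steps=n_steps)
    return np.array([y_shifted_R, y_shifted_L])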

Do you have any idea on what could be the problem here? Thanks a lot!

f90 commented 5 years ago

Hey, are you rewriting the mixture as well? If not, then inputs and outputs don't match anymore and my test loader reports a mismatch. If you time-stretch the sources individually, then sum them to get a mixture, and then put this as training data into my code, it should work without showing these high numbers, since they just check whether mix = sum of sources.

Recomputing the mixture is mandatory here; otherwise inputs and outputs are not even on the same timescale, so training with any network won't really work.

f90 commented 5 years ago

FYI the code where I load the stempeg mixture audio and compare it to the sum of the sources is here:

https://github.com/f90/Wave-U-Net/blob/master/Datasets.py#L265
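
In essence, it computes something like the following (a minimal sketch, not the exact code from Datasets.py):

import numpy as np

def additivity_deviation(mix_audio, source_audios):
    # mix_audio: (samples x channels); source_audios: list of arrays of the
    # same shape. Measures how far the mixture is from the sum of its sources.
    diff = np.abs(mix_audio - np.sum(source_audios, axis=0))
    return np.max(diff), np.mean(diff)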

shoegazerstella commented 5 years ago

Hi, thanks a lot for replying. Correct me if I am wrong: according to the STEM format, the mixture is the first element of the array I call S in my code:

# process_list = [mix, drums, bass, other, vocals]
process_list = [S[0], S[1], S[2], S[3], S[4]]

So I am time-stretching S[0] together with the other tracks. For writing the STEM, I am now replacing that S[0] with the sum of all the processed tracks, as you suggested. The statistics changed a bit, but they are still far from what they should be, right?

stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_bass.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_drums.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_other.wav
stems_augmented/train/ANiMAL - Clinic A_stretched_1.0.stem_vocals.wav
Maximum absolute deviation from source additivity constraint: 0.730133056640625
Mean absolute deviation from source additivity constraint:    0.05901246167623359

I should also mention that I have this warning from stempeg, could that be part of the problem?

UserWarning: For better quality, please install libfdc_aac
  warnings.warn("For better quality, please install libfdc_aac")

Something I do not understand is why you are clipping acc_audio when generating it.

acc_audio = np.clip(sum([stem_audio[key] for key in stem_audio.keys() if key != "vocals"]), -1.0, 1.0)

f90 commented 5 years ago

There might be some problem with the encoding, seeing that the mean absolute deviation is not so high but the maximum one is. So it might be alright overall but locally some encoding inconsistencies produce a high error...

Solution 1: Export your audio to wave, and modify the MUSDB data loading code to load the wave files directly. Then you know there should be absolutely no deviation between the sum of sources and the mix, as you don't have any encoding inaccuracy.
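
A sketch of the export side (soundfile assumed as the wave writer; OUT_DIR and track_name are placeholders):

import os
import soundfile as sf

# one wav per stem; no AAC encoding, so the mix stays exactly the sum of sources
names = ["mix", "drums", "bass", "other", "vocals"]
for name, audio in zip(names, stretched_list):
    # audio is (channels x samples) here, soundfile expects (samples x channels)
    sf.write(os.path.join(OUT_DIR, track_name + "_" + name + ".wav"), audio.T, SR)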

Solution 2: If you are absolutely sure you are inputting "proper" data into the system, go ahead and ignore the warning, and/or use output_type: direct in the Wave-U-Net to allow it to output all sources unconstrained, so that it is capable of outputting sources that do NOT add up to the original mix. I would definitely listen to the dataset you produced in this case though, to make sure everything is alright.

I am clipping the accompaniment audio just to be sure that I don't generate values outside the [-1, 1] range, since it's a sum of the individual audio signals, so the amplitudes add up. It should not be necessary if the dataset is proper, but it doesn't hurt either.

f90 commented 5 years ago

Wait a second, I just saw your code here:

process_list = [S[0], S[1], S[2], S[3], S[4]]
for audio_to_process in process_list:
    y_stretched = timeStretch(audio_to_process, rate=R)
    stretched_list.append(y_stretched)

Does that mean you don't compute the mix as sum of the sources but rather time-stretch it as well? That might be the reason here. The time-stretching function might stretch things a bit differently for the sources compared to the mix. You should definitely set mix = sum of sources instead!

f90 commented 5 years ago

It should be something like this:

process_list = [S[1], S[2], S[3], S[4]]
stretched_list = []
for audio_to_process in process_list:
    y_stretched = timeStretch(audio_to_process, rate=R)
    stretched_list.append(y_stretched)
# sum over the source axis (a plain np.sum would collapse everything to a scalar)
mix_stretched = np.sum(stretched_list, axis=0)
stretched_list.insert(0, mix_stretched)

shoegazerstella commented 5 years ago

Does that mean you don't compute the mix as sum of the sources but rather time-stretch it as well? That might be the reason here. The time-stretching function might stretch things a bit differently for the sources compared to the mix. You should definitely set mix = sum of sources instead!

I was doing that before, but I then set mix = sum of the processed sources, as you suggested yesterday. The statistics in my last comment come from this adjustment. As you can see, the mean and max values changed after doing this.

Thanks a lot for proposing the 2 solutions, I will try and see which one is better.

shoegazerstella commented 5 years ago

Some updates:

This is the augmented file:

ffprobe ANiMAL\ -\ Clinic\ A_stretched_1.0.stem.mp4 

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'ANiMAL - Clinic A_stretched_1.0.stem.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf58.20.100
  Duration: 00:03:57.89, start: 0.000000, bitrate: 1195 kb/s
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 251 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:2(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 224 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:3(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 255 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:4(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 193 kb/s
    Metadata:
      handler_name    : SoundHandler

This is the original one:

ffprobe ANiMAL\ -\ Clinic\ A.stem.mp4 

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'ANiMAL - Clinic A.stem.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isom
    creation_time   : 2017-12-16T16:50:00.000000Z
  Duration: 00:03:57.85, start: 0.000000, bitrate: 1288 kb/s
    Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:2(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:3(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:4(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 256 kb/s
    Metadata:
      handler_name    : SoundHandler
    Stream #0:5: Video: png, rgba(pc), 512x512 [SAR 20157:20157 DAR 1:1], 90k tbr, 90k tbn, 90k tbc

I see some inconsistencies in the bitrates.

Also, I am computing the stats (using your data loader) before and after writing the augmented stem:

# before
Maximum absolute deviation from source additivity constraint: 5.551115123125783e-17
Mean absolute deviation from source additivity constraint:    2.595044779849003e-18

# after
Maximum absolute deviation from source additivity constraint: 0.730133056640625
Mean absolute deviation from source additivity constraint:    0.05901246167623359
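
For reference, the "after" numbers come from reading the written file back (continuing the script above), roughly like this:

# re-read the augmented stem file and recompute the deviation, so any
# difference must come from the AAC encode/decode roundtrip
S_back, _ = stempeg.read_stems(output_mp4)
diff = np.abs(S_back[0] - np.sum(S_back[1:], axis=0))
print(np.max(diff), np.mean(diff))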

There must be something wrong with ffmpeg, or do you see something else that's incorrect? I am going to try Solution 1. Thanks a lot!

f90 commented 5 years ago

I am not too well versed in the ffmpeg part of the story, but I personally wouldn't trust it to encode things to the degree of accuracy that we require. I also had some issues when loading encoded audio in terms of synchronisation, where the audio was suddenly misaligned in time, which is obviously very bad in our setting.

But yeah, it looks like the ffmpeg encoding (settings) is to blame here. I decode all the stems to wave as part of data preparation anyway, since it's much faster to load the audio during training that way, so you should probably use Solution 1 as I proposed and cut out the whole stempeg part completely.

Another solution, if time-stretching is not too CPU-intensive, is to make it part of an on-the-fly data augmentation pipeline during training. That saves disk space but might slow down training, since batches take longer to prepare.
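
Roughly like this inside the batch preparation code (a sketch; timeStretch is your helper from above and the rate range is arbitrary):

import numpy as np

def augment_on_the_fly(source_snippets, low=0.9, high=1.1):
    # draw one stretch rate per training example, apply it to every source,
    # then rebuild the mix so additivity holds by construction
    rate = np.random.uniform(low, high)
    stretched = [timeStretch(s, rate=rate) for s in source_snippets]
    mix = np.sum(stretched, axis=0)
    return mix, stretched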

shoegazerstella commented 5 years ago

I am using librosa, so it takes quite a long time; I can't do it during training unless I find some other solution. Thanks a lot for helping me! :)