f90 / Wave-U-Net

Implementation of the Wave-U-Net for audio source separation
MIT License

Question - Why not train on smaller patches #31

Closed · lminer closed this issue 5 years ago

lminer commented 5 years ago

I noticed that you run convolutions over the entire 16384 frames rather than processing a song in smaller patches. Is there a reason for this decision? Doesn't this increase the memory requirements and lower the ability to randomize data?

f90 commented 5 years ago

You are right that a larger input and output size for each audio excerpt means we need to choose a lower batch size so everything fits into memory. In my case, 16384 might sound like a lot, but it still allowed me to train on a GPU with 8 GB of memory using a batch size of 16, which is a perfectly fine batch size.

It's possible to train on smaller or larger patches within the Wave-U-Net framework, so if you want to reduce the input and output size and change the batch size, you can edit the num_frames and batch_size entries in the model configuration in Config.py. You might get slightly better results with even larger batch sizes.
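
For example, here is a minimal sketch of the kind of change meant here (the surrounding layout of Config.py is only assumed; check your local copy for the exact structure):

```python
# Sketch only - the surrounding structure of Config.py may differ; the point is
# just which entries of the model configuration to adjust.
model_config = {
    # ... other entries unchanged ...
    "num_frames": 16384,  # samples per training excerpt; reduce for smaller patches
    "batch_size": 16,     # raise this if a smaller num_frames frees up GPU memory
}
```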

There is one limitation, however: with a given number of layers, the Wave-U-Net input has to have a certain minimum size, since each downsampling block roughly halves the time resolution of the feature maps. If the input were too small, some of the deeper, high-level features in the later layers could not be computed because no time frames would be left. This makes sense intuitively: the network has a certain receptive field size, and you want to give it an input at least as long as that receptive field.
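
As a rough illustration of that limit (this deliberately ignores the extra context the convolution filters consume, so it only approximates the real layer arithmetic):

```python
def frames_at_bottleneck(num_frames, num_layers):
    """Illustrative only: assume every downsampling block roughly halves the
    time resolution and ignore the extra context the filters need."""
    frames = num_frames
    for layer in range(1, num_layers + 1):
        frames //= 2  # decimation by a factor of 2 per downsampling block
        if frames == 0:
            print(f"block {layer}: no time frames left - input too short for this depth")
            return 0
    print(f"{frames} time frames remain at the deepest layer")
    return frames

frames_at_bottleneck(16384, num_layers=12)  # 4 frames remain at the deepest layer
frames_at_bottleneck(512, num_layers=12)    # runs out of frames at block 10
```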

If you want to reduce the input size even further, you would have to reduce the number of layers (num_layers). This can make training faster, as the network is shallower and fewer inputs have to be processed, but it also means the network cannot look very far into the past or future to improve its predictions.

Does that make sense?

lminer commented 5 years ago

It does make sense, but it still seems high. The paper "Singing Voice Separation with Deep U-Net Convolutional Networks" used a patch size of 128, but I guess that was for the magnitude spectrogram, so maybe it isn't comparable. 16384 frames is roughly how much time?

f90 commented 5 years ago

Yes, due to the spectrogram representation it's not directly comparable. But in terms of input size: if you compute a spectrogram with an STFT using a hop size of 256 audio samples and a window size of 512 samples, each time frame ends up with about 256 frequency bins. Taking 128 time frames then gives you a 128x256 input patch, which covers 128x256/22050 ≈ 1.48 seconds of audio at a 22.05 kHz sample rate. That is 128x256 = 32768 input values, twice as many as the Wave-U-Net's 16384, but it also covers twice as much audio, since the Wave-U-Net excerpt spans only 16384/22050 ≈ 0.74 seconds. So in terms of pure input dimensionality, both models cover the same audio duration for a given number of input values.
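
A quick back-of-the-envelope check of those numbers:

```python
# Back-of-the-envelope check of the comparison above, assuming 22.05 kHz audio.
sr = 22050

# Spectrogram U-Net style patch: 128 STFT frames, hop size 256, ~256 frequency bins
hop, bins, n_frames = 256, 256, 128
spec_inputs = n_frames * bins        # 32768 input values
spec_seconds = n_frames * hop / sr   # ~1.5 s of audio covered

# Wave-U-Net excerpt: 16384 raw audio samples
wave_inputs = 16384
wave_seconds = wave_inputs / sr      # ~0.7 s of audio covered

print(f"spectrogram patch: {spec_inputs} inputs cover {spec_seconds:.1f} s")
print(f"Wave-U-Net excerpt: {wave_inputs} inputs cover {wave_seconds:.1f} s")
# Both work out to one second of audio per 22050 input values.
```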

How you then process these inputs with your neural network, and how much memory and time that consumes, is of course a different question.

Hope that clears it up?

lminer commented 5 years ago

Perfectly. Thanks so much!