f90 / Wave-U-Net

Implementation of the Wave-U-Net for audio source separation
MIT License

Possible to train with number of sources K=1? #37

Closed · eeyzl5 closed this issue 5 years ago

eeyzl5 commented 5 years ago

Hello there,

I have to admit you are doing a very, very great job on this audio source separation task. After reading your paper and trying your code, I would say this is a real step forward for audio source separation methods, especially for music and vocal separation. How did you come up with the idea of training on the entire time-domain input (if I am not misunderstanding your method)? It gives smoother and more natural audio outputs compared to spectrogram/STFT-based methods. The sound quality of your predictions is incredibly high (which is a strict requirement for the music industry), and the accuracy is decent considering you are only training on a very, very small dataset. Excellent job!!

So, back to my question. I am really interested in your model and would expect huge improvements with larger training datasets. However, as indicated in your paper, the model predicts K sources as output, so K=2 for vocal separation (if I am not wrong), which requires both accompaniment and clean vocal data for training. Accompaniment music is easy to find, but clean vocals are not, which makes it harder to train on a larger dataset. So my question is: if I am only interested in predicting the accompaniment, can your model be trained with a K=1 output, which is exactly the accompaniment prediction I want, so that I can use a large accompaniment-only training set?

Thanks very much!

f90 commented 5 years ago

Hey, thanks for all the praise!

What you want to do sounds very easily doable with the Wave-U-Net. Just do not use the difference output layer (since it always creates one additional output that is the difference between the mix and the sum of all the other outputs); instead, use the direct output layer configuration and change the number of predicted sources to K=1.
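For context, here is an illustrative pseudocode sketch of the two output modes (the function names are made up for illustration and are not the repo's actual code): the difference output derives its last source from the mix, so it only makes sense for K >= 2, whereas the direct output predicts every source with its own output layer, so K=1 works fine.

```python
# Illustrative only -- not the actual Wave-U-Net code.

def difference_output(mix, direct_estimates):
    """direct_estimates: list of K-1 source estimates shaped like mix.
    The final source is implied as whatever remains of the mix."""
    return direct_estimates + [mix - sum(direct_estimates)]

def direct_output(direct_estimates):
    """direct_estimates: list of K source estimates predicted directly
    (here K=1: just the accompaniment)."""
    return direct_estimates
```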

For this you need to add this output configuration to the code yourself. A good starting point would be here:

https://github.com/f90/Wave-U-Net/blob/master/Config.py#L42

You could add a new value for the "task" parameter, e.g. "acc_only", and then extend that if-clause so that for this task, source_names is just ["accompaniment"] (see the sketch below). I think the rest of the code should then work without much modification - it will simply use only the accompaniment as an output source.
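For illustration, a rough sketch of what that addition could look like, assuming the if-clause in Config.py maps model_config["task"] to model_config["source_names"]; the exact key names and existing branches may differ from the current code, so check them against the linked line:

```python
# Sketch of the suggested Config.py change (assumed key names; adjust to the
# actual if-clause around the linked line).
if model_config["task"] == "multi_instrument":
    model_config["source_names"] = ["bass", "drums", "other", "vocals"]
elif model_config["task"] == "voice":
    model_config["source_names"] = ["accompaniment", "vocals"]
elif model_config["task"] == "acc_only":
    # New task: only the accompaniment is predicted (K=1).
    # Use the "direct" output layer here, since the difference output
    # needs at least one other predicted source to subtract from the mix.
    model_config["source_names"] = ["accompaniment"]
else:
    raise NotImplementedError

model_config["num_sources"] = len(model_config["source_names"])
```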

f90 commented 5 years ago

Closing this for now...