Xiao-Ming / UNet-VocalSeparation-Chainer

A Chainer implementation of U-Net singing voice separation model

Which dataset was used to train? #1

Closed dhgrs closed 5 years ago

dhgrs commented 6 years ago

Nice work! You uploaded unet.model. Which dataset was used to train the model: iKala, MedleyDB, DSD100, or another?

Xiao-Ming commented 6 years ago

Hi, the training data includes iKala (whole), MedleyDB (those with vocal parts), DSD100 (whole) and a small self-made dataset.

dhgrs commented 6 years ago

Thanks! In the paper, 20,000 tracks were used for training, so I think you used a larger dataset than iKala alone.

Xiao-Ming commented 6 years ago

That's true. I also did a little data augmentation, and the total amount is about 17 hours. I'm quite surprised that I was able to get a rather good result with a much smaller dataset than the original work used.
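For context, one common form of data augmentation for source separation is remixing vocal and accompaniment stems at random gains to create new training mixtures. The sketch below is purely hypothetical (the `augment_mix` helper is my own invention, not code from this repository):

```python
import numpy as np

def augment_mix(vocal, accomp, rng):
    """Remix stems with random gains to create a new training mixture.

    Hypothetical helper -- the augmentation actually used here may differ.
    Returns the augmented mixture and the matching vocal target.
    """
    g_v = rng.uniform(0.5, 1.5)  # random gain for the vocal stem
    g_a = rng.uniform(0.5, 1.5)  # random gain for the accompaniment stem
    mixture = g_v * vocal + g_a * accomp
    return mixture, g_v * vocal

# One second of dummy stems at 16 kHz, just to show the shapes.
rng = np.random.default_rng(0)
vocal = rng.standard_normal(16000)
accomp = rng.standard_normal(16000)
mix, target = augment_mix(vocal, accomp, rng)
print(mix.shape)     # (16000,)
print(target.shape)  # (16000,)
```

Each remix counts as an extra training example, which is one way a 17-hour dataset can stretch further than its raw length suggests.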

melspectrum007 commented 6 years ago

Very nice work! I have several questions:

  1. What was the training time for your 17 hours of data?
  2. I found that vocal/instrument separation does not work so well for hip-hop songs. Does the training set include such songs (e.g., hip-hop)?

Xiao-Ming commented 6 years ago

  1. I trained it on an AliYun cloud GPU instance with an NVIDIA Tesla M40, and it took about 6 hours for the loss to converge.
  2. The dataset contains mainly pop songs; I guess there are not many hip-hop tracks (MedleyDB includes at least one, as I recall). However, the demonstration at the ISMIR poster session by the authors of the original paper included separation results for some hip-hop songs, and they were actually near-perfect.

jibrahim80 commented 6 years ago

Hi Xiao,

Nice work!

I have a question about the STFT: if the FFT length is 1024, then the size of the output is 513, right? The paper says that the size of the network input is 512. Shouldn't it be 513?

Xiao-Ming commented 6 years ago

Hi, sorry for the late reply.

You are right about the size of the FFT spectrum. When it is fed into the network, the first bin (the DC component) is ignored, and the network processes the other 512 bins. The network input (and output) size is set to 512 because of a limitation of the U-Net design: the deconvolution layers cannot produce an output size of 513.
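The bin arithmetic can be sketched in a few lines of NumPy (an illustration only; the repository's actual preprocessing may differ):

```python
import numpy as np

# A 1024-point real FFT yields 1024 // 2 + 1 = 513 frequency bins.
n_fft = 1024
frame = np.random.randn(n_fft)
spectrum = np.fft.rfft(frame)
print(spectrum.shape)  # (513,)

# Dropping the first (DC) bin leaves 512 bins, matching the network input.
net_input = np.abs(spectrum)[1:]
print(net_input.shape)  # (512,)
```

512 is convenient here because it halves cleanly through the U-Net's six stride-2 downsampling layers (512 → 256 → ... → 8), whereas 513 does not.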

Ninacll commented 6 years ago

Hi Xiao, very nice work! I have some questions:

  1. Did you try this with TensorFlow? I applied the same network there, but the result was terrible.
  2. You changed the filters from 5x5 to 4x4. Why?
  3. I added biases. Do you think the biases have a negative effect?

Thank you!

Xiao-Ming commented 6 years ago

Hi,

  1. I only implemented it with the Chainer framework.
  2. I had some trouble setting the correct layer sizes with a 5x5 filter, so I made a minor change. I guess there might be some mistakes in my implementation details.
  3. I am not sure about the effect of using (or removing) biases. I think it is difficult to isolate the actual effect of such parameters in a network this complicated.
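The layer-size issue in point 2 can be illustrated with the standard transposed-convolution output formula, out = (in - 1) * stride - 2 * pad + k. This is a sketch under my own assumptions (stride 2, "same"-style padding), not code from this repository:

```python
# Output size of a transposed convolution (deconvolution) layer.
def deconv_out(size, k, stride=2, pad=None):
    if pad is None:
        pad = (k - 1) // 2  # "same"-style padding
    return (size - 1) * stride - 2 * pad + k

# A 4x4 kernel with pad 1 and stride 2 doubles the size exactly,
# so six layers take 8 -> 16 -> 32 -> 64 -> 128 -> 256 -> 512.
size = 8
for _ in range(6):
    size = deconv_out(size, k=4, stride=2, pad=1)
print(size)  # 512

# A 5x5 kernel with pad 2 yields 2 * in - 1 instead,
# so the sizes drift off the powers of two the U-Net expects.
print(deconv_out(8, k=5, stride=2, pad=2))  # 15, not 16
```

This may be why swapping the 5x5 filters for 4x4 made the encoder and decoder sizes line up cleanly.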