dhgrs closed this issue 5 years ago
Hi, the training data includes iKala (whole), MedleyDB (those with vocal parts), DSD100 (whole) and a small self-made dataset.
Thanks! In the paper, 20,000 sounds were used for training, so I think you used a larger dataset than iKala.
That's true. Actually, I also did a little data augmentation, and the total amount is about 17 hours. I'm quite surprised that I was able to get rather good results with a much smaller dataset than the original work.
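The thread does not say which augmentations were used; a common, cheap choice for source separation data is remixing stems with random per-source gains. A minimal sketch under that assumption (the `augment` function and all parameters here are hypothetical, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(vocal, accomp):
    """Return a remixed (mixture, vocal) pair with random per-source gain."""
    g_v = rng.uniform(0.5, 1.2)   # random gain on the vocal stem
    g_a = rng.uniform(0.5, 1.2)   # random gain on the accompaniment stem
    vocal = g_v * vocal
    mixture = vocal + g_a * accomp
    return mixture, vocal

v = rng.standard_normal(16000)    # 1 s of fake vocal at 16 kHz
a = rng.standard_normal(16000)    # 1 s of fake accompaniment
mix, tgt = augment(v, a)
assert mix.shape == tgt.shape == (16000,)
```

Each pass over the stems then yields a new mixture/target pair, which is one way a small stem collection can be stretched into more effective training hours.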
Very nice work! I have several questions:
Hi Xiao,
Nice work!
I have a question on stft: If the FFT length is 1024, then the size of the output is 513, right? The paper says that the size of the network input is 512. Shouldn't it be 513?
Hi, sorry for the late reply.
You are right about the size of the FFT spectrum. When feeding it into the network, the first bin (the DC bin) is dropped, and the network operates on the remaining 512 bins. The network input (and output) size is set to 512 because of a limitation of the U-Net design: after the deconvolution stages, the output size cannot come out to 513.
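The bin arithmetic above can be checked directly with NumPy (a sketch, assuming a plain real-input FFT; the repo's actual STFT code may differ):

```python
import numpy as np

n_fft = 1024
# A real FFT of length 1024 yields 1024 // 2 + 1 = 513 frequency bins.
frames = np.random.randn(100, n_fft)          # 100 hypothetical analysis frames
spec = np.abs(np.fft.rfft(frames, n=n_fft))   # magnitude spectrum, shape (100, 513)
assert spec.shape[1] == n_fft // 2 + 1        # 513 bins, including the DC bin

# Drop the first (DC) bin before feeding the network, leaving 512 bins,
# which halves cleanly through the U-Net's strided down/upsampling path.
net_input = spec[:, 1:]                       # shape (100, 512)
assert net_input.shape[1] == 512
```

512 is a power of two, so it can be halved repeatedly by the encoder and doubled back exactly by the decoder, which 513 cannot.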
Hi Xiao, very nice work! I have some questions: 1. Did you try to do this in TensorFlow? I applied the same network there, but the result was terrible. 2. You changed the filters from 5×5 to 4×4. Why? 3. I added biases. Do you think the biases have a negative effect? Thank you!
Hi,
It's nice work! You uploaded `unet.model`. Which dataset was used to train the model: iKala, MedleyDB, DSD100, or another?