Open deyituo opened 1 year ago
Mean cSDR on the MUSDB test set is currently 6-6.4 dB depending on the model (the chroma models are about equal at 6.3 and 6.4, and BSRNN is at around 6). The models were trained for significantly less time than in the paper and on 1 GPU, but there might be other reasons why the scores are not as high as the paper's. I only trained and tested on vocals.
I looked through the implementations on GitHub and found that their results are below the paper's. This one might be a better (less buggy) reimplementation, though it differs slightly from the original BSRNN for the MSS task.
Hi, I reimplemented BSRNN last Friday and ran an experiment on MUSDB18. After 19w steps (batch size 4, one 24 GB GPU) I now get a mean uSDR of 7.489 and a median of 8.080 for vocals on the MUSDB18 (not HQ) test set. It's still improving and I think it could get better with more steps.
Did you follow the implementation from the link you provided, github.com/sungwon23/BSRNN…?
Also what do you mean by 19w steps?
By media do you mean personal data not from a dataset?
Thanks for updating me! That sounds very promising.
I made minor updates to the code at this URL, but the overall model architecture should match the paper:
19w steps means 19 epochs, where each epoch contains 10,000 steps as in the original paper (190,000 steps in total). I think a smaller batch size and fewer GPUs need more steps, hmm...
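To make the step count concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and only for illustration; the numbers (19 epochs, 10,000 steps per epoch, batch size 4, one GPU) are the ones mentioned in this thread.

```python
def total_steps(epochs, steps_per_epoch):
    """Total optimizer steps after a number of fixed-length epochs."""
    return epochs * steps_per_epoch


def segments_seen(steps, batch_size, num_gpus=1):
    """Training segments processed, assuming every GPU handles batch_size items per step."""
    return steps * batch_size * num_gpus


steps = total_steps(epochs=19, steps_per_epoch=10_000)
print(steps)                               # 190000
print(segments_seen(steps, batch_size=4))  # 760000
```

With a smaller effective batch, the same number of steps covers fewer training segments, which is one plausible reason more steps would be needed.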
Median is just the median SDR over all test songs. Some older papers reported this metric, but nowadays the mean SDR (uSDR) is used.
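For reference, a minimal sketch of how the two numbers relate, assuming uSDR is the plain SDR computed once over the whole track, 10·log10(‖s‖² / ‖s − ŝ‖²), and then aggregated over songs. The helper name `usdr` and the per-song scores below are made up for illustration and are not results from this thread.

```python
import math
from statistics import mean, median


def usdr(reference, estimate, eps=1e-8):
    """SDR in dB over a whole track: 10*log10(sum(s^2) / sum((s - s_hat)^2))."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


# Toy mono signal: the estimate has 10% amplitude error, so the SDR is about 20 dB.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(usdr(reference, estimate), 2))

# Hypothetical per-song scores: one poorly separated song lowers the mean
# more than the median.
per_song = [5.0, 8.0, 9.0]
print(mean(per_song), median(per_song))
```

The mean is more sensitive to a few badly separated songs, so the two aggregates can differ noticeably on the same test set.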
I think the original paper should give more details about the layernorm and groupnorm configuration. Without them, it is confusing how to handle the dimensions N, K, T.
BSRNN for the speech enhancement task (http://arxiv.org/abs/2212.00406) uses layernorm for the offline (non-streaming) task and batchnorm for the online (streaming) task. Maybe batchnorm is ok, hmm...
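To illustrate why the normalization config matters, here is a small dependency-free sketch of how the choice of reduction axes changes which values share statistics. It assumes a tensor laid out as (B, N, K, T) with B = batch, N = feature dimension, K = number of subbands, and T = time frames; that layout and the three options listed are my assumptions, not something the paper or this repo confirms.

```python
from itertools import product
from statistics import fmean


def count_statistics(shape, reduce_axes):
    """Number of separate (mean, variance) pairs when normalizing over reduce_axes."""
    count = 1
    for axis, size in enumerate(shape):
        if axis not in reduce_axes:
            count *= size
    return count


def normalize(x, shape, reduce_axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over reduce_axes.

    x maps index tuples (b, n, k, t) to floats; no learnable scale or shift.
    """
    groups = {}
    for idx in product(*(range(s) for s in shape)):
        key = tuple(i for axis, i in enumerate(idx) if axis not in reduce_axes)
        groups.setdefault(key, []).append(idx)
    out = {}
    for members in groups.values():
        mu = fmean([x[i] for i in members])
        var = fmean([(x[i] - mu) ** 2 for i in members])
        for i in members:
            out[i] = (x[i] - mu) / (var + eps) ** 0.5
    return out


B, N, K, T = 2, 3, 4, 5
shape = (B, N, K, T)
x = {idx: float(sum(idx)) for idx in product(*(range(s) for s in shape))}

options = {
    "over N only (layernorm on the feature axis)": {1},
    "over N and T (groupnorm with 1 group on a (B*K, N, T) view)": {1, 3},
    "over B, K, T (batchnorm over channels on a (B*K, N, T) view)": {0, 2, 3},
}
for name, axes in options.items():
    print(name, "->", count_statistics(shape, axes), "statistics")
```

For this toy shape the three options give 40, 8, and 3 separate statistics respectively, so they are clearly different operations even though each could be read as "normalizing the features".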
This project contains an implementation:
How is the mean SDR for vocals, acc., etc.?