RamiMatar / Chroma-BSRNN


About the metrics #1

Open deyituo opened 1 year ago

deyituo commented 1 year ago

What is the mean SDR for vocals, accompaniment, etc.?

RamiMatar commented 1 year ago

The mean cSDR on the musDB test set is currently 6 to 6.4 dB depending on the model (the chroma models are about equal at 6.3 and 6.4 dB, and BSRNN is at roughly 6 dB). The models were trained for a significantly shorter time than in the paper and on a single GPU, but there might be other reasons why the scores are not as high as the paper's. I only trained and tested with vocals.

deyituo commented 1 year ago

I looked through the implementations on GitHub and found that their results fall below the paper's. This one might be a better (less buggy) reimplementation, though it differs a little from the original BSRNN for the MSS task.

https://github.com/sungwon23/BSRNN/blob/main/module.py

deyituo commented 1 year ago

Hi, I reimplemented BSRNN last Friday and ran an experiment on musdb18. After 19w steps (batch size 4, one 24 GB GPU), I now get a mean uSDR of 7.489 and a median of 8.080 for vocals on the musdb18 (not HQ) test set. It is still improving, and I think it could get better with more steps.

RamiMatar commented 1 year ago

Did you follow the implementation from the link you provided, github.com/sungwon23/BSRNN…?

Also what do you mean by 19w steps?

By media do you mean personal data not from a dataset?

Thanks for updating me! That sounds very promising.

deyituo commented 1 year ago

I made minor changes to the code at this URL, but the overall model architecture should match the paper:

  1. The LSTM t/k order is lstm_t_i, lstm_k_i, lstm_t_i+1, lstm_k_i+1. I think this part is not the same as the paper: https://github.com/sungwon23/BSRNN/blob/main/module.py#L46-L60
  2. The band-split scheme is V7 (41 bands) from the original paper: https://github.com/amanteur/BandSplitRNN-Pytorch/blob/main/src/model/modules/utils.py#L51
  3. I added a residual to mask*x in the mask estimation module, as in http://arxiv.org/abs/2212.00406, but I don't think it is essential.
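
For reference, the 41-band count in item 2 can be reproduced with a few lines of Python. This is only a sketch of the V7 layout as I understand it from the paper (100 Hz bands below 1 kHz, 250 Hz up to 4 kHz, 500 Hz up to 8 kHz, 1 kHz up to 16 kHz, 2 kHz up to 20 kHz, and the rest as one band). The function name `v7_band_edges_hz` and the 22050 Hz Nyquist default (44.1 kHz audio) are my own assumptions, not taken from either linked repository.

```python
def v7_band_edges_hz(nyquist_hz=22050):
    """Return (low, high) band edges in Hz for an assumed V7-style split.

    Hypothetical helper: the region limits and bandwidths below are my reading
    of the V7 description, not code from the linked repositories.
    """
    # (upper limit of the region in Hz, bandwidth inside the region in Hz)
    regions = [
        (1000, 100),
        (4000, 250),
        (8000, 500),
        (16000, 1000),
        (20000, 2000),
    ]
    bands = []
    start = 0
    for upper, width in regions:
        # Fill the region with equal-width bands.
        while start < upper:
            bands.append((start, start + width))
            start += width
    # Everything above the last region is treated as a single band.
    bands.append((start, nyquist_hz))
    return bands


bands = v7_band_edges_hz()
print(len(bands))
```

Mapping these edges in Hz onto STFT bin indices depends on the FFT size, which is not covered here.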

19w steps means 19 epochs, where each epoch contains 10,000 steps as in the original paper (190,000 steps in total). I think a smaller batch size on fewer GPUs needs more steps, hmm...

Median is just the median SDR over all test songs. Some older papers reported this metric, but nowadays people use the mean SDR (uSDR).
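
To make the two numbers concrete, here is a rough sketch of how a per-song SDR and its mean and median over the test songs could be computed. It assumes the usual energy-ratio definition of uSDR, 10*log10(sum(s^2) / sum((s - s_hat)^2)), on mono signals given as plain lists. The names `usdr_db` and `summarize` and the `eps` value are hypothetical, and this is not the evaluation code used for the scores above.

```python
import math
import statistics


def usdr_db(reference, estimate, eps=1e-8):
    """SDR in dB for one song: signal energy over error energy (assumed definition)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * math.log10((signal + eps) / (error + eps))


def summarize(per_song_sdr):
    """Return (mean, median) of the per-song SDR values."""
    return statistics.mean(per_song_sdr), statistics.median(per_song_sdr)


# Toy example with made-up per-song scores, not real results.
mean_sdr, median_sdr = summarize([6.0, 8.0, 10.0])
print(mean_sdr, median_sdr)
```

The median is less sensitive to a few very bad or very good songs, which is one reason the two numbers can differ noticeably.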

deyituo commented 1 year ago

I think the original paper should give more details about the layernorm and groupnorm configuration; without them it is confusing to deal with the dimensions N, K, T.

BSRNN for the speech enhancement task (http://arxiv.org/abs/2212.00406) uses layernorm for the offline (non-streaming) case and batchnorm for the online (streaming) case. Maybe batchnorm is OK, hmm...

deyituo commented 1 year ago

This project contains an implementation:

https://github.com/aim-qmul/sdx23-aimless