Closed: xavriley closed this issue 2 months ago.
Hello Xavier,
The F-measure score I reported in the paper was computed on the full-mix tracks of MDB, ENST, and RBMA, not the drum_only audio. If I remember correctly, the full-mix tracks were unavailable for one of the datasets; in that case, I merged each track's stems into a single stereo downmix with FFmpeg.
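Something along these lines, where the stem names and count are placeholders rather than the actual dataset layout:

```python
# Minimal sketch of the stem downmix (not necessarily the exact command used
# for the paper): mix N stems into one stereo file with FFmpeg's amix filter.
import subprocess

stems = ["kick.wav", "snare.wav", "overheads.wav"]  # placeholder stem names

cmd = ["ffmpeg", "-y"]
for stem in stems:
    cmd += ["-i", stem]
cmd += [
    # amix attenuates each input by 1/N by default
    "-filter_complex", f"amix=inputs={len(stems)}:duration=longest",
    "-ac", "2",  # force a stereo output
    "full_mix.wav",
]
subprocess.run(cmd, check=True)
```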
The model I shared is trained and validated exclusively on ADTOF-RGW and ADTOF-YT. I call this regime "adtofAll", hence the name of the model's file. I also did 3-fold cross-validation: I trained three models on three overlapping splits of adtofAll and tested each one on a different split of MDB, ENST, and RBMA. I then reported the average sum F-measure on the five classes. I only share the first of those three models, hence the "0" in the name of the model's file, but I don't think the three models differ much.
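In sketch form, the protocol looks roughly like this; train() and evaluate() are placeholders, not functions from the ADTOF codebase:

```python
# Hedged sketch of the 3-fold cross-validation described above.
FOLDS = [0, 1, 2]

def train(train_folds):
    """Placeholder: train one model on the given adtofAll splits."""
    return f"model_{min(train_folds)}"

def evaluate(model, test_fold):
    """Placeholder: score one model on its held-out split of MDB/ENST/RBMA."""
    return 0.0

scores = []
for test_fold in FOLDS:
    train_folds = [f for f in FOLDS if f != test_fold]  # train splits overlap pairwise
    scores.append(evaluate(train(train_folds), test_fold))

print(sum(scores) / len(scores))  # the paper reports this fold average
```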
If I look at the scores on the drum_only audio of MDB, I do see that the models trained on adtofAll reach an average of 0.87. If they are trained on all five datasets, including MDB, ENST, and RBMA, the F-measure increases to 0.89 on average.
Perfect, thanks for clarifying that!
I was working on an LBD for ISMIR looking at combining ADTOF with drum stem source separation. Initially I thought I had an improvement, but my system was measuring drum-only scores while I was comparing against the paper's full_mix scores 🥲
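For context, the pipeline under test looks roughly like this; the Demucs flag is real, but the output path and the transcriber entry point are placeholders rather than my exact setup:

```python
# Hedged sketch of the separation-then-transcription pipeline.
import subprocess

# 1. Isolate the drum stem with Demucs (recent versions support --two-stems).
subprocess.run(["demucs", "--two-stems", "drums", "full_mix.wav"], check=True)

# 2. Transcribe the isolated stem. "drumTranscriptor" stands in for whatever
#    inference script the ADTOF repo provides.
subprocess.run(
    ["drumTranscriptor", "separated/htdemucs/full_mix/drums.wav"],
    check=True,
)
```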
One observation that might be useful, however: when I aggressively normalize the (drum-only) input audio to -8.0 LUFS via pyloudnorm, the F-measure for ADTOF on drum_only rises from 87% to 90.83%. I haven't swept the normalization level yet, so that could potentially be improved upon as well.
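The normalization itself is just the standard pyloudnorm recipe, roughly as below; the list of target levels is only a suggestion for the sweep I haven't run yet:

```python
# Sketch of the loudness normalization step using pyloudnorm.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("drum_only.wav")        # placeholder input file
meter = pyln.Meter(rate)                     # ITU-R BS.1770 meter
loudness = meter.integrated_loudness(data)   # measured integrated loudness

for target in [-14.0, -11.0, -8.0, -5.0]:    # candidate levels for a sweep
    normalized = pyln.normalize.loudness(data, loudness, target)
    # Pushing quiet material up to -8 LUFS can clip, so check peaks before use.
    sf.write(f"drum_only_{target:.1f}LUFS.wav", normalized, rate)
```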
This is awesome, I would like to try multi-task learning as well. I will make sure to come see you at ISMIR, although I will be attending remotely.
Thank you for sharing your insight; I'm curious to learn more. I will close this issue for now, but feel free to ask more questions if you need to.
Sorry, more questions... 😅
I'm trying to replicate the results from the paper for MDB, ENST, and RBMA. When I run the model from this GitHub repo over the drum_only audio from MDB, I'm getting an F-measure score of 0.8773, whereas the paper reports 0.81 as the best result for the MDB set.

Am I right in thinking that the released model has seen MDB, ENST, and RBMA during training? It just seems strange that the score would be that high otherwise.
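For reference, the kind of per-class onset F-measure I'm talking about can be computed with mir_eval; this is a sketch with made-up onset times, assuming the usual 50 ms tolerance window:

```python
# Per-class onset F-measure via mir_eval (illustrative times, not real output).
import numpy as np
import mir_eval

ref = np.array([0.52, 1.01, 1.48])  # reference onsets (seconds) for one class
est = np.array([0.50, 1.03, 2.00])  # estimated onsets (seconds)

f_measure, precision, recall = mir_eval.onset.f_measure(ref, est, window=0.05)
print(f"F: {f_measure:.3f}  P: {precision:.3f}  R: {recall:.3f}")
```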
If that is the case, would it be possible to publish a model checkpoint trained only on the ADTOF-RGW and ADTOF-YT datasets? That might help with reproducibility.