CPJKU / madmom

Python audio and music signal processing library
https://madmom.readthedocs.io

reproduce the beat tracking result. #461

Closed fhahaha closed 3 years ago

fhahaha commented 3 years ago

To reproduce the beat tracking result.

I am trying to reproduce the beat tracking model in madmom. However, my results do not match the DBNBeatTracking results. I want to know:

  1. What training data did you use?
  2. Do you use any special training strategy? (There is much more non-beat data than beat data.)
superbock commented 3 years ago

Please have a look at the papers listed as references in the respective classes/methods; the training data is described there. Usually the cross-entropy error does not require any special counter-measures against the imbalance of beats vs. non-beats. However, it can increase training speed/performance if you artificially increase the number of beat targets by also marking the neighbouring frames as additional beat targets (with less weight).
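For illustration, a minimal sketch of that kind of target widening; the neighbour weight of 0.5 is an assumption, not necessarily the value used for the shipped models:

```python
import numpy as np

def widen_beat_targets(beat_frames, num_frames, neighbour_weight=0.5):
    """Frame-wise targets: 1.0 at annotated beat frames, a smaller weight
    at the directly neighbouring frames (illustrative values only)."""
    targets = np.zeros(num_frames)
    for frame in beat_frames:
        targets[frame] = 1.0
        for neighbour in (frame - 1, frame + 1):
            if 0 <= neighbour < num_frames:
                targets[neighbour] = max(targets[neighbour], neighbour_weight)
    return targets
```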

HTH

sevagh commented 3 years ago

@superbock

Hello - I'm interested in reproducing the results as well. Specifically, I'm interested in the "RNNBeatProcessor" -> "DBNBeatTrackingProcessor" chain, which I believe is covered in this series of papers:

Also, if I'm not mistaken, these algorithms combined are the same ones submitted to (and which achieved the best results in) the most recent MIREX audio beat tracking challenges? https://nema.lis.illinois.edu/nema_out/mirex2019/results/abt/smc/summary.html (algorithm SB1 there)

>>> from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor
>>> proc = DBNBeatTrackingProcessor(fps=100)
>>> proc  # doctest: +ELLIPSIS
<madmom.features.beats.DBNBeatTrackingProcessor object at 0x...>
>>> act = RNNBeatProcessor()('tests/data/audio/sample.wav')
>>> proc(act)
array([0.1 , 0.45, 0.8 , 1.12, 1.48, 1.8 , 2.15, 2.49])

Reading through the 2014 paper, I see it describes the following datasets:

As training material for our system, the datasets introduced in [13–15] are used. They are called Ballroom, Hainsworth and SMC respectively. To show the ability of our new algorithm to adapt to various music styles, a very simple approach of splitting the complete dataset into multiple sub-sets according to the original source was chosen.

Is the approach of splitting the dataset into multiple subsets described anywhere?

superbock commented 3 years ago

You are right about basically all parts; let me add a few additional notes:

The DAFx 2011 system was the first to use BLSTMs for beat tracking. The inputs were Mel spectrograms and median-filtered differences thereof.

For ISMIR 2014 we unified the input representation to something slightly different, i.e. logarithmically scaled and logarithmically filtered spectrograms and their temporal differences, and added a DBN for post-processing.
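Sketching that kind of input representation with madmom's public processor classes might look roughly like this; the frame size, filterbank settings and difference parameters below are assumptions and can differ from those used for the shipped models:

```python
import numpy as np
from madmom.processors import SequentialProcessor
from madmom.audio.signal import SignalProcessor, FramedSignalProcessor
from madmom.audio.stft import ShortTimeFourierTransformProcessor
from madmom.audio.spectrogram import (FilteredSpectrogramProcessor,
                                      LogarithmicSpectrogramProcessor,
                                      SpectrogramDifferenceProcessor)

# mono signal at 44.1 kHz, framed at 100 fps (assumed settings)
sig = SignalProcessor(num_channels=1, sample_rate=44100)
frames = FramedSignalProcessor(frame_size=2048, fps=100)
stft = ShortTimeFourierTransformProcessor()
# logarithmically spaced filterbank, then logarithmic magnitude scaling
filt = FilteredSpectrogramProcessor(num_bands=12, fmin=30, fmax=17000)
log = LogarithmicSpectrogramProcessor(mul=1, add=1)
# positive first-order temporal difference, stacked next to the spectrogram
diff = SpectrogramDifferenceProcessor(diff_ratio=0.5, positive_diffs=True,
                                      stack_diffs=np.hstack)

pre_processor = SequentialProcessor([sig, frames, stft, filt, log, diff])
features = pre_processor('tests/data/audio/sample.wav')  # frames x features
```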

Regarding the split for the multi-model part, it is unfortunately formulated a bit unclearly. Let me try to explain: we hypothesised that whatever data is used to train a model, it specialises towards the music/style of that dataset. So it would be advantageous to be able to choose among a set of models and pick the one most suitable for the music at hand. Using more homogeneous datasets (containing only a specific music style) would amplify this effect, because the individual models can specialise more. However, we did not come up with a sophisticated approach to splitting our training data by music style, so we used the simplest approach and split the complete training set into smaller datasets "according to their original source", i.e. into Ballroom, Hainsworth, and SMC.

We then trained several (specialised) models: one "reference" model on the whole dataset and several others only on subsets thereof, i.e. Ballroom, Hainsworth, SMC. However, we did not train these specialised models from scratch; we first trained a single model with all data and, after convergence was reached, derived the specialised models from it. We basically fine-tuned them (with a reduced learning rate) in different directions. The reference model was fine-tuned with all data, the other models only with the data of their subsets. First training normally and then fine-tuning with specific data has been shown to yield better results than training everything from scratch, at least with the model selection approach we used.

Model selection is then accomplished by comparing the individual models' activations to the reference model's and choosing the one with the smallest mean squared difference. Our intuition was that the individual models overfit on their data and thus i) produce better results on that specific music style, but ii) also (much) worse results on other music styles (similar to the gap between training and test data, which increases the longer training continues if no early stopping or other measures are taken).
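In code, that selection rule amounts to roughly the following (an illustrative sketch, not the actual implementation):

```python
import numpy as np

def select_model(reference_act, specialised_acts):
    """Return the index of the specialised model whose activations are
    closest to the reference model's output (smallest mean squared error)."""
    errors = [np.mean((act - reference_act) ** 2) for act in specialised_acts]
    return int(np.argmin(errors))
```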

Hope this is clearer now. So, you can come up with any split, as long as it divides the dataset into more homogeneous subsets. That said, this whole approach comes with a large overhead, since multiple networks need to be trained and also evaluated.

For the ISMIR 2015 paper we simply optimised the DBN by using a much sparser representation.

The linked MIREX result was not obtained by a combination of the papers above, but rather by a simplified approach without the model-selection part, which simply averages the predictions of multiple networks. The model was trained with different data, however; IIRC on the data listed in the ISMIR 2016 downbeat paper. This would explain the lower SMC results compared to previous years (when SMC was included in the training data).
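Averaging the network outputs before DBN decoding could look roughly like this (a sketch of the general idea, not the exact code behind the MIREX submission):

```python
import numpy as np
from madmom.features.beats import DBNBeatTrackingProcessor

def average_and_track(activations, fps=100):
    """Average beat activations from several networks (an iterable of
    equally long 1D arrays) and decode beat times with the DBN."""
    avg_act = np.mean(np.stack(list(activations)), axis=0)
    return DBNBeatTrackingProcessor(fps=fps)(avg_act)
```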

HTH

sevagh commented 3 years ago

Awesome - thanks a lot for this. So, say I have a hypothesis I want to verify (in my case, it's that using a CQT spectrogram instead of a normal spectrogram could give better beat tracking results).

Would you say then that I could probably ignore the multi-network details and model switching, and simply focus on comparing the train and test accuracy of a single Bi-LSTM (+ DBN post-processing) on a single dataset?

So:
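A minimal way to score such a single-model setup with madmom's evaluation utilities could look like this; the file names and the one-beat-time-per-line annotation format are assumptions:

```python
import numpy as np
from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor
from madmom.evaluation.beats import BeatEvaluation

audio_file = 'tests/data/audio/sample.wav'   # hypothetical test file
annotations = np.loadtxt('sample.beats')     # hypothetical ground-truth beat times

act = RNNBeatProcessor()(audio_file)
detections = DBNBeatTrackingProcessor(fps=100)(act)

ev = BeatEvaluation(detections, annotations)
print(ev.fmeasure, ev.cmlt, ev.amlt)
```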

superbock commented 3 years ago

Absolutely, since the altered signal pre-processing should have an (equal) effect on everything further down the road, if there is any effect at all. However, depending on the data you use for training, I doubt that there is any noticeable effect on the SMC dataset originating from this change. My gut feeling is that the effect of the data used is much larger and probably outweighs the effect of the altered pre-processing.