I assume you refer to the RNNBeatProcessor class and the referenced 2011 DAFx paper.
The original paper used a Mel filtered spectrogram, that's what I used back then and it worked (and would still work). Since then the whole signal pre-processing stuff evolved a bit. I switched to logarithmically filtered spectrograms in 2012, simply because they are more useful for other tasks (most importantly note transcription) and I wanted to have the same or very similar signal pre-processing for all my tasks.
It is not a power spectrogram, but rather a simple magnitude spectrogram. It is first filtered (see above) and then scaled logarithmically by taking the logarithm (after adding a constant value of 1). This is inspired by the human ear, and this setting has been found advantageous for other purely signal processing based approaches, so I stuck to it.
HTH
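For reference, here is a rough NumPy sketch of that pre-processing chain; the function name, the shapes and the log base are my own choices, not madmom's API:

```python
import numpy as np

def log_filtered_spectrogram(frames, filterbank):
    """Minimal sketch of the pre-processing described above (not madmom's API).
    frames:     windowed signal frames, shape (num_frames, frame_size)
    filterbank: Mel or logarithmic filterbank matrix,
                shape (frame_size // 2 + 1, num_bands)"""
    # magnitude spectrogram: |STFT|, not |STFT| squared (no power spectrum)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # apply the filterbank to reduce the frequency bins to a few bands
    filt_spec = np.dot(spec, filterbank)
    # logarithmic scaling after adding a constant of 1 (log base is my choice)
    return np.log10(1 + filt_spec)
```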
Makes sense, thank you! For the first version of the TensorFlow port I think I will stick to your original paper to measure metrics against the ones provided. I currently have the BLSTM model set up, but I was hoping to get a couple of things cleared up.
First, the beat-annotated Ballroom data you provide seems to specify the times at which the beats occur. Since you trained your BLSTM as a binary classifier, these times would be 'beat' and all other times would be 'no_beat'. As we break the audio into frames, it is highly unlikely that the frame start times will align with the beat times exactly, so I assume you use some kind of tolerance in time? Could you maybe check what it is? I suppose I just need some clarification as to how exactly you labeled your training set.
Second, you mention that a single input vector consists of 120 real numbers. As inputs to RNNs are usually sequences, am I correct in assuming that your training set size is the number of frames for a single audio file, that your sequence length is 120, and that the input dimension is 1 (the inputs are just scalars)? After the network has learned from a specific audio file, did you save the settings and rerun the process for the next audio file?
Finally, you mention taking medians of the Mel bands across time frames, where the number of frames you take is given by 'frame_size' / floor(100). This is fine for the nth frame where n > 'frame_size' / floor(100), but what did you do for the earlier frames? Did you just take as many previous frames as were available?
Cheers, Sergey
The difference in performance between Mel and logarithmic filters should be negligible.
I simply use a frame rate of 100 frames per second. As targets, every frame that contains a beat annotation is labelled 'beat', all others 'no_beat'.
For the inputs, every frame is a 120-dimensional vector, i.e. the Mel-filtered magnitude spectrograms with the different window lengths as well as the positive first order differences stacked on top of each other. The median is computed only with the available frames for the first couple of frames, yes.
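To make that stacking concrete, here is a rough sketch under my own assumptions; the number of window lengths, the band count and the size of the median window are placeholders, not values from this thread:

```python
import numpy as np

def stack_features(filt_specs, diff_frames=4):
    """Sketch: build per-frame feature vectors from filtered log-magnitude
    spectrograms computed with different STFT window lengths.
    filt_specs:  list of arrays, each of shape (num_frames, num_bands)
    diff_frames: how many past frames the running median covers (placeholder)
    Returns an array of shape (num_frames, 2 * sum of all num_bands)."""
    features = []
    for spec in filt_specs:
        diff = np.zeros_like(spec)
        for n in range(1, len(spec)):
            # median over the available previous frames (see discussion above)
            ref = np.median(spec[max(0, n - diff_frames):n], axis=0)
            # keep only the positive part of the difference
            diff[n] = np.maximum(spec[n] - ref, 0)
        features.extend([spec, diff])
    # e.g. 3 window lengths x 20 bands, spectrogram + difference = 120 dims
    return np.hstack(features)
```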
My training set consisted of N sequences, with N being the number of annotated audio pieces. Each sequence has the full length of the respective piece (so its length depends on the audio length). For training we used SGD, adjusting the weights after each sequence but shuffling the sequences for each epoch.
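As an illustration of that training scheme, a hedged tf.keras sketch; the layer sizes reflect my reading of the paper, and the learning rate, epoch count and dummy data are placeholders:

```python
import numpy as np
import tensorflow as tf

# Assumed architecture: 3 bidirectional layers of 25 LSTM units each,
# followed by a sigmoid output per frame; adjust to match the paper.
model = tf.keras.Sequential(
    [tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(25, return_sequences=True),
        input_shape=(None, 120))] +
    [tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(25, return_sequences=True)) for _ in range(2)] +
    [tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss='binary_crossentropy')

# sequences: one (features, targets) pair per annotated piece; random
# placeholders here just to keep the sketch self-contained
sequences = [(np.random.rand(3000, 120).astype(np.float32),
              np.zeros((3000, 1), dtype=np.float32)) for _ in range(5)]

for epoch in range(10):              # number of epochs is arbitrary here
    np.random.shuffle(sequences)     # reshuffle the sequences every epoch
    for features, targets in sequences:
        # batch size 1: weights are adjusted after each complete sequence
        model.train_on_batch(features[np.newaxis], targets[np.newaxis])
```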
The annotated data set contains seconds (decimal) in the first column and beat numbers in the second. 100 fps will result in several overlapping frames that might cover an annotated beat time. Do all of them count as 'beat' or do you only look at the time a frame starts?
When you say each sequence has full length, you mean the total number of frames, and each 'entry' in the sequence is then a 120-dimensional vector (the inputs)?
I apologize... just trying to get all the details cleared up at once.
I only use a single target frame per beat, e.g. if the first beat is '0.3 1' (beat number 1 at 0.3 seconds), then only frame number 30 would be a 'beat', all others 'no_beat'. Yes, each sequence has a shape of (N, 120), with N being the number of frames of that piece.
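A minimal sketch of that labelling rule (the function name and the rounding to the nearest frame are my choices):

```python
import numpy as np

def beats_to_targets(beat_times, num_frames, fps=100):
    """Mark the single frame closest to each annotated beat time (seconds)
    as 'beat' (1), all other frames as 'no_beat' (0)."""
    targets = np.zeros(num_frames, dtype=np.float32)
    frames = np.round(np.asarray(beat_times) * fps).astype(int)
    targets[frames[frames < num_frames]] = 1
    return targets

# e.g. a beat annotated at 0.3 s maps to frame 30 at 100 fps
targets = beats_to_targets([0.3, 0.96], num_frames=200)
```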
Perfect! Thanks! I'll keep you updated if the network learns something interesting.
@siryog90 Any luck doing the nn in tensorflow? Hoping I can do some of this on gpu 😛
@vertgo Yeah, I had put together the computational graph on the CPU. The graph itself is simple. Just be mindful of how you batch up training data if you are going to be training on multiple GPUs in parallel.
I can dig in and try to put something together for you in tf over the weekend.
This may be slightly related: I am considering re-training the downbeat LSTM for a specific type of music. I was wondering how you deal with the class imbalance given that I guess >95% of frames will be in the "no beat" class.
@siryog90 : any luck so far on tensorflow?
For beats it works as it is; you could, however, widen the targets to include not only a single target frame, but also one or two neighbouring ones. If you use targets wider than two frames, it has been found advantageous to weight the additional frames with 0.25 to 0.5. HTH.
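One possible way to implement such widened, weighted targets; the default parameters are mine and should be set according to the weighting suggested above:

```python
import numpy as np

def widen_targets(targets, width=1, weight=0.5):
    """Additionally mark `width` neighbouring frames on each side of every
    beat frame, weighted by `weight`; `targets` is a 1-D array with 1 at
    beat frames and 0 elsewhere."""
    widened = targets.astype(np.float32).copy()
    beat_frames = np.nonzero(targets)[0]
    for offset in range(1, width + 1):
        for frames in (beat_frames - offset, beat_frames + offset):
            frames = frames[(frames >= 0) & (frames < len(targets))]
            # never lower the weight of an actual beat frame
            widened[frames] = np.maximum(widened[frames], weight)
    return widened
```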
Thanks - this is very useful information. I will be using clips that are approximately the length of the data used for training (around 30s). Any advice on additional hyper-parameters, e.g. learning rate?
Sorry, the learning rate was already mentioned and is also in the paper. I meant batch size, regularisers, dropout etc. LSTMs are often difficult to train and a "recipe" may (or may not) generalise across datasets.
You have to test different learning rates. I trained with SGD, i.e. a batch size of 1. No regularisers or dropout whatsoever.
Thanks! I will give it a try!
Hello guys. Fantastic work with the project. I am trying to port some parts of the project to TensorFlow, and I am currently implementing the RNNBeatTracker RNN based on the paper you have linked. I am wondering if there is any particular reason you chose to filter the spectrogram with a LogarithmicFilterBank instead of a MelFilterBank? Also, is there any reason why the power spectrum in the Spectrogram class is computed simply by taking the absolute value instead of squaring the absolute value of the Short-Time Fourier Transform?