Model architecture: Transformer (4 layers; 128 input dim, 8 heads, 256 hidden dim)
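A minimal sketch of what that architecture could look like, assuming PyTorch; the learned positional embedding, the per-frame sigmoid output, and names like `BeatTransformer` are my guesses, not the author's code:

```python
import torch
import torch.nn as nn

class BeatTransformer(nn.Module):
    """4-layer Transformer encoder: d_model=128, 8 heads, feed-forward dim 256.
    Maps a (batch, 300, feat_dim) window of frame features to one
    beat/no-beat logit per 0.1 s frame. Hypothetical sketch."""

    def __init__(self, feat_dim: int, d_model: int = 128, n_heads: int = 8,
                 ff_dim: int = 256, n_layers: int = 4, max_len: int = 300):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)   # project frame features to 128 dims
        self.pos_emb = nn.Embedding(max_len, d_model)    # one learned embedding per 0.1 s step
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                # per-frame beat logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 300, feat_dim) -- one row per 0.1 s time step
        positions = torch.arange(x.size(1), device=x.device)
        h = self.input_proj(x) + self.pos_emb(positions)
        h = self.encoder(h)
        return self.head(h).squeeze(-1)                  # (batch, 300) logits
```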
Using time step embedding -- 1 input token = 0.1 sec, each sequence is 30 seconds (300 tokens)
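As a concrete (made-up) example of what the 0.1-sec tokenization could look like: each frame is a 128-dim binary vector marking which MIDI pitches have a note onset in that frame, and the frame matrix is chopped into 300-token windows. The `pretty_midi` library and this particular feature layout are assumptions on my part:

```python
import numpy as np
import pretty_midi

FRAME = 0.1       # seconds per input token
SEQ_LEN = 300     # 30-second sequences -> 300 tokens
N_PITCHES = 128   # one feature per MIDI pitch (assumed layout)

def midi_to_frames(path: str) -> np.ndarray:
    """frames[t, p] = 1 if pitch p has a note onset in the t-th 0.1 s frame."""
    pm = pretty_midi.PrettyMIDI(path)
    n_frames = int(np.ceil(pm.get_end_time() / FRAME))
    frames = np.zeros((n_frames, N_PITCHES), dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            t = int(note.start / FRAME)
            if t < n_frames:
                frames[t, note.pitch] = 1.0
    return frames

def to_sequences(frames: np.ndarray) -> np.ndarray:
    """Chop into non-overlapping 300-token windows, zero-padding the last one."""
    pad = (-len(frames)) % SEQ_LEN
    frames = np.pad(frames, ((0, pad), (0, 0)))
    return frames.reshape(-1, SEQ_LEN, N_PITCHES)
```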
This is the result of training this model for 100 epochs:

with a recall of ~76%. (Accuracy and loss are weighted by how many 1s and 0s there are in the groundtruth.)
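To illustrate the kind of weighting this refers to (assuming per-frame binary cross-entropy in PyTorch; the exact scheme is a guess): beat frames get up-weighted by the 0-to-1 ratio, and accuracy is averaged per class so that the many 0 frames can't dominate the score.

```python
import torch
import torch.nn as nn

def make_weighted_bce(labels: torch.Tensor) -> nn.BCEWithLogitsLoss:
    """Up-weight the rare beat frames (1s) relative to the non-beat frames (0s)."""
    n_pos = labels.sum()
    n_neg = labels.numel() - n_pos
    pos_weight = n_neg / n_pos.clamp(min=1)   # ratio of 0s to 1s in the groundtruth
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

def balanced_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average of per-class accuracies, so predicting all 0s doesn't score ~90%."""
    preds = (torch.sigmoid(logits) > 0.5).float()
    acc_beat = (preds[labels == 1] == 1).float().mean()
    acc_rest = (preds[labels == 0] == 0).float().mean()
    return (acc_beat + acc_rest) / 2
```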
Applying this trained model on an audio-generated MIDI gives the following result (red lines = predicted beats):
Hmm, downbeats are mostly there (some missing), but upbeats are seldom predicted right, and there's also a bunch of noise.
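For reference, here is one way the red-line predictions could be read off a model like the sketch above: run a 30-second window through the trained model, take per-frame sigmoid probabilities, and keep frames above a threshold. The 0.5 threshold and the lack of any peak-picking are assumptions, not the author's settings:

```python
import torch

@torch.no_grad()
def predict_beat_times(model, frames, threshold: float = 0.5):
    """Return beat times (seconds) for one (300, feat_dim) window of frame features."""
    model.eval()
    x = torch.as_tensor(frames, dtype=torch.float32).unsqueeze(0)  # (1, 300, feat_dim)
    probs = torch.sigmoid(model(x))[0]                             # (300,) beat probabilities
    beat_frames = (probs > threshold).nonzero(as_tuple=True)[0]
    return [float(t) * 0.1 for t in beat_frames]                   # frame index -> seconds
```

Some minimum-spacing or peak-picking constraint on top of this thresholding would probably clean up a chunk of the spurious detections noted above.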