Model architecture: Transformer (4 layers; 128 input dim, 8 heads, 256 hidden dim)
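A minimal sketch of what that architecture could look like, assuming PyTorch; the learned positional embedding, the per-frame sigmoid output, and names like `BeatTransformer` are my guesses, not the author's code:

```python
import torch
import torch.nn as nn

class BeatTransformer(nn.Module):
    """4-layer Transformer encoder: d_model=128, 8 heads, feed-forward dim 256.
    Maps a (batch, 300, feat_dim) window of frame features to one
    beat/no-beat logit per 0.1 s frame. Hypothetical sketch."""

    def __init__(self, feat_dim: int, d_model: int = 128, n_heads: int = 8,
                 ff_dim: int = 256, n_layers: int = 4, max_len: int = 300):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)   # project frame features to 128 dims
        self.pos_emb = nn.Embedding(max_len, d_model)    # one learned embedding per 0.1 s step
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                # per-frame beat logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 300, feat_dim) -- one row per 0.1 s time step
        positions = torch.arange(x.size(1), device=x.device)
        h = self.input_proj(x) + self.pos_emb(positions)
        h = self.encoder(h)
        return self.head(h).squeeze(-1)                  # (batch, 300) logits
```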
Using time step embedding -- 1 input token = 0.1 sec, each sequence is 30 seconds (300 tokens)
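As a concrete (made-up) example of what the 0.1-sec tokenization could look like: each frame is a 128-dim binary vector marking which MIDI pitches have a note onset in that frame, and the frame matrix is chopped into 300-token windows. The `pretty_midi` library and this particular feature layout are assumptions on my part:

```python
import numpy as np
import pretty_midi

FRAME = 0.1       # seconds per input token
SEQ_LEN = 300     # 30-second sequences -> 300 tokens
N_PITCHES = 128   # one feature per MIDI pitch (assumed layout)

def midi_to_frames(path: str) -> np.ndarray:
    """frames[t, p] = 1 if pitch p has a note onset in the t-th 0.1 s frame."""
    pm = pretty_midi.PrettyMIDI(path)
    n_frames = int(np.ceil(pm.get_end_time() / FRAME))
    frames = np.zeros((n_frames, N_PITCHES), dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            t = int(note.start / FRAME)
            if t < n_frames:
                frames[t, note.pitch] = 1.0
    return frames

def to_sequences(frames: np.ndarray) -> np.ndarray:
    """Chop into non-overlapping 300-token windows, zero-padding the last one."""
    pad = (-len(frames)) % SEQ_LEN
    frames = np.pad(frames, ((0, pad), (0, 0)))
    return frames.reshape(-1, SEQ_LEN, N_PITCHES)
```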
This is the result of training this model for 100 epochs:

with a recall of ~76%. (Accuracy and loss are weighted by how many 1s and 0s there are in the groundtruth.)
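To illustrate the kind of weighting this refers to (assuming per-frame binary cross-entropy in PyTorch; the exact scheme is a guess): beat frames get up-weighted by the 0-to-1 ratio, and accuracy is averaged per class so that the many 0 frames can't dominate the score.

```python
import torch
import torch.nn as nn

def make_weighted_bce(labels: torch.Tensor) -> nn.BCEWithLogitsLoss:
    """Up-weight the rare beat frames (1s) relative to the non-beat frames (0s)."""
    n_pos = labels.sum()
    n_neg = labels.numel() - n_pos
    pos_weight = n_neg / n_pos.clamp(min=1)   # ratio of 0s to 1s in the groundtruth
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

def balanced_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average of per-class accuracies, so predicting all 0s doesn't score ~90%."""
    preds = (torch.sigmoid(logits) > 0.5).float()
    acc_beat = (preds[labels == 1] == 1).float().mean()
    acc_rest = (preds[labels == 0] == 0).float().mean()
    return (acc_beat + acc_rest) / 2
```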
Applying this trained model on an audio-generated MIDI gives the following result (red lines = predicted beats):
Hmm, downbeats are mostly there (some missing), but upbeats are seldom predicted right, and there's also a bunch of noise.
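For reference, here is one way the red-line predictions could be read off a model like the sketch above: run a 30-second window through the trained model, take per-frame sigmoid probabilities, and keep frames above a threshold. The 0.5 threshold and the lack of any peak-picking are assumptions, not the author's settings:

```python
import torch

@torch.no_grad()
def predict_beat_times(model, frames, threshold: float = 0.5):
    """Return beat times (seconds) for one (300, feat_dim) window of frame features."""
    model.eval()
    x = torch.as_tensor(frames, dtype=torch.float32).unsqueeze(0)  # (1, 300, feat_dim)
    probs = torch.sigmoid(model(x))[0]                             # (300,) beat probabilities
    beat_frames = (probs > threshold).nonzero(as_tuple=True)[0]
    return [float(t) * 0.1 for t in beat_frames]                   # frame index -> seconds
```

Some minimum-spacing or peak-picking constraint on top of this thresholding would probably clean up a chunk of the spurious detections noted above.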