Evan-Zhao opened this issue 1 year ago
MP3 to MIDI: now using the piano transcription inference framework from ByteDance. Related paper: High-Resolution Piano Transcription with Pedals by Regressing Onset and Offset Times (2020).
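For reference, a minimal sketch of the transcription step, assuming the `piano_transcription_inference` pip package that wraps the ByteDance model (file names are placeholders):

```python
# Sketch only: assumes the piano_transcription_inference package
# (pip install piano-transcription-inference); paths are placeholders.
from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

# Load the recording at the sample rate the model expects.
audio, _ = load_audio("recording.mp3", sr=sample_rate, mono=True)

# Run the onset/offset regression model and write the result to a MIDI file.
transcriptor = PianoTranscription(device="cuda")  # or device="cpu"
transcriptor.transcribe(audio, "recording.mid")
```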
MIDI to MusicXML: having tried PM2S (ISMIR 2022), whose accuracy is lacking in beat tracking and note quantization, we're now looking for better frameworks to do this. PM2S also doesn't provide any code for the transformation from its model output to MusicXML, so we'll need to develop that ourselves or find some packages to do it.
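One candidate worth noting is music21, which can at least round-trip MIDI to MusicXML; a minimal sketch (music21 quantizes onsets to a fixed grid on import, so this is a fallback, not a fix for the beat-tracking problem):

```python
# Baseline MIDI -> MusicXML with music21 (pip install music21).
# music21 snaps onsets to a fixed grid on import, so this does not
# by itself solve beat tracking / note quantization.
from music21 import converter

score = converter.parse("recording.mid")          # parse the transcribed MIDI
score.write("musicxml", fp="recording.musicxml")  # serialize as MusicXML
```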
Beat tracking: we are unsatisfied with the accuracy of PM2S beat tracking and have decided to do it ourselves. The goal is to train a transformer model to predict where the beats land in the time series of the input MIDI. We proposed two input embedding schemes for what constitutes a "token" that makes up the sequence (a sketch of both follows the list):

1. Time-slice tokens: the input MIDI is sampled every T (= 0.05) seconds; a token is a 128-dim vector denoting which notes are "on" at that moment.
2. Onset tokens: same as above, but taking `torch.diff()` along the time axis (sustained notes are diffed away, leaving only note onsets and offsets).
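A minimal sketch of both schemes, assuming we read the MIDI with pretty_midi (a hypothetical choice of reader) into a binary piano roll:

```python
import pretty_midi  # hypothetical reader choice; any MIDI parser would do
import torch

T = 0.05  # frame step in seconds

def time_slice_tokens(midi_path: str) -> torch.Tensor:
    """Scheme 1: one token per T-second frame; each token is a 128-dim
    0/1 vector of the notes sounding in that frame."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    roll = pm.get_piano_roll(fs=1.0 / T)           # (128, num_frames) velocities
    return (torch.from_numpy(roll).T > 0).float()  # (num_frames, 128) binary

def onset_tokens(roll: torch.Tensor) -> torch.Tensor:
    """Scheme 2: torch.diff() along time; sustained notes are diffed away,
    leaving +1 at note onsets and -1 at offsets."""
    pad = torch.zeros(1, roll.shape[1])
    return torch.diff(roll, dim=0, prepend=pad)
```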
Update: the embedding currently used is too sparse and does not produce adequate results, i.e., training loss does not decrease in a reasonable time; reasons for this are not yet clear. Upon further inspection, it seems that the majority of work on beat tracking takes audio directly as input -- such as the state of the art, BeatNet. Makes sense, right? Could we directly use such a tool (in parallel to our audio-to-MIDI framework), or use some ideas from those systems to make our own beat tracking better? Meanwhile, we are exploring a note-based embedding with positional encoding based on time stamps, where every note is a token (see the sketch below).
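A minimal sketch of the note-based direction, with a hypothetical token layout: one token per note, built from a learned pitch embedding plus a sinusoidal encoding of the continuous onset time stamp in place of the usual index-based positional encoding:

```python
import math
import torch
import torch.nn as nn

class NoteTokenEmbedding(nn.Module):
    """One token per note: pitch embedding + sinusoidal encoding of the
    onset time in seconds (continuous-time positional encoding)."""

    def __init__(self, d_model: int = 256, time_scale: float = 100.0):
        super().__init__()
        assert d_model % 2 == 0
        self.pitch_emb = nn.Embedding(128, d_model)  # one entry per MIDI pitch
        self.d_model = d_model
        self.time_scale = time_scale  # assumed finest resolution of 0.01 s

    def time_encoding(self, onsets: torch.Tensor) -> torch.Tensor:
        # onsets: (num_notes,) float onset times in seconds
        half = self.d_model // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        angles = onsets[:, None] * self.time_scale * freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, pitches: torch.Tensor, onsets: torch.Tensor) -> torch.Tensor:
        # pitches: (num_notes,) int64 in [0, 128); onsets: (num_notes,) seconds
        return self.pitch_emb(pitches) + self.time_encoding(onsets)
```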