ViSonic-NN / muscribe

MIT License

End-to-end Results using BeatNet #4

Open Evan-Zhao opened 1 year ago

Evan-Zhao commented 1 year ago

Combining piano transcription inference with BeatNet for beat tracking gives decent results. Both take the audio directly as input; the former produces MIDI, and the latter produces beat locations plus in-measure beat counts. We're now using them together with PM2S's key signature and hand part detection, and a small script of ~300 lines combines these 4 pieces of information into a MusicXML score. The end result is usable, and polishing it in score typesetting software (like Musescore) takes a fraction of the time needed to transcribe manually from scratch.

Although the output quality is acceptable in many cases, there is still substantial room for improvement. Notably, errors in rhythm quantization and hand part tracking take the longest to fix manually afterwards.

Rhythm quantization. Currently, the onset and duration of a note, in quarter notes, are computed heuristically from $$q_n = \text{frac}^*\left(\frac{d_n}{b_e - b_s}\right)$$

where $d_n$ is the note's duration, and $b_s$ and $b_e$ are the start and end of the beat the note falls in. $\text{frac}^*$ is a heuristic function that selects a denominator from $(1, 2, 4, 8)$, preferring small denominators while still keeping the rounding error reasonable. It is not perfect and is particularly sensitive to (1) in-measure rubato and (2) even small errors in beat locations.
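To make the heuristic concrete, here is a minimal Python sketch of what a function like $\text{frac}^*$ could look like. It is not the code from the actual script: the names `frac_star` and `quantize_duration`, the `tolerance` value, the rule against zero-length notes, and the assumption that one beat equals one quarter note are all illustrative choices.

```python
from bisect import bisect_right
from fractions import Fraction

DENOMINATORS = (1, 2, 4, 8)  # candidate denominators, smallest first


def frac_star(x, tolerance=0.06):
    """Round x (a length in beats) to n/d with d in DENOMINATORS.

    Returns the first (smallest-denominator) candidate whose rounding
    error is within `tolerance`; otherwise the closest candidate seen.
    `tolerance` is an assumed value, not taken from the actual script.
    """
    best_error, best = None, Fraction(1, DENOMINATORS[-1])
    for d in DENOMINATORS:
        candidate = Fraction(round(x * d), d)
        if candidate == 0:
            continue  # never quantize a sounding note to zero length
        error = abs(float(candidate) - x)
        if error <= tolerance:
            return candidate
        if best_error is None or error < best_error:
            best_error, best = error, candidate
    return best


def quantize_duration(onset, duration, beats):
    """Quantize a note's duration against the beat its onset falls in.

    `beats` is a sorted list of beat times in seconds; each beat is
    assumed to be one quarter note long.
    """
    # Index of the last beat at or before the onset, clamped so that
    # beats[i + 1] always exists.
    i = min(max(bisect_right(beats, onset) - 1, 0), len(beats) - 2)
    b_s, b_e = beats[i], beats[i + 1]
    return frac_star(duration / (b_e - b_s))
```

With beats at 0.0, 0.5 and 1.0 s, a note lasting 0.225 s in the second beat quantizes to 1/2, while the same note played at 0.20 s quantizes to 3/8. A 25 ms difference is enough to change the notated rhythm, which is the sensitivity to rubato and beat errors described above.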

More accurate rhythm quantization could come from joint beat and rhythm detection (as in PM2S, although its results are poor compared to BeatNet's), and also from exploiting structural similarities between neighboring measures: [image]

Above is an example of output from the current rhythm quantization method. The rhythm pattern is the same in all 6 measures, but because of slight rubato across the measures, the rhythms are notated with slight differences (which are annoying to fix manually).
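One hypothetical way to use that repetition is a majority vote over measures. This is only a sketch of the idea, not something the current script does: the function name `snap_to_common_rhythm` and the thresholds `min_share` and `max_drift` are made up for illustration, and for simplicity the whole input list is treated as the neighborhood.

```python
from collections import Counter
from fractions import Fraction


def snap_to_common_rhythm(measures, min_share=0.5, max_drift=Fraction(1, 4)):
    """Replace near-miss measures with the dominant rhythm pattern.

    `measures` is a list of tuples of note durations (in quarters).
    Measures are grouped by note count; within a group, a measure is
    snapped to the most common pattern only if that pattern covers more
    than `min_share` of the group and no note differs by more than
    `max_drift`. Both thresholds are assumed values.
    """
    groups = {}
    for m in measures:
        groups.setdefault(len(m), Counter())[tuple(m)] += 1

    result = []
    for m in measures:
        counts = groups[len(m)]
        pattern, votes = counts.most_common(1)[0]
        dominant = votes / sum(counts.values()) > min_share
        close = all(abs(a - b) <= max_drift for a, b in zip(m, pattern))
        result.append(pattern if dominant and close else tuple(m))
    return result
```

A measure quantized as (3/8, 1/2, 1) among several measures of (1/2, 1/2, 1) would be snapped to the common pattern, while a measure that differs by more than `max_drift` on any note is left alone.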

Hand part tracking. The RNN that PM2S provides seems to rely more on pitch than on an understanding of voices (i.e., which notes belong to the melody and which to the accompaniment), and it has a hard time when the voices cross. Example: [image]

when it should be: [image]