DDMAL / salami-data-public

Parsing instruments as functions [bug?] #12

Open bmcfee opened 8 years ago

bmcfee commented 8 years ago

Splitting off from #10: the instrument annotations are distinct from "functions", and as far as I understand, should not be included in _functions annotations. @jblsmith correct me if I'm wrong here?

If that's the case, there are several parsing errors in the pre-parsed csv files (see below).

At one point, I had cleaned this stuff up in a notebook, but I think it would be better all around to just fix or rewrite the parser. (Since I don't speak Ruby, I'd just redo it in Python.) Any objections?


~/data/salami-data-public/annotations $ cut -f2 */parsed/*_functions.txt | sort | uniq
a'/a''
applause
backing
bagpipes
banjo
bass
b/c'
break
Break
Bridge
build
call_and_response
Chorus
Coda
Code
contrasting_middle
count-in
crowd_sounds
da_capo
d/b
Development
dialog
End
Exposition
Fade-out
female
first
groove
guitar
gypsy
hammond
harpsichord
Head
Instrumental
Interlude
Intro
Main_Theme
male
muted
no_function
organ
ostinato
out
outro
Outro
&pause
piano
pick-up
piece_1
piece_2
post-cadential
post-chorus
post-verse
Pre-Chorus
Pre-Verse
Recap
response
ritornello
Secondary_theme
Secondary_Theme
silence
Silence
Solo
spoken
spoken_voice
stage_sounds
stage_speaking
steel
strings
tag
Theme
third
Transition
trumpet
variation
variation_1
variation_2
Verse
violin
vocalizations
vocals
voice
w/dialog

jblsmith commented 8 years ago

There isn't a bug in the parsing here, per se. These instrument labels end up among the function labels because each one is missing a bracket (either opening or closing), and the bracket is what distinguishes instrument tags from functions. Instrument tags are supposed to be open, close, or self-contained, like "(trumpet", "bass)" or "(vocal)".
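To make the bracket convention concrete, here is a minimal sketch of how a single token could be classified. The function name and the returned category strings are made up for illustration and are not taken from the actual parser.

```python
def classify_label(token: str) -> str:
    """Classify one annotation token purely by its brackets.

    Hypothetical helper for illustration; not the SALAMI parser's API.
    """
    token = token.strip()
    opens = token.startswith("(")
    closes = token.endswith(")")
    if opens and closes:
        return "instrument_self_contained"  # e.g. "(vocal)"
    if opens:
        return "instrument_open"            # e.g. "(trumpet"
    if closes:
        return "instrument_close"           # e.g. "bass)"
    # No bracket at all: treated as a function (or letter) label.
    return "function_or_letter"


for tok in ["(trumpet", "bass)", "(vocal)", "Chorus", "trumpet"]:
    print(tok, "->", classify_label(tok))
```

Note that a bare "trumpet" falls through to the last case, which is how an instrument name with no bracket at all would be read as a function.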

The function vocabulary is intended to be limited to a set of, I think, around 20 terms. But at certain points in the development of the data, we allowed more terms (and didn't go back and revise all the existing data as the process evolved), and annotators occasionally used their best judgement to add more (like "piece_1" and "piece_2" for when a particular track really seemed to consist of two independent songs).

jblsmith commented 8 years ago

Anyway, as it happens I already re-wrote the parser in Python; I should polish it up and upload it.

bmcfee commented 8 years ago

I see -- so, had the functions actually kept to a closed vocab, it would be possible to tease out instrumentation without needing parentheses. But, as the data currently exists, that's not possible?

It does seem like there are some legitimate bugs though: e.g., "d/b" or "b/c'", which really are lower-case segments that the annotator couldn't decide on. These seem rare though, and maybe could be fixed with a little bit of additional manual inspection?

At any rate, it does seem like the behavior here is not as intended: things that are not "functions" end up in the "functions" annotations, primarily because of missing parens. Would it be possible to go through and add parens to anything that's obviously non-functional? This could be done programmatically without too much work, since we have a finite sample and a relatively small vocab.
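A rough sketch of what that programmatic pass might look like, assuming a hand-picked set of obviously non-functional labels. The set below is a hypothetical subset drawn from the listing earlier in this thread, and the function name is invented for illustration.

```python
# Hypothetical, hand-picked subset of labels from the listing that are
# clearly instruments rather than functions. Not an official vocabulary.
KNOWN_NON_FUNCTIONS = {
    "bagpipes", "banjo", "guitar", "harpsichord",
    "organ", "piano", "trumpet", "violin",
}


def parenthesize_if_instrument(label: str) -> str:
    """Wrap a bare label in parentheses if it is a known instrument."""
    bare = label.strip()
    if bare.lower() in KNOWN_NON_FUNCTIONS:
        return f"({bare})"
    # Anything else (functions, already-bracketed tags) is left untouched.
    return label


print([parenthesize_if_instrument(x) for x in ["trumpet", "Chorus", "piano"]])
```

Anything not in the set is returned unchanged, so the pass stays conservative and the ambiguous cases are left for manual inspection.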

jblsmith commented 8 years ago

I think that to truly fix the annotations will require manual inspection. However, you could apply conservative patches in the meantime.

For functions: ignore function words that fall outside the agreed-upon vocabulary (see page 9 of the Annotator's Guide).

For instruments: close up any tags that are left open or were never opened. For example, an unclosed "(vocal" becomes "(vocal)", an unopened "vocal)" becomes "(vocal)".

But the conservative patch you apply would depend on your usage. If you're training a neural net with positive and negative examples of clips with certain leading instrumentation, your strategy may change.
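As a sketch only, the two conservative patches could look something like this. `FUNCTION_VOCAB` is a placeholder and is not the vocabulary from page 9 of the Annotator's Guide; substitute the real list before using it. Both function names are hypothetical.

```python
# Placeholder vocabulary for illustration only. This is NOT the list from
# the Annotator's Guide; replace it with the agreed-upon terms.
FUNCTION_VOCAB = {"intro", "verse", "chorus", "bridge", "outro"}


def keep_known_functions(labels, vocab=FUNCTION_VOCAB):
    """Drop function words that fall outside the given vocabulary."""
    return [lab for lab in labels if lab.lower() in vocab]


def close_instrument_tag(tag: str) -> str:
    """Close up a tag that was left open or was never opened."""
    tag = tag.strip()
    if tag.startswith("(") and not tag.endswith(")"):
        return tag + ")"   # unclosed: "(vocal" becomes "(vocal)"
    if tag.endswith(")") and not tag.startswith("("):
        return "(" + tag   # unopened: "vocal)" becomes "(vocal)"
    return tag             # already balanced, or not a tag at all


print(keep_known_functions(["Verse", "piano", "Chorus"]))
print(close_instrument_tag("(vocal"), close_instrument_tag("vocal)"))
```

The comparison is case-insensitive because the listing contains both "Outro" and "outro"; whether those should be treated as the same term is itself an assumption.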

jblsmith commented 8 years ago

To elaborate: I would like to go in and manually fix some of the instrument issues, and any function / letter label issues that clearly derive from typos. But some issues I would rather leave unfixed, like an annotator using a special function word, or throwing up their hands and saying "b/c" in lieu of "b" or "c". In these cases, I would rather just make available a standardized (but not human-authored) version.

For functions, this would mean choosing a mapping of all non-standard function words to standard words, the same way the Annotator's Guide anticipates that you could simplify the set {"pre-verse", "pre-chorus", "interlude", "transition"} --> "transition".

And for non-standard letters, a standard could be: "assign ambiguous cases a new letter."
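A minimal sketch of both standardization rules. The only mapping entries shown are the ones named above; everything else (function names, treating any label containing "/" as ambiguous, and giving repeated identical ambiguous labels the same new letter) is an assumption made for illustration.

```python
import string

# Only the simplification mentioned above; a full mapping would need
# an entry for every non-standard function word.
FUNCTION_MAP = {
    "pre-verse": "transition",
    "pre-chorus": "transition",
    "interlude": "transition",
    "transition": "transition",
}


def standardize_function(label: str) -> str:
    """Map a function word to its standard form, if one is defined."""
    return FUNCTION_MAP.get(label.lower(), label)


def relabel_ambiguous_letters(labels):
    """Replace each ambiguous lower-case label (assumed: contains '/')
    with a letter that appears nowhere else in the track's labels."""
    used = {ch for lab in labels for ch in lab if ch in string.ascii_lowercase}
    # Raises StopIteration if all 26 letters are already taken.
    free = (c for c in string.ascii_lowercase if c not in used)
    assigned = {}
    out = []
    for lab in labels:
        if "/" in lab:
            if lab not in assigned:
                assigned[lab] = next(free)
            out.append(assigned[lab])
        else:
            out.append(lab)
    return out


print(standardize_function("Pre-Chorus"))
print(relabel_ambiguous_letters(["a", "b", "b/c'", "c", "b/c'", "d/b"]))
```

Letters that appear inside an ambiguous label (such as the "d" in "d/b") are also counted as used, so a new letter never collides with one the annotator was considering.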

bmcfee commented 8 years ago

> I think that to truly fix the annotations will require manual inspection. However, you could apply conservative patches in the meantime.

Okay. I've done this kind of thing before, but I would much prefer that we have a standardized "clean" version that everyone can use.

> For instruments: close up any tags that are left open or were never opened. For example, an unclosed "(vocal" becomes "(vocal)", an unopened "vocal)" becomes "(vocal)".

Isn't that just equivalent to discarding the annotation? The labels should apply to intervals, not boundaries, so if you have a single boundary marked as (vocal), then the duration over which that label applies is 0.

More generally: how do you feel about migrating the whole thing to an interval-based representation instead of boundaries? It would solve a lot of headaches.
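For illustration, converting a boundary list into intervals might look like the sketch below. It assumes each label applies from its boundary until the next one, and that the final boundary only marks the end of the track; the function name and the example times are made up.

```python
def boundaries_to_intervals(boundaries):
    """Turn [(time, label), ...] into [(start, end, label), ...].

    Assumes `boundaries` is sorted by time and that the last entry only
    marks the end of the track, so its label is not carried forward.
    """
    intervals = []
    for (start, label), (end, _) in zip(boundaries, boundaries[1:]):
        intervals.append((start, end, label))
    return intervals


# Invented example data, not taken from any SALAMI annotation file.
example = [(0.0, "Silence"), (0.5, "Intro"), (12.0, "Verse"), (30.0, "End")]
for interval in boundaries_to_intervals(example):
    print(interval)
```

With intervals, every label has an explicit duration, so a tag that opens and closes at the same boundary shows up as a zero-length interval instead of silently disappearing.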

> In these cases, I would rather just make available a standardized (but not human-authored) version.

Agreed. Any ideas on how to do that? They seem like such a limited set of cases that it may as well be done manually, though maybe things are different on the full (private) dataset.

> For functions, this would mean choosing a mapping of all non-standard function words to standard words, the same way the Annotator's Guide anticipates that you could simplify the set {"pre-verse", "pre-chorus", "interlude", "transition"} --> "transition".

Sure. It seems like your vocabulary file already does the hard work there.

> And for non-standard letters, a standard could be: "assign ambiguous cases a new letter."

What's a "non-standard letter"? You mean like b/c'?