Task Definition - Instrument Detection

cosmir / openmic-annotator

Annotation framework for annotating data for OpenMIC

MIT License

56 stars 1 forks source link

Task Definition - Instrument Detection #19

Open julian-urbano opened 8 years ago

julian-urbano commented 8 years ago

Let's discuss here a definition of the task and use case as precise as possible. The idea is simple, but it can get complicated once we get into it. As I see it, much of the details we'll have regarding data, annotations, metrics, taxonomy and so on, will be restricted by this.

I see two main use cases, and another two within each:

1) Given a piece of music, identify the instruments that are played in it: 1a) For each window of X milliseconds, return a list of instruments being played (extraction). 1b) Simply return the instruments being played anywhere in the music piece (classification). 2) Retrieve a list of music pieces in which a given instrument is played: 2a) Just return the list (retrieval). 2b) Return the list, but for each music piece also provide a clip (start-end) where the instrument is played (question answering-like).

We should agree on what task we are talking about here. My impression is that 1a and 2b make the most sense, but let's discuss.

For cases 1) it seems alright to feed systems with audio clips, but for cases 2) I think full tracks are more appropriate. Also, for 2) I think the user might be someone who is learning to play an instrument and wants samples to practice. Just an idea.

In terms of performance measures and annotations, I think we can for now go with the typical stuff used in extraction, classification, retrieval and question answering, as I indicated. I'd like to keep this thread just about the task definition and use cases, and later one we'll discuss about metrics, audio, etc.

Comments?

dmcennis commented 8 years ago

track level annotation not in blob form is needed. 1a. Parsable annotations, separable in a database. Also needs a search engine back end. 1b. track level annotation, but parsed.
requires parsed annotations and a search engine. 2a. just above 2b. requires timing info in the annotations, also parsed

Reference #5 - these features are not currently in the planning document for the back end.

Use cases are beautiful. The Story drives roadmaps and versions. But we need to connect dots... @ejhumphrey

julian-urbano commented 8 years ago

What do you mean "we need a search engine"?

For now, let's keep this just about task definition. We'll worry about annotations later on.

dmcennis commented 8 years ago

Your task definitions can not be met by the current design. The exercise is beautiful. Just don't expect to run this task inside a year unless you coordinate with Eric.

On Wed, Aug 31, 2016, 8:21 PM Julián Urbano notifications@github.com wrote:

What do you mean "we need a search engine"?

For now, let's keep this just about task definition. We'll worry about annotations later on.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/omec/bedevere/issues/19#issuecomment-243941299, or mute the thread https://github.com/notifications/unsubscribe-auth/AFItP7XQbqX-0x6L1YRh7Df5nOjwL-zAks5qlhqmgaJpZM4JyLcH .

bmcfee commented 8 years ago

@dmcennis, Julian is a coauthor on the paper that started this project, along with myself and @ejhumphrey. "Coordination", as you put it, ought to be obvious here.

ejhumphrey commented 8 years ago

@julian-urbano to your question, 1a converges to 1b as the time window increases in length, no? so aren't both 1a and 1b classification?

In any event, the use case I keep coming back to is answering the question "what instruments are active in a given sound recording?"

With this in mind, I've been strongly advocating for 1a, on the order of 5-30 seconds. This follows an acoustic parallel to ImageNet 2012, where we care what occurs in a larger observation, but not necessarily where (also like recent iterations of ImageNet, I could see this being an extension in the future). There are then three (non-exclusive) ways to represent what instruments occur during a fixed time-length observation:

i. instrument classes are ranked in descending order of relevance, i.e. ranked retrieval ii. only the relevant instrument classes are returned, i.e. multi-class prediction iii. affinities / likelihoods are returned for all classes, i.e. regression, as we try to model some averaged human response like the Galaxy Zoo Challenge

Again, I'm happy to draw inspiration from ImageNet2012's lead, which takes a hybrid of (i) and (ii), described here, the tl;dr: being

For each image, algorithms will produce a list of at most 5 object categories in the descending order of confidence.

This raises two possible differences from the visual recognition space: one, visual scenes can have a lot going on, so there's no natural upper-bound on how many classes might occur; and two, there's no guarantee every object will be labeled in a scene.

Does this get us closer to a task definition? anything that would still need to be addressed? is this the task we're happy to tackle?

dmcennis commented 8 years ago

@ejhumphrey We are getting close... @julian-urbano what is 'returning the list'. Is this a scan of the back end, hand calculated items, or retrieval of the documents by some other method? The current design description has no knowledge of the annotation syntax or content so you can return a specific items or everything but nothing else. There is no data format. Is this analysis for 'returning the list' external or internal to bedev?

On Wed, Aug 31, 2016, 11:04 PM Eric J. Humphrey notifications@github.com wrote:

@julian-urbano https://github.com/julian-urbano to your question, 1a converges to 1b as the time window increases in length, no? so aren't both 1a and 1b classification?

In any event, the use case I keep coming back to is answering the question "what instruments are active in a given sound recording?"

With this in mind, I've been strongly advocating for 1a, on the order of 5-30 seconds. This follows an acoustic parallel to ImageNet 2012, where we care what occurs in an larger observation, but not necessarily where (also like recent iterations of ImageNet, I could see this being an extension in the future). There are then three (non-exclusive) ways to represent what instruments occur during a fixed time-length observation:

i. instrument classes are ranked in descending order of relevance, i.e. ranked retrieval ii. only the relevant instrument classes are returned, i.e. multi-class prediction iii. affinities / likelihoods are returned for all classes, i.e. regression, as we try to model some averaged human response like the Galaxy Zoo Challenge https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/details/evaluation

Again, I'm happy to draw inspiration from ImageNet2012's lead, which takes a hybrid of (i) and (ii), described here http://www.image-net.org/challenges/LSVRC/2012/#task, the tl;dr: being

For each image, algorithms will produce a list of at most 5 object categories in the descending order of confidence.

This raises two possible differences from the visual recognition space: one, visual scenes can have a lot going on, so there's no natural upper-bound on how many classes might occur; and two, there's no guarantee every object will be labeled in a scene.

Does this get us closer to a task definition? anything that would still need to be addressed? is this the task we're happy to tackle?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/omec/bedevere/issues/19#issuecomment-243963366, or mute the thread https://github.com/notifications/unsubscribe-auth/AFItP1w6HJY4AZt0lw-O9q1eACt28yCCks5qlkDFgaJpZM4JyLcH .

julian-urbano commented 8 years ago

@dmcennis we're talking about defining the MIR task we'll address. "Instrument classification" is too vague. We are not talking here about back end, database, format, syntax or anything like that. I'd like to ask you to delete your previous comments in this issue so that we keep it clean and focused. Thanks!

julian-urbano commented 8 years ago

1a converges to 1b as the time window increases in length, no? so aren't both 1a and 1b classification?

You could say that, but the tasks are fundamentally different, right? 1a could even be made in an online manner.

In any event, the use case I keep coming back to is answering the question "what instruments are active in a given sound recording?" With this in mind, I've been strongly advocating for 1a, on the order of 5-30 seconds.

So this is 1b, right? Like "there are guitar and piano somewhere in this piece". In that spirit, I'd go for your option ii (multi-class classification). In any case, this is completely compatible with 2a in terms of audio and annotations, so I think we could have both tasks. I see them as two completely different use cases with completely different objectives and performance measures.

The other options, where we would need annotations and predictions at the window level, could wait for another year, I agree. If the argument is "let's keep it simple for now and grow later", I'm 100% in. Just wondering if we could design things thinking about this coming up in the near future.

Now, another issue: could it be hard for people to distinguish all instruments? I'm thinking about regular songs where it's not evident how many are playing, and which ones. Let alone for regular folks without musical background.

markostam commented 8 years ago

@julian-urbano i'd be inclined to agree with @ejhumphrey; although all points you bring up are worthwhile for the longer term, 1a seems like a good place to focus our energy at least for the initial MVP. in the context of creating a dataset for use in instrument classification, taking small windows on a song vs. looking at an entire song has potential to benefit the dataset we eventually distribute by making it larger and more specific.

i've been doing some thinking about how an annotation would be presented to a potential user/annotator and some questions come up that might be worth addressing:

a. how is the annotator served the annotation window? is it a random sized window, in a random part, of a random song? if not, how do we prioritize each? perhaps focus on windows that have not been classified yet, or try to build consensus on parts that already do have annotations.

b. can the annotator change the size of the window they are initially served on the fly? e.g. there is a sax solo that ends halfway through their initially served window but they want to focus on given sax solo for their annotation. i would probably vote against this as it would add another technical hurdle or two but may it give us more semantically useful data in the end.

c. window size: would we benefit from using a normalized, nonoverlapping window size (e.g. 5 seconds), or is dynamically sized window ok.

ejhumphrey commented 8 years ago

Okay, it took a couple tries to get through my skull, but I think I understand the distinction you're getting at (call it an inherent DSP bias). If there's any difference, it's one of semantics:

1a: a windowed recording is a digitized signal that is finite and stationary over some observable duration. Recordings shorter than a desired observation length are treated in one of a few sub-optimal ways, e.g. zero-padding, symmetric repetition, periodic repetition, etc. No claims or assumptions are made as to what the signal represents, other than a digitization of some acoustic phenomena that occurred in the natural world. Others might refer to this as a "sliding window" view of the world.
1b: a recorded piece (or track) is a signal representing a complete, wholistic observation of a musical idea, and thus contains all information necessary to "understand" it. This is more "GTTM" in practice, and requires systems that are capable of processing content of arbitrary duration.

In this manner, I definitely mean 1a, though I still think classification applies to both of these formulations. I'm proposing that we sample one fixed length window per signal, and only collect instrument classes over those N-second observations (taking care to keep track of where these windows came from). This normalizes the time duration across all signals, which means both humans and machines are only ever responsible for what occurs in a small signal.

As to any concerns about instrument coverage in a track, the ImageNet problem formulation tries to address this (there's math, so I won't copy/paste - but do check it out if you haven't).

re @markostam's questions a. I think we'd pre-extract windows, or at least a single (start, end) time tuple for what each observation is. All windows would have the same duration, tracks less than this duration should probably be omitted. b. I agree that, no, annotators will not be able to change the size of the window in this iteration of the challenge. Like ImageNet, we can revisit this in the future. c. I'm thinking a fixed window size, somewhere on the order of 5 to 30 seconds, leaning toward the shorter side. Maybe 10. Too long and it's just going to be [guitar, voice, drums, bass].

justinsalamon commented 8 years ago

In case this helps, there are parallels to be drawn with polyphonic Sound Event Detection, where the task definition (and metrics) have been formalized quite nicely in this paper (see section about segment-based metrics).

Basically the idea is that you define a segment duration (and this duration essentially separates 1a from 1b): in each segment there are X instruments present and the algorithm returns a list of instruments Y, and by comparing the lists you can compute TP, FP and FN. Then you sum these for the entire recording to compute a global F1.

So, if the segment duration is e.g. 1s, you get task 1a. If you set the segment length to the entire recording, you get 1b (I think this is what @ejhumphrey was getting at).

By formulating it in this way, you can "consider" different use-cases, for example, if you just want to know which instruments are in a recording, you set the segment length to the recording duration. However, if you have a user who cares about what you'd probably refer to as "instrument activations", then you use a smaller segment duration. Personally I don't think a segment duration on the order of milliseconds makes much sense and would opt for something on the order of .5 or 1s, but that's another question.

Anyway, to make a long story short, my understanding is that @ejhumphrey was advocating for using segment==recording for the first iteration of this initiative, with the option of doing segment < recording in the future. I think this makes sense, but at the same time annotations on the recording level would be useful for 1b but not for 1a, which might mean wasted annotation effort? If we collected annotations that support 1a (i.e. ask people to annotate start/end times of each instrument), then these annotations would support both 1a and 1b. I also think there'd be people interested in both tasks (an algorithmic solution to 1a also solves 1b, but not vice-versa).

EDIT: annotations for 1a would also support varying segment durations, which could then be debated/experimented with post-hoc.

ejhumphrey commented 8 years ago

I'm advocating for drawing one 10 second clip from each track (preserving metadata), resulting in a collection of 200k, 10 second clips. We then annotate those clips as full observations.

justinsalamon commented 8 years ago

By full observation you mean clip-level binary (presence/absence) labels for each instrument, right? So this would be task 1b, where the clip/segment duration is 10s.

An alternative (which I'm not necessarily advocating for, but rather putting on the table) would be to collect annotations on a 1s granularity (i.e. divide the clip in to 10 1s segments and for each one have the annotator indicate the absence/presence of each instrument). This way we could run both 1a and 1b, although admittedly the annotation cost would be greater (though not necessarily x10 greater).

ejhumphrey commented 8 years ago

ha, man, reading is hard. I'm finally, finally getting it. yes @justinsalamon, I mean

task 1b, where the clip/segment duration is 10s.

I'd advocate that instrument tags are binary at the level of each annotator, rather than muddy the waters with confidence scores. However, we might / could fuse multiple annotations into a "softer" affinity vector.

Valid point re: merging the two, but let's stick to the simplest thing. Plenty of ideas for subsequent iterations.

markostam commented 8 years ago

I'm advocating for drawing one 10 second clip from each track (preserving metadata), resulting in a collection of 200k, 10 second clips. We then annotate those clips as full observations.

This implies we have access to 200k source tracks and take a random 10 sec clip from each, correct (just want to be clear)? Why not just segment each track into 10 second clips, eg 60 second track = 6 clips.

I suppose this may be a larger questions and another topic altogether: how much of each source track do we want as a clip, what time window do we want annotations at and what are the tradeoffs. @justinsalamon mentioned 10s/track clip and segment that into 1s segments. I think something along those lines seems reasonable. However, due to shifting song dynamics in different parts of a source track, maybe it would be useful to take 2-3 10s clips per track. We could then slice those clips into whatever size segments we wanted for annotator classification, maybe something between 1-5s.

Also think it's worth bringing MajorMiner to the group's attention. It's music tagging webapp that accomplishes some of what are setting out to do re: tagging music clips and providing positive reinforcement to annotators. I think its a useful reference point and informative to play around with.

I'd advocate that instrument tags are binary at the level of each annotator, rather than muddy the waters with confidence scores. However, we might / could fuse multiple annotations into a "softer" affinity vector.

+1 very much in agreement

ejhumphrey commented 8 years ago

oh wow, I didn't realize MajorMiner was still online. Definitely informative ... perhaps the major difference will be constraining the instrument taxonomy, i.e. presenting a space of options, rather than providing an open-ended text field. It also shows that visualization, while nice, is by no means necessary.

I've been using the "200k" figure as a ballpark from conversations at ISMIR. Not sure if that's what it'll end up being, waiting to hear what comes of conversations with Jamendo crew.

I see the merit to a sliding window within a clip, but I'd really like to avoid it as a simplification for this first iteration. IIRC, ImageNet didn't worry about localization at first, and I think it'd be prudent if we didn't either. For what it's worth, ImageNet consists of images on the average of 400 × 350 pixels, which are reasonably large when compared to MNIST (28x28) and CIFAR (32x32). Image dimensions are loosely similar to signal duration in audio --it's not really, but practically speaking, at least-- so a medium sized image is kind of like having a 30 second audio clip.

My internet digging into ImageNet has caused some other, slightly more tangential thoughts that I'll add to #16, but wanted to link to this discussion here.

markostam commented 8 years ago

perhaps the major difference will be constraining the instrument taxonomy, i.e. presenting a space of options, rather than providing an open-ended text field. It also shows that visualization, while nice, is by no means necessary.

yup, exactly, on both points.

I see the merit to a sliding window within a clip, but I'd really like to avoid it as a simplification for this first iteration.

fair enough, i'm sure there will be plenty of technical hurdles to clear. just thought it worth some discussion.

ejhumphrey commented 8 years ago

I think that's definitely on the long term roadmap. Hopefully the ISMIR community will really rock out the first year and we'll have to make it much harder for the next iteration. :o)

Regardless of when that happens though, it's great to have all of this on public record so we can jump right in. Keep it coming!

julian-urbano commented 8 years ago

Alright, we are witnessing yet another case of the different vocabulary used by DSP and IR people, plus my evident inability to communicate :weary: I'll try to make myself clear with some notation and examples.

1) The system is given as input a music audio file X. It has to detect the instruments that are played.

1a) The output is a list of the instruments being played every L seconds (say 1s), which would look like a time series:

  0s-1s: guitar
  1s-2s: guitar voice
...
15s-16s: guitar       piano

1b) The output is just a list of the instruments played anywhere in the input X.

guitar, voice, piano

2) The system is given an instrument I and it has to return a list of songs where I is played.

2a) Just return the list:

X_1, X_10, X_12, ...

2b) It returns a list of songs and, for each of them, a window where I is played:

X_1 (12s-32s), X_10 (23s-48s), X_12 (18s-183s)

Obviously, 1a and 2b go hand by hand, as 1b and 2a do (bad naming, I know). As @justinsalamon pointed out, mastering 1b would almost solve 2a, and having 1a would rule them all. The differences are basically in how complex the task is and how much information systems have to return.

Now, a whole different matter is what the Xs are: full tracks vs. clips. I see this basically as a matter of practicallity and assumpsions. If systems are fed full tracks, then they can use some heuristics in 1a and 2b. If we feed them 10s audio clips, the story changes. For now I'm ok with keeping it short and making inputs just 10s long. Which 10s? Randomly seems appropriate, though I'd round to the nearest second just for practical purposes. Also, I'm against sampling several clips from the same full track, as variability across tracks is much larger than within tracks, and that's what we should seek.

So, summing up, I think we want to run 1b using as X clips of 10s each, randomly chosen from the set of full tracks. In that case, we can also run 2a because we'll need basically the same annotations and data, but the task is completely different. I don't know if people would be interested; I definitely am.

As for the points raised by @markostam, I pretty much agree with @ejhumphrey. Also, keep in mind that the dataset used by participants will contain only the inputs X, not the full songs. These are the exact same pieces that annotators will receive. Finally, most of the annotations will be gathered post hoc, after systems submit their predictions.

Sounds better?

ejhumphrey commented 8 years ago

haha communication is a two-way street.

but yes, much better! I agree with it all, though now I have questions. I have an intuition how this all works for 1b, but what does this look like for 2a, and incremental eval and the like? and how much simpler is it to focus on only one of these?

julian-urbano commented 8 years ago

I have the feeling that the two tasks are so different that this will be something fun to look at in terms of evaluation. In principle, there'll be a criterion to select examples for annotation in 1b and another one in 2a, and (at least for me) it's very interesting to see how much they agree with each other and how they could be combined. Research problem here.

Focus on just one? In terms of users, doing 1b leads to immediate implementation of 2a. If they don't wanna work much on 2a, that's fine, but submitting to it is effortless. On the dev side, it only requires a parallel instance of the submission system (one for 1b and one for 2a), and the bit of eval code to compute numbers. I don't think there's anything else, right? In terms of annotation effort, we could still focus on 1b and compute estimates for 2a. Worst case scenario is that variance will be larger than could be if focusing a bit of annotation effort on 2a as well.

In general, I think it's something we can manage. If at some point we have to favor one over the other, no problem.

ejhumphrey commented 8 years ago

this all sounds great, I think we've made some good progress here.

@julian-urbano you wanna take a first crack at consolidating this into a coherent, readable Google Doc (to maybe 60%)? Then circulate among the team for feedback, more input, and all the rest? We'll eventually need a page like this one describing the challenge in sufficient detail, and now's as good a time to start as any.

julian-urbano commented 8 years ago

will do (gdoc sent to the open-mic-dev list, right?). Yep, the idea is, at some point, create the github pages site for OpenMIC with all the info participants would need, much like what you linked. Once the task definition is fixed (hopefully by next week), we can start talking about how annotations will actually be, along with metric definitions.

ejhumphrey commented 8 years ago

it's happening

julian-urbano commented 8 years ago

I started today a more formal document about all this, and got thinking about annotations and what users would really want from instrument detection systems, particularly in the retrieval task (2a). I keep picturing an amateur guitar (or any other instrument) player who wants tracks to practice. In that case, a user would want tracks where only guitar is present, and present throughout the whole piece. Alternatively, I could be looking for accompaniment tracks, where guitar is actually not present, or not much. Would it make sense to look for these things? What about voice?

In terms of annotations:

whether the instrument is present throughout the whole thing or not is something that, regardless of this particular topic, would be very interesting to collect, thinking about future extensions of the task to 1a and 2b. It could even be a binary thing: present all the time or not.
whether the instrument is the only one playing or not can be readily derived from the annotations.

What do you guys think?

julian-urbano commented 8 years ago

Maybe even a scale like this:

3 points: the only instrument present, and through the whole thing.
2 points: not the only one present, but through the whole piece, or the only one but not through the whole piece.
1 points: not the only one present and not through the whole piece.
0 points: not present.

justinsalamon commented 8 years ago

I can see why a user (e.g. guitar player) would want tracks where the instrument is present most of the piece (the whole piece would never really happen in practice, but one could define a minimum percentage of presence).

As for it being the only instrument, realistically that would only happen in solo pieces or isolated tracks taken from a multitrack recording (not part of our intended dataset). While I can see how this would be a popular use case, I don't think it would be more popular than having guitar + other instruments (as a student when I transcribed jazz guitar solos I did it from the complete polyphonic recordings).

Regarding annotations, the only way to support this scenario is to have annotations with start/end times (rather than presence/absence). I think everyone agrees these annotations would be more useful in the long run, but would also require a greater annotation effort.

I know @ejhumphrey (and others!) advocated for keeping things simple for the first iteration, with which I agree. However, I think it's worth consideration - there's a lot of dev-effort going into this initiative right now, and it might be worth taking advantage of this effort to build out the system to support the annotations we hope to have in the long run, not shot term. As we all know, unfortunately once you create a dataset everyone will use it to death, and it might be worth starting off with a dataset that addresses the "real" problem one would like to solve, and not a simplified version. That said, track-level tags can be useful for a different use-case, which is recommender systems.

Finally, I think it's also worth noting that the annotations will define the type of machinery people will have to build - in particular, assuming many submissions will be of the deep-net variety, track-level annotations = weak labels, which is an additional hurdle submissions would have do address, compared to labels with start/end times which would be trivial to work with.

julian-urbano commented 8 years ago

As for it being the only instrument, realistically that would only happen in solo pieces or isolated tracks taken from a multitrack recording (not part of our intended dataset).

Could be. What about selecting one as the main instrument? That could be problematic too, I guess.

Regarding annotations, the only way to support this scenario is to have annotations with start/end times (rather than presence/absence). I think everyone agrees these annotations would be more useful in the long run, but would also require a greater annotation effort.

For the long run of the task, yes, but for now I was just thinking about something in the lines of two options: i) present the whole time, ii) not. Maybe even 4 options: i) whole time, ii) only beginning, iii) only ending, iv) intermittently. That simple label would allow us to refine much more the retrieval task, and would help a lot in the future if we plan on getting (start,end) annotations, as it would allow us to pre-filter examples.

All this will affect #8 and #20.

dmcennis commented 8 years ago

May I suggest 2 strains on this:

What will the task(s) be this year?
How can the design be constructed (top to bottom) to accommodate ALL future versions of the task? Many of these require augmenting back ends and annotation tools now or in the future.

Daniel.

On Wed, Aug 31, 2016 at 7:47 PM Julián Urbano notifications@github.com wrote:

Let's discuss here a definition of the task and use case as precise as possible. The idea is simple, but it can get complicated once we get into it. As I see it, much of the details we'll have regarding data, annotations, metrics, taxonomy and so on, will be restricted by this.

I see two main use cases, and another two within each:

1) Given a piece of music, identify the instruments that are played in it: 1a) For each window of X milliseconds, return a list of instruments being played (extraction). 1b) Simply return the instruments being played anywhere in the music piece (classification). 2) Retrieve a list of music pieces in which a given instrument is played. 2a) Just return the list (retrieval). 2b) Return the list, but for each music piece also provide a clip (start-end) where the instrument is played (question answering-like).

We should agree on what task we are talking about here. My impression is that 1a and 2b make the most sense, but let's discuss.

For cases 1) it seems alright to feed systems with audio clips, but for cases 2) I think full tracks are more appropriate. Also, for 2) I think the user might be someone who is learning to play an instrument and wants samples to practice. Just an idea.

In terms of performance measures and annotations, I think we can for now go with the typical stuff used in extraction, classification, retrieval and question answering, as I indicated. I'd like to keep this thread just about the task definition and use cases, and later one we'll discuss about metrics, audio, etc.

Comments?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/omec/bedevere/issues/19, or mute the thread https://github.com/notifications/unsubscribe-auth/AFItP6D69N2NWLkdjIi1UcVIRLqgDGvJks5qlhKGgaJpZM4JyLcH .

franrodalg commented 8 years ago

TL;DR: Adopting an incremental annotation strategy on potentially biased data (as a collection obtained from Jamendo might end being) might have unexpected consequences on the task we pretend to evaluate systems on. Comments/criticisms are very welcome!

The unambiguous definition of the MIR task should be, in my opinion, the first and most important step when designing any system evaluation. I feel, however, that many people around here consider the task to be merely "what we ask humans to annotate in the collection", which I strongly disagree. The task is what we ask the systems to perform, and what we will base our judgements on. It is essential to make sure the task definition is as accurate as possible so that we can extract valid conclusions from the results we obtain. In this sense, here I include some reflections about issues to consider if we want to reach an accurate definition (more 2 bucks than 2 cents, probably... sorry!).

I hope we all agree that we cannot feasibly evaluate a system for its ability to identify any possible instrument in a recording, specially if we deal with multiple instruments playing simultaneously. The list of possible classes and combinations would be too large to manage in a realistic scenario, both from the point of view of the system developer and the test collection curator (if any). This seems to require a taxonomy (as it is currently being discussed in #2), or another explicit delimitation of the classes, and combinations of classes, under which the systems will be evaluated. I think, then, that the task definition must unambiguously refer to the actual taxonomy/class list considered. In other words, it is different to evaluate a system with respect to its ability to:

Given a piece of music, identify the instruments that are played in

(option 1 of the original post by @julian-urbano) or to

Given a music audio recording, identify which instruments from a particular taxonomy/list appear.

Admittedly, the distinction is quite subtle and may not have strong implications in the practical issues most people around here seem to be interested in, but I think it is essential to avoid extracting too bold conclusions from the experimental results.

Even if we ignored the distinction in the task definition, even if we assumed that the results obtained by a system in the implicitly limited task would easily generalise to any other non-considered case, the taxonomy/list would still determine how do we estimate success. In a normal case, we would then build a collection adapted to such taxonomy/list. Alternatively, we could first gather the collection and then fix the set of possible classes to the contents of the collection. This is basically the usual procedure in our field, even though the generalisability of the conclusions we might extract in this manner seems even more debatable. In OpenMIC, however, we find ourselves in a third scenario: the collection and the classes are built independently of each other. While this approach might have a certain appeal with respect to the generalisability of the results, I think raises some questions that deserve proper discussion. In particular, this comment by @ejhumphrey in this same issue:

I'm thinking a fixed window size, somewhere on the order of 5 to 30 seconds, leaning toward the shorter side. Maybe 10. Too long and it's just going to be [guitar, voice, drums, bass]

made me realise that:

the Jamendo collection will (probably) be biased towards certain styles/genres; and
adopting an incremental annotation approach implies we will not know the contents of the collection until after we employ its recordings.

Does this even matter for the task definition? In my opinion, yes, it does. I think we cannot claim a system is identifying instruments, even if we restrict them to a subset of all possible classes, if we ignore the cases we can actually test. If, as @ejhumphrey suggested, most recordings are [guitar, voice, drums, bass], or a subset of them in certain time frames, can we conclude anything about the ability of the systems to recognise violin or french horn, even supposing these instruments appear on the taxonomy?

Thinking about an example outside our field might help understand better what I mean. If we asked a system to diagnose diseases in patients, and the set of possible classes was [HIV, Cholera, Flu], which conclusions could we extract if no patients we happen to have available suffered from HIV or Cholera? Would we say that the systems are diagnosing diseases (in general) or even that they are diagnosing [HIV, Cholera, Flu]? Or, in our case, is predicting [guitar, voice, drums, bass] enough proof instrument identification? (Notice I am not even talking about horses, here!)

This might look like simple overthinking for most people around here, but I really believe we should reflect carefully about it. From a more practical point of view, I think the points mentioned before have a very important implication. In the paper it is mentioned that the plan involves annotating and releasing a development set, so I assume we expect the submitted systems to employ a usual supervised learning/train-test approach. But, how do we know that the development data contains all possible classes that might appear during testing? In other words, with the proposed strategy it seems likely that the training data will only/mostly contain [guitar, voice, drums, bass], but we will be completely blind to the instruments that might be included in the testing data. What is the task, then? It seems to me that the task is limited to something like:

Given a music audio recording, identify which instruments appear among those included in the development set

which I admit is a quite ugly definition.

Unfortunately, it can become even uglier. In theory, anything that we keep fixed in the "experiments" should also be considered part of the task definition. I am thinking about the clip length, as mentioned above. But also in the file format, as discussed in #17. Why? Think about the disease diagnostic example I introduced before. If all patients are males, for instance, we cannot extrapolate the results we obtain to females unless we prove there is no dependency between the method and the gender. Or, in our case, dependency between method and characteristics of the actual recordings employed. In the general case, I suspect this proof is beyond our capabilities. So the task definition would look like something similar to:

Given a N seconds-long clip from a music audio recording in X file format (at sampling frequency F), identify which instruments appear among those included in the development set.

(we could alternatively generate a variety of file format versions for each recording and select at random the one to employ if we want to attempt to obtain a more generalisable result).

Finally, I agree with @julian-urbano that, even though instrument identification is one of the few tasks in our field that might have an objective ground truth (an instrument is played or is it not), annotators might be far from infallible. Does the potential ambiguity of the annotations affect the task? Are we doomed to an horrific task definition?

Given a N seconds-long clip from a music audio recording in X file format (at sampling frequency F), predict which instruments annotators identify among those included in the development set.

Sorry for the annoyingly long comment! What do you guys think? Any ideas are highly welcome!

dmcennis commented 8 years ago

I've underlined a couple sentences - the same was said of polyphonic transcription in my youth in the 90's, but now is a MIREX track. Flexibility in the road-map to morph as capabilities improve is critical.

Please refer to Kris West's excellent doctoral thesis on robustness of audio features to noise for an overview of file format impacts. It only matters for some features. Should file formats be randomized to maximize human like results? Should the system reward algorithms that exploit non-human features with superior 'performance'?

These questions are the difference between an annotation agnostic framework and a maintenance of annotation formats in the open-mic system to accommodate these future use cases.

Daniel McEnnis.

On Mon, Sep 12, 2016 at 11:23 AM, Francisco Rodríguez-Algarra < notifications@github.com> wrote:

TL;DR: Adopting an incremental annotation strategy on potentially biased data (as a collection obtained from Jamendo might end being) might have unexpected consequences on the task we pretend to evaluate systems on. Comments/criticisms are very welcome!

The unambiguous definition of the MIR task should be, in my opinion, the first and most important step when designing any system evaluation. I feel, however, that many people around here consider the task to be merely "what we ask humans to annotate in the collection", which I strongly disagree. The task is what we ask the systems to perform, and what we will base our judgements on. It is essential to make sure the task definition is as accurate as possible so that we can extract valid conclusions from the results we obtain. In this sense, here I include some reflections about issues to consider if we want to reach an accurate definition (more 2 bucks than 2 cents, probably... sorry!).

I hope we all agree that we cannot feasibly evaluate a system for its ability to identify any possible instrument in a recording, specially if we deal with multiple instruments playing simultaneously. The list of possible classes and combinations would be too large to manage in a realistic scenario, both from the point of view of the system developer and the test collection curator (if any). This seems to require a taxonomy (as it is currently being discussed in #2 https://github.com/cosmir/open-mic/issues/2), or another explicit delimitation of the classes, and combinations of classes, under which the systems will be evaluated. I think, then, that the task definition must unambiguously refer to the actual taxonomy/class list considered. In other words, it is different to evaluate a system with respect to its ability to:

Given a piece of music, identify the instruments that are played in

(option 1 of the original post by @julian-urbano https://github.com/julian-urbano) or to

Given a music audio recording, identify which instruments from a particular taxonomy/list appear.

Admittedly, the distinction is quite subtle and may not have strong implications in the practical issues most people around here seem to be interested in, but I think it is essential avoid extracting too bold conclusions from the experimental results.

Even if we ignored the distinction in the task definition, even if we assumed that the results obtained by a system in the implicitly limited task would easily generalise to any other non-considered case, the taxonomy/list would still determine how do we estimate success. In a normal case, we would then build a collection adapted to such taxonomy/list. Alternatively, we could first gather the collection and then fix the set of possible classes to the contents of the collection. This is basically the usual procedure in our field, even though the generalisability of the conclusions we might extract in this manner seems even more debatable. In OpenMIC, however, we find ourselves in a third scenario: the collection and the classes are built independently of each other. While this approach might have a certain appeal with respect to the generalisability of the results, I think raises some questions that deserve proper discussion. In particular, this comment by @ejhumphrey https://github.com/ejhumphrey in this same issue:

I'm thinking a fixed window size, somewhere on the order of 5 to 30 seconds, leaning toward the shorter side. Maybe 10. Too long and it's just going to be [guitar, voice, drums, bass]

made me realise that:

the Jamendo collection will (probably) be biased towards certain styles/genres; and

adopting an incremental annotation approach implies we will not know the contents of the collection until after we employ its recordings.

Does this even matter for the task definition? In my opinion, yes, it does. I think we cannot claim a system is identifying instruments, even if we restrict them to a subset of all possible classes, if we ignore the cases we can actually test. If, as @ejhumphrey https://github.com/ejhumphrey suggested, most recordings are [guitar, voice, drums, bass], or a subset of them in certain time frames, can we conclude anything about the ability of the systems to recognise violin or french horn, even supposing these instruments appear on the taxonomy?

Thinking about an example outside our field might help understand better what I mean. If we asked a system to diagnose diseases in patients, and the set of possible classes was [HIV, Cholera, Flu], which conclusions could we extract if no patients we happen to have available suffered from HIV or Cholera? Would we say that the systems are diagnosing diseases (in general) or even that they are diagnosing [HIV, Cholera, Flu]? Or, in our case, is predicting [guitar, voice, drums, bass] enough proof instrument identification? (Notice I am not even talking about horses, here!)

This might look like simple overthinking for most people around here, but I really believe we should reflect carefully about it. From a more practical point of view, I think the points mentioned before have a very important implication. In the paper it is mentioned that the plan involves annotating and releasing a development set, so I assume we expect the submitted systems to employ a usual supervised learning/train-test approach. But, how do we know that the development data contains all possible classes that might appear during testing? In other words, with the proposed strategy it seems likely that the training data will only/mostly contain [guitar, voice, drums, bass], but we will be completely blind to the instruments that might be included in the testing data. What is the task, then? It seems to me that the task is limited to something like:

Given a music audio recording, identify which instruments appear among those included in the development data

which I admit is a quite ugly definition.

Unfortunately, it can become even uglier. In theory, anything that we keep fixed in the "experiments" should also be considered part of the task definition. I am thinking about the clip length, as mentioned above. But also in the file format, as discussed in #17 https://github.com/cosmir/open-mic/issues/17. Why? Think about the disease diagnostic example I introduced before. If all patients are males, for instance, we cannot extrapolate the results we obtain to females unless we prove there is no dependency between the method and the gender. Or, in our case, dependency between method and characteristics of the actual recordings employed. In the general case, I suspect this proof is beyond our capabilities. So the task definition would look like something similar to:

Given a N seconds-long clip from a music audio recording in X file format (at sampling frequency F), identify which instruments appear among those included in the development set.

(we could alternatively generate a variety of file format versions for each recording and select at random the one to employ if we want to attempt to obtain a more generalisable result).

Finally, I agree with @julian-urbano https://github.com/julian-urbano that, even though instrument identification is one of the few tasks in our field that might have an objective ground truth (an instrument is played or is it not), annotators might be far from infallible. Does the potential ambiguity of the annotations affect the task? Are we doomed to an horrific task definition?

Given a N seconds-long clip from a music audio recording in X file format (at sampling frequency F), predict which instruments annotators identify among those included in the development set.

Sorry for the annoyingly long comment! What do you guys think? Any ideas are highly welcome!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cosmir/open-mic/issues/19#issuecomment-246383917, or mute the thread https://github.com/notifications/unsubscribe-auth/AFItP8TQR6pn-2bppWbIRYsju9b0BL12ks5qpW5xgaJpZM4JyLcH .

franrodalg commented 8 years ago

Thanks for your comments @dmcennis. I am afraid I will need further clarifications in order to make sure I understand your points, though.

I've underlined a couple sentences - the same was said of polyphonic transcription in my youth in the 90's, but now is a MIREX track.

If you are referring to particular sentences in my previous comment, I am not able to know which ones unless you quote them explicitly in your own comment. In any case, I fail to see why polyphonic transcription becoming a MIREX track should be relevant for our discussion here.

Flexibility in the road-map to morph as capabilities improve is critical.

Which capabilities are you referring to? Of the systems? Of the evaluation? (What I call system is, I think, what you call "algorithm". I tend to avoid employing the term "algorithm" in this context as we rarely know which are the "instructions" that machine learning models "execute" to predict labels.)

Please refer to Kris West's excellent doctoral thesis on robustness of audio features to noise for an overview of file format impacts. It only matters for some features.

I am aware that there have been studies at this respect. @julian-urbano, for instance, published a paper on the topic in ISMIR a couple of years ago. In any case, we should keep in mind that we intend to evaluate systems on their ability to address the task we pose regardless of the method they employ. What I mean is that there is no restriction on which audio features those systems might extract. They could take advantage of those that we have previously tested for robustness, or they might propose new untested features. And one might be surprised about what systems employing low-level signal processing techniques plus machine learning algorithms might be able to exploit from the audio recordings. For a recent example, see here. [I might actually check what happens with these features when the audio is converted to mp3. I suspect we would see substantially large changes in performance.] And we should not forget about feature learning approaches with deep networks that are becoming so popular in recent years. What might these features capture from the audio? Is that independent of the codification? Who knows.

Anyway, control for the file format is far from the most important of the concerns I expressed in my previous comments, in my opinion.

Should file formats be randomized to maximize human like results? Should the system reward algorithms that exploit non-human features with superior 'performance'?

I personally think that instrument identification is one of the very few music classification "tasks" that does not require "human-like" results. An instrument is played or not, regardless of human opinion. In this sense, submitted systems should not be limited to "human-like features". Randomisation here is not meant to control for this kind of situations. As I said, I am not considering horses (yet). I proposed randomisation to increase the external validity of the results, their generalisability, regardless of how are they obtained. In other words, to attempt to reduce the dependency of the conclusions to the particular implementation of the collection.

These questions are the difference between an annotation agnostic framework and a maintenance of annotation formats in the open-mic system to accommodate these future use cases.

I think we do not understand the concept of use cases in the same way. For me a use case is the description of the information needs of a particular user of the system to be evaluated. For example, the guitar player searching for tracks to practice that was mentioned in previous comments. What do you mean by use case here?

dmcennis commented 8 years ago

I dislike the words 'can not' and 'will not' in a design document. They are usually placeholders for places where the design will fail and collapse if the assumption is violated. Your confusion about use cases are perplexing - we have 4 use cases in the problem description, whether we future proof designs to accommodate all 4 or not.

ejhumphrey commented 8 years ago

@franrodalg Definitely lots to chew on here. Luckily I've been traveling and have had a bunch of time to mull this over. I think perhaps the crux of your comment / concern is found here:

The unambiguous definition of the MIR task should be, in my opinion, the first and most important step when designing any system evaluation. I feel, however, that many people around here consider the task to be merely "what we ask humans to annotate in the collection", which I strongly disagree.

I'd contend that the very topic of machine perception (audition, vision) is --for all / any of its shortcomings-- based on the premise of behavioral intelligence: if it sees like a human and hears like a human, then sure, it's human-level perception [alternatively, the unfortunately named thought experiment from Searle in the 1980s]. The question is, and has always been, "does a proposed system model the behavior of our target system (expert)?" This framework proceeds by collecting past behaviors of the target, conditioned on some input stimuli, and measures how well this behavior is reproduced. To be clear, the majority of tasks we've ever actually pursued are those of the kind that model annotators because it's more tractable than trying to define the concepts directly (genre, similarity, emotion, etc).

<sidenote> In situations where the relationship between stimuli and behavior is 1:1 (a function) and stateless (time-invariant), this formulation works out pretty well (see handwritten digit recognition, image labeling, speech recognition, etc) and less so when it doesn't, because it's difficult (impossible?) to capture the latent state of an annotator corresponding to an observed behavior. </sidenote>

Importantly, in this behavioral approach, the primary concern is that a system behaves a certain way, not that it understands why it does. While an interesting question (the most interesting?), it's a higher level abstraction and one that we don't need to tackle yet.

Thus, the primary (only?) concern with building "dumb" machines that lack a higher level of understanding is that they tend to extrapolate (generalize) quite poorly because they know not why they do what they do. This should not be interpreted as a limitation, but rather as a greater burden on the evaluation methodology to make sure that the model isn't especially fragile (sensitive to trivial amounts of noise, perceptual codecs, changes in volume, etc). Said differently, the problem isn't necessarily problem definition but test design (like unittests, interviews, and so forth), and making sure that the gamut explored in an exam is fair given the problem definition, i.e. no calculus on a history test. For annotators, this is implicit; for computational models, however, it must be made more explicit.

That said, two other things:

But, how do we know that the development data contains all possible classes that might appear during testing?

Because we will build it to be so.

Are we doomed to an [sic] horrific task definition?

No, I actually think your "horrible" definition got quite close, with minor edits :o)

Given an N seconds-long music audio signal, predict which instruments annotators identify among those defined by the global taxonomy.

Perhaps the piece that's missing then is, given this definition, what is the fair distribution of signal characteristics that should appear in the development / test sets, and how should such a test be designed to evaluate it?

julian-urbano commented 8 years ago

I hope I'm not too late for the party. Thank you all for the comments!

Let me begin with a few premises that might not be clear yet (notice the "at least for now"):

Our main goal, at least for now, is not to evaluate systems for some task, nor is it to come up with an infalible methodology to evaluate systems so that they don't fool us. The main goal is to generate useful data.
This means that, at least for now, nobody should be considered the winner, nor should any system be considered to have solved the task.
We also want to be as open as it gets, especially with publicly accessible audio data.
As @ejhumphrey said, one thing is what your task is, and another one is how you design the experiment to evaluate systems for that task. This involves a lot of assumptions.
One problem at a time. This thing will evolve.

(feel free to disagree with the above)

Can someone claim that their system learned to identify instruments? No. Will we say it? No. Can we prevent them from saying so? No. Will we (COSMIR) have failed then? No. Do we even care? Not now.

@franrodalg's concern about bias is well justified. I'm concerned about it too. However, I think it's something that we should be able to avoid to a large degree:

Train/test: if it turns out that some instrument from the taxonomy does not appear in train, we simply remove it. If in test we find an instrument we didn't have in the taxonomy, we use an other instrument class and we'll take care of it later. Remember that the goal is not evaluation at this point, but data generation. Also, keep in mind, that this thing will evolve.
I don't really think Jamendo is that biased, but in any case, remember that we will not evaluate on the whole Jamendo corpus. The test set will not be defined upfront. Rather, we'll use our models from incremental evaluation to decide what to annotate and what should be part of the set. This means we can aim for things like uniform distribution of genres or, what the hell, even instruments. Same goes for the train set, except that our estimates will be much worse at the beginning. Remember that this thing will grow (and improve). So will it all be voice, guitar and drums? Not if we don't want to. Also remember that we want to use publicly accessible audio, which highly restricts us.

As for the file format issue, I think it's irrelevant at this point. Why? Because we don't want to evaluate the ability of systems to identify instruments. For now we want to generate data of the form song-instruments. The dataset will refer to the song object, not to the audio file object, so to speak. Once we confirm the sustainability of the COSMIR approach, we'll take care of evaluation intricacies (yes, I said it! :open_mouth:)

Finally, a recurrent issue is how the task definition incorporates the details of the evaluation. The task is:

Identify the instruments that are played in this music piece.

The existence of previous knowledge about what instruments we should identify is evident, as would be with humans. Does the taxonomy depend on the data? Of course, but we are limited to certain corpora because we want to be able to distribute it freely. Our assumption is that it will be representative of all music. Is it? I don't know. We will use something like 10s clips from each Jamendo track, because of practical reasons. Does performance on the clip reflect performance on the track? We assume it does. Could systems use additional features if fed with full tracks? Probably.

Where I'm trying to get here is that we'll have to compromise in many points, all of which affect the evaluation results and the validity of our conclusions. But I'll not be the one claiming that some system learned to identify instruments based on what we will do this first year. Nobody should. Maybe in a few years time we come up with the definitive evaluation methodology to be able to claim that. I hope the data we generate in OpenMIC will help us get there. That's (one of) my goal(s).

franrodalg commented 8 years ago

Hi guys and thank you very much for your comments. I really appreciate them. Sorry about the delay in replying, but this last week has been quite crazy. Hope I can still contribute to the discussion.

I am not really sure if I understand @ejhumphrey's points, though. In my previous comments I tried to make clear that (here) I will not fight for an evaluation that attempts to determine why a system performs in a particular way. In other words, I was not trying to defend "human-like" approaches as the target of this challenge. In fact, as I said before, I honestly believe that instrument identification is one of the few tasks in our field in which machines should be allowed to exploit (almost) any possible cue, even those we cannot perceive. Unlike genre, or emotion, which are inherently human constructs (so I find difficult to accept "non-human" approaches in those problems), instrument presence is essentially objective: an instrument is present or not in a particular instant of a particular recording. So I do not know exactly why @ejhumphrey felt that the core of my argument involved tackling these kinds of issues.

What I tried to express is that task definition and test design are instrinsically bound. What we can, and cannot, test will affect what we can ask the machines to do. @julian-urbano claimed that:

I'll not be the one claiming that some system learned to identify instruments based on what we will do this first year. Nobody should.

And I couldn't agree more. But I am pretty sure that some (many) will do. And, unlike @julian-urbano, I think we should care. Even if it is at the level of @ejhumphrey's edited task definition, I strongly believe that we should make sure everyone is aware of what the systems are really tested in. I know "identifying instruments" sounds more appealing than my "horrible" definitions. But until we have the chance of ensuring some external validity in the evaluation methodology, I still think we should be cautious about stating potentially misleading goals, and make sure the assumptions are explicit.

I will address/expand some other points tomorrow, but I think these two comments indicate that we should reflect carefully about how the incremental annotation process will work and whether/how it will affect the task we intend to design an evaluation for:

But, how do we know that the development data contains all possible classes that might appear during testing? Because we will build it to be so.

Train/test: if it turns out that some instrument from the taxonomy does not appear in train, we simply remove it. If in test we find an instrument we didn't have in the taxonomy, we use an other instrument class and we'll take care of it later.

(I am still overthinking, right?)

franrodalg commented 8 years ago

Sorry guys, closed by accident :$ Reopening!

boblsturm commented 8 years ago

Dear all. I am coming out of lurk mode now. This is fascinating reading, and I applaud the effort here! I would like to contribute three comments.

"Instrument recognition" makes less sense as the time window decreases. So, I think specifying a "music recording unverse" of 10 second music excerpts is just fine. (A user who desires precision at, say, half a second, might be naive of how some musical instruments work. However, such sub-second "success criteria" can be reasonable if the "music universe" involves idiophones. Such a system would also be useful for transcription.)
I don't see any task here that motivates unsupervised approaches -- a hot topic today because so much has yet to be done. An example is this recent challenge: http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/ Such tasks could be counting the number of instruments heard in a given excerpt, clustering excerpts that feature the same instruments, finding representative clips of instruments, etc.
"drums". Is it one instrument or many? Snare is different from cymbal. Open hi-hat is different from closed hi-hat. Chimes and a lion roar are in magical worlds of their own. That brings up problems with "guitar". Acoustic, electric, with effects... what about feedback? Should feedback be labeled "guitar"? And then there's the problem with synths, those troublesome music imposters.

Thanks!

ejhumphrey commented 8 years ago

re: @boblsturm's comments

Fully agree.
For our purposes of managing / overseeing this campaign, I don't think we need to concern ourselves with supervised / unsupervised anything.
Definitely an important conversation, which I'd redirect to #2.

ejhumphrey commented 8 years ago

@julian-urbano has started putting together task definitions for classification and retrieval

There are two ways I could see collaboration happening at this point:

migrating both to google docs for fast collaboration / iteration, and migrate changes back
pull requests / comments directly on the files that checked into the repository (linked above).

thoughts?

julian-urbano commented 8 years ago

Everybody hold on a second!

I actually had a gdoc for all this, but stopped for several reasons regarding taxonomy. I just put the website online to have something to show this week.

The idea is to have the gdoc around and collaborate there. Better if I keep changing the website myself for the time being.

Just give me a couple days to finish the document and I'll ping you all.

ejhumphrey commented 8 years ago

😞 okay I will wait .... godspeed!