Notes on sampling audioset for instrument model building

bmcfee commented 6 years ago

As of now, the plan is to develop a multi-label instrument classifier (k=23 plus observations with no instruments) on audioset, and deploy it over FMA. Since we don't really have a great handle on the class distributions in either set, we should think a bit about sampling strategies for training and evaluation.

Issue 1: label noise in audioset

As we've noted elsewhere, there's some non-trivial label noise in audio set. This comes in two basic flavors:

Examples where the label covers only a small fraction of the clip (eg beginning or end)
Examples where the label is wrong / missing

For the first case, we might want to think about heuristics to eliminate videos where this seems likely.

There's not a whole lot we can do right now for the second case.

Issue 2: class imbalance

The data is necessarily imbalanced, so we should think about sampling strategies to combat this.

One idea is to use pescador to make a streamer for each class, with the understanding that some examples will be generated by multiple streamers. We can then shuffle the class-conditional streamers according to their empirical frequency.

Alternately, we can shuffle uniformly (balanced sampling) but use importance weights to simulate over-/under-sampling in expectation. Keras's fit_generator supports this through the sample_weights (third tuple entry) field.

Or we can just stream random examples without stratified sampling. This might have a lot of problems for rare classes though.

Issue 3: negative examples

Because FMA has a non-trivial amount of non-musical content, we should include examples with no instruments as negative examples. Right now the proposal is to sample these randomly from the negative examples in audioset, and draw a set of size approximating the average observation count across all instrument categories.

ejhumphrey commented 6 years ago

possible correction, "the plan is to develop a multi-instrument regressor (k=23) on audioset", right?

Regarding partial label coverage ... I was thinking about applying windowing (fade in / out) to clips when validating with humans, perhaps we can omit the first and last second (i.e. frame) from audioset to the same effect?

bmcfee commented 6 years ago

+1 on fading / center-cropping clips for humans.

I think we also might want to look at detecting clips where there's large divergence at the boundary frames. Something like the following for X.shape = (10, 128):

m_mid = np.mean(X[1:-1], axis=0)
v_mid = np.sum(np.var(X[1:-1], axis=0))
v_edge = np.max( np.sum((X[0] - m_mid)**2), np.sum((X[-1] - m_mid)**2))

edginess = v_edge / v_mid

The edginess score should spike if we have larger than expected deviation from the mean at one of the boundaries. If we drop the top percentile of tracks according to this score, I think we'd lose a lot of the problem cases. I'll investigate this once we have the data processed into a python-friendly format.

bmcfee commented 6 years ago

Back to point 2: we could also use balanced sampling and just set the class_weight field in keras.fit_generator according to their empirical frequencies. Then we don't have to deal with threading importance weights through the streamers.

bmcfee commented 6 years ago

Also discussed making the weighting a hyper-parameter: w[i] = exp(alpha * p[i]) for alpha in [0, 1] and class frequencies p[i]. This would let us smoothly interpolate between different sampling strategies.

ejhumphrey commented 6 years ago

update on (3): so I thought I finished this, but a qualitative sniff test revealed my faulty logic ... of course it wouldn't be that easy.

First, hopefully less debatable: how many negative examples to draw. I've concluded that, to allow for multi-frame classifiers, we need to sample full excerpts as negative examples, e.g. draw a non-instrument subset at the video ID level and take all time-varying features. The average video count per class is ≈8600 ... but if you consider the distribution here, "none" would be the fifth most represented class. For reference, the median of the class counts is ≈4600. Somewhere between these two is probably fine.

Second, and more troublesome: I randomly sampled 10 videos with classes not in the OpenMIC-23 and manually inspected them as a sanity check ... only to find that 6/10 had the general "music" tag, some of which do have OpenMIC instruments in them. If we treat AudioSet as a kind of weakly labeled collection, we might need to discard data that might contain our instrument classes.

I've given AudioSet taxonomy a once over, and come up with the following set of labels that aren't suitable for "no-instruments":

137, "Music"
the genres from 216:282, e.g. Electronic Music, Blues, Angry Music

Keep in mind, this would only occur in the case where a video had a genre tag, but no OpenMIC-relevant ones. In the latter case, these are already in the OpenMIC subset. Some quick stats:

2,041,789 video IDs in the unbalanced training set
198,336 video IDs that map to OpenMIC-23 instruments
In the remaining, 1,841,243 IDs, 875,129 meet the above criteria (generally tagged as some kind of "music", without relevant instrument tags), or ≈47% of the data.

Tough Questions:

Is it sane to filter all music from the null case?
What other options do we have?
@bmcfee, where is the mapping from AudioSet labels to OpenMIC 23 classes? I couldn't quickly find that anywhere?

bmcfee commented 6 years ago

If we treat AudioSet as a kind of weakly labeled collection, we might need to discard data that might contain our instrument classes.

Yeah, I support that. Consider it "strong negative" mining. The general problem here is that we have a PU or "Rumsfeld" problem: Rummy

Here's an idea for a more general thing we could do:

Look at the matrix of (items, labels), under the assumption that only the 1s are valid observations.
The idea: you might not get instrument tags (eg voice) but you could get a tag that strongly implies an instrument tag (eg blues).
Treat it as a missing data problem and try to impute the 1s as you would in a collaborative filter. As long as we have enough observations of co-occurrence, this should be possible.
After imputation, anything that has a low score for the instrument categories can be inferred as a strong negative example.

@bmcfee, where is the mapping from AudioSet labels to OpenMIC 23 classes? I couldn't quickly find that anywhere?

https://github.com/cosmir/dev-set-builder/blob/ejh_20171122_audioset_features/scripts/namespaces.py#L119

ejhumphrey commented 6 years ago

Are you implying that we'd try to impute instrument labels from genres based on metadata alone? If so, I think I'd rather throw away potentially ambiguous data.

I tried to draw this out, but I think a concise description of the set space might be cleaner:

(a) A small amount (10%) of the data has strong instrument labels
(b) Almost half of the rest is some kind of "music"
(c) The other half is either (a) not music or (b) not an OpenMIC-23 instrument

I think I'm happy sampling negative examples from (c), whereas (b) might be an interesting set to apply a trained model to afterwards. Also (c)'s sample size is pretty large, and should contain enough acoustic diversity to have the desired effect.

As for the mapping, I see now our discussion unfolded in #4, not a separate issue... I'll create one now for that to not clutter this thread.

bmcfee commented 6 years ago

Are you implying that we'd try to impute instrument labels from genres based on metadata alone? If so, I think I'd rather throw away potentially ambiguous data.

Yup, that's exactly it. But the point here is to identify which samples are ambiguous and which are not.

(c) The other half is either (a) not music or (b) not an OpenMIC-23 instrument

Isn't it actually "not known to be music/instrument"? The problem is finding the true non-musics from that subset. Of course, we could just assume they're all non-music, and be right a good chunk of the time.

ejhumphrey commented 6 years ago

Not all classes are equally strong / weak ... from what I've seen empirically, higher level concepts, e.g. speech or music, tend to be strongly labeled, whereas finer classes (ukulele) tend to be asymmetric.

Taking a step back, there are two broad "kinds" of content that we want to include in our negative examples:

(a) Sounds that are amusical
Musical sounds that are not the instruments of interest

Filtering content that has not been tagged as music, or with a particular genre, should capture (a), and some of (b) will come in through other non-OpenMIC-23 instruments.

ejhumphrey commented 6 years ago

another question that's cropped up ... do we want to preserve the eval set for model building? or shuffle everything together (unbalanced train + eval) and k-fold cross validate? I'd prefer the latter, but if there's already inertia for respecting the "test" set, then ...?

bmcfee commented 6 years ago

do we want to preserve the eval set for model building?

I think that depends on how big the eval set is / what the coverage per instrument is after filtering down to the openmic vocab. Any sense of that?

If the eval split is sufficiently large, our lives will be easier if we stick to a single split. (Our purpose here is to make a model, not evaluate the learning algorithm over choices of train/val splits.)

bmcfee commented 6 years ago

(Pushed the updated label index)

ejhumphrey commented 6 years ago

agreed – I'm also keen to gobble up as much low hanging fruit as possible, while minimally spoiling AudioSet's eval data.

distributions look like the following ... eval set is a little flatter.


# training percentage
array([  1.107 ,   0.6609,   0.9328,   3.365 ,   2.0558,   0.7977,
         2.1167,  10.4033 (drums),   1.8594,  22.5515 (guitar),   0.8363,   0.7671,
         2.8448,   0.8993,   1.3806,   5.053 ,   1.178 ,   1.9651,
         1.0645,   1.4785,   2.0653,  11.179 ,  23.4385 (voice)])

# test percentages
array([  2.2546,   2.2546,   2.2176,   4.4352,   2.4763,   2.2176,
         2.5872,   9.8832 (drums),   2.2176,  14.6363 (guitar),   2.2065,   2.2176,
         4.7309,   2.3285,   4.5461,   3.4743,   2.2915,   2.2176,
         2.3655,   2.3285,   2.2176,   2.2176,  21.6773 (voice)])```

bmcfee commented 6 years ago

Cool. What are the raw counts though? I'm mainly interested in making sure that the (instrument-constrained) validation set is large enough to be trustworthy.

ejhumphrey commented 6 years ago

total frames (seconds) as follows:

array([ 610,  610,  600, 1200,  670,  600,  700, 2674,  600, 3960,  597,
        600, 1280,  630, 1230,  940,  620,  600,  640,  630,  600,  600,
       5865])

which I parse as at least 60 examples per class... so, no, probably not stable, but human verified?

cosmir / dev-set-builder