cosmir / openmic-2018

Tools and tutorials for the OpenMIC-2018 dataset.
MIT License

full data annotations #26

Open anaelisa24 opened 6 years ago

anaelisa24 commented 6 years ago

Hi, I'm working on a project about sound and instrument classification using active learning. We ran some initial experiments using OpenMIC for binary classification only, but we would like to extend them to multilabel classification. For this, we would need part of the data to be fully annotated (every instrument labeled for each clip) so that we can evaluate properly. I would greatly appreciate your input and help.

bmcfee commented 6 years ago

The eventual goal is to get the entire dataset completely annotated, but it would be nice off the bat to have a smaller slice with complete annotations, both for model development and evaluation.

Two options come to mind:

  1. Pull out a subset of the OpenMIC-2018 data and crowdsource full annotations. This will be costly, so the set will have to be relatively small (1,000 clips at most, I'm guessing). We'll have to work a bit to make sure the coverage is good.
  2. Pull an independent set of clips from the larger FMA pool that OpenMIC-2018 came from, using similar ranking and quantile sampling strategies (per instrument), then source complete annotations. This way, we avoid any potential contamination / long-term overfitting on OpenMIC-2018, but still get a representative sample of full annotations.
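
To make the per-instrument quantile sampling in option 2 concrete, here is a minimal Python sketch. It is not the procedure actually used to build OpenMIC-2018; the function name `quantile_sample`, the `scores` matrix (one hypothetical model likelihood per clip and instrument), and the parameters `n_bins` and `per_bin` are all assumptions made for illustration.

```python
import numpy as np


def quantile_sample(scores, n_bins=4, per_bin=2, seed=0):
    """Pick up to `per_bin` clips from each score quantile bin, per instrument.

    scores: array of shape (n_clips, n_instruments) holding hypothetical
    per-instrument likelihoods used to rank the candidate pool.
    Returns a sorted array of unique selected clip indices.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    selected = set()
    for k in range(scores.shape[1]):
        col = scores[:, k]
        # Interior quantile edges split the clips into n_bins groups of similar size.
        edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(col, edges)  # bin id in 0 .. n_bins - 1
        for b in range(n_bins):
            members = np.flatnonzero(bins == b)
            if members.size == 0:
                continue
            take = min(per_bin, members.size)
            # Sample without replacement so a bin never yields duplicates.
            selected.update(rng.choice(members, size=take, replace=False).tolist())
    return np.array(sorted(selected))


# Toy usage with random scores standing in for real model output.
demo_scores = np.random.default_rng(42).random((500, 20))
demo_picks = quantile_sample(demo_scores, n_bins=4, per_bin=10)
```

Sampling from every quantile bin, rather than only from the top-ranked clips, keeps likely negatives and uncertain cases in the set for each instrument. Because a clip can be drawn for several instruments, the number of unique clips can be smaller than `n_instruments * n_bins * per_bin`.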

(2) is obviously more work, but I think it's doable, and better all around. What do others think?