RicherMans / Datadriven-GPVAD

The codebase for Data-driven general-purpose voice activity detection.

How was the ground truth in the article set? How can it be obtained? #16

Open · Ko-vey opened this issue 2 years ago

Ko-vey commented 2 years ago

How was the ground truth in the article set? How can it be obtained?

RicherMans commented 2 years ago

Sorry, can you explain exactly what you are referring to as the ground truth?

RicherMans commented 2 years ago

It's from the DCASE 2018 and 2019 datasets; their evaluation sets are strongly labeled.

Ko-vey commented 2 years ago

> Sorry, can you explain exactly what you are referring to as the ground truth?

[Two screenshots omitted]

Thanks for your brilliant work and patience! There are a few questions I would like to ask:

  1. Like the pictures shown above, how was the ground-truth label for the speech activity periods set for evaluation and comparison?
  2. How do we know the performance of a student model trained with the help of a teacher model (t1 or t2) on a new dataset without exact frame-level labels?
  3. I am currently working on a birdcall activity detection task based on your model, but after replacing the speech label with a birdcall label in the teacher-student approach on the AudioSet balanced subset, the new student model seemed to learn nothing from t1 and performed poorly on bird audio files. Could you give some advice on how to train a proper model?
RicherMans commented 2 years ago

Oh hey, yeah no problem with these questions:

> Like the pictures shown above, how was the ground-truth label for the speech activity periods set for evaluation and comparison?

It's manually labeled by the DCASE authors; nothing special here, and it's not predicted by any of my models. All these datasets are publicly available here.
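For evaluation, those strong (onset/offset) annotations just need to be expanded into per-frame targets. Below is a minimal sketch of that conversion, assuming a DCASE-style TSV with `filename`/`onset`/`offset`/`event_label` columns and an assumed 20 ms output frame shift; none of this is taken from the repository itself:

```python
# Sketch (not from the repo): expand DCASE-style strong labels,
# one (filename, onset, offset, event_label) row per event,
# into per-frame binary speech targets for evaluation.
import numpy as np
import pandas as pd

FRAME_SHIFT = 0.02  # seconds per output frame; an assumed resolution


def strong_labels_to_frames(tsv_path: str, filename: str, duration: float) -> np.ndarray:
    """Return a {0,1} vector with one entry per model output frame."""
    df = pd.read_csv(tsv_path, sep="\t")
    n_frames = int(np.ceil(duration / FRAME_SHIFT))
    target = np.zeros(n_frames, dtype=np.int64)
    rows = df[(df["filename"] == filename) & (df["event_label"] == "Speech")]
    for _, row in rows.iterrows():
        start = int(row["onset"] / FRAME_SHIFT)
        end = int(np.ceil(row["offset"] / FRAME_SHIFT))
        target[start:end] = 1  # mark every frame inside the annotated segment
    return target
```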

> How do we know the performance of a student model trained with the help of a teacher model (t1 or t2) on a new dataset without exact frame-level labels?

I mean, you can use some external dataset for cross-validation during training. I forget what I used for validation, or whether I did any at all. Usually this approach should work.
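As a concrete illustration, validation could be as simple as thresholding the student's per-frame speech posteriors and scoring them against a small strongly labeled set. A hedged sketch with placeholder shapes, not the repository's actual API:

```python
# Sketch: frame-level F1 against a small strongly labeled validation set.
import numpy as np
from sklearn.metrics import f1_score


def validate_frames(posteriors: np.ndarray, targets: np.ndarray, thr: float = 0.5) -> float:
    """posteriors, targets: (n_frames,) arrays; returns frame-level F1."""
    preds = (posteriors >= thr).astype(int)
    return f1_score(targets, preds)
```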

> I am currently working on a birdcall activity detection task based on your model, but after replacing the speech label with a birdcall label in the teacher-student approach on the AudioSet balanced subset, the new student model seemed to learn nothing from t1 and performed poorly on bird audio files. Could you give some advice on how to train a proper model?

Oh yeah, that's an interesting task! So my current teacher model is pretty bad compared to some other models on AudioSet. But you need to recall that roughly 40% of all labels in AudioSet are speech, and that the labeling of this "Speech" class is rather precise (because, I mean, it's speech; nothing complicated). Just recall that my model has seen at least ~2 million samples containing speech. Your bird task, on the other hand, is much more complicated. Furthermore, the AudioSet labels might not be "optimal", to say the least, since many different labels describe birds. Also, "bird"-related classes are pretty rare compared to "Speech". Even if a model achieves a high mAP on the dataset, that does not mean it can effectively predict these classes.
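One practical consequence: the many bird-related AudioSet classes would likely need to be collapsed into a single binary target, analogous to the single "Speech" label. A sketch of that relabeling; the class names below are illustrative and should be checked against the official AudioSet ontology:

```python
# Sketch: collapse AudioSet's bird-related classes into one binary "bird"
# target, the analogue of the single "Speech" label. Class names are
# assumptions; verify them against the AudioSet ontology file.
BIRD_CLASSES = {
    "Bird",
    "Bird vocalization, bird call, bird song",
    "Chirp, tweet",
    "Squawk",
    "Pigeon, dove",
    "Crow",
    "Owl",
}


def to_bird_label(clip_labels: set[str]) -> int:
    """1 if any bird-related class appears among the clip's weak labels."""
    return int(bool(clip_labels & BIRD_CLASSES))
```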

I recommend that you first fine-tune your teacher model on a bird-specific dataset, then re-estimate the labels on the balanced subset, and then train a student. It might be worth a try!
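A high-level sketch of that recipe in PyTorch, with placeholder objects (`teacher`, `student`, `bird_loader`, `balanced_loader`) and assumed loss/optimizer choices; it is not the repository's actual training code:

```python
# Sketch of: (1) fine-tune the teacher on a strongly labeled bird dataset,
# (2) re-estimate soft frame labels on the AudioSet balanced subset,
# (3) train a student against those pseudo-labels. Models are assumed to
# output per-frame posteriors in [0, 1] of shape (batch, frames).
import torch


def finetune_then_distill(teacher, student, bird_loader, balanced_loader, device="cpu"):
    bce = torch.nn.BCELoss()

    # 1) Fine-tune the teacher on bird-specific frame-level labels.
    opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)
    teacher.train()
    for audio, frame_targets in bird_loader:  # frame_targets: (B, T) in {0,1}
        probs = teacher(audio.to(device))
        loss = bce(probs, frame_targets.to(device).float())
        opt_t.zero_grad()
        loss.backward()
        opt_t.step()

    # 2) + 3) Pseudo-label the balanced subset with the fine-tuned teacher,
    # then train the student on those soft targets.
    teacher.eval()
    opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)
    for audio, _ in balanced_loader:
        with torch.no_grad():
            pseudo = teacher(audio.to(device))  # soft targets in [0, 1]
        loss = bce(student(audio.to(device)), pseudo)
        opt_s.zero_grad()
        loss.backward()
        opt_s.step()
```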