hbredin opened this issue 3 years ago

Nice package 👍

I am wondering whether it would make sense to use γ inter-annotator agreement for evaluating speaker diarization systems (in place of good old diarization error rate, aka DER):

- I understand (maybe incorrectly) that both annotators need to use the same set of speaker labels. How would you handle the case where the two annotators use different sets of labels? Would you need to match them first (like what is already done in DER)?
- How would you choose the (temporal) alpha and (categorical) beta weights?
Thanks sensei!
It would indeed make sense, and we've intensely thought about it! We're just not entirely sure yet...
Regarding the reference/hypothesis problem, we already have a very experimental solution implemented (which we haven't documented yet, because it hasn't really been tested or proven to work correctly): when computing gamma, you can set one or more annotators as "ground truth", which tells pygamma-agreement to only sample random annotations from these ground-truth annotators.
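Something like this (a minimal, untested sketch; since the feature is experimental, double-check the exact keyword name against continuum.py in your installed version):

```python
from pyannote.core import Segment
from pygamma_agreement import Continuum, CombinedCategoricalDissimilarity

# Reference and system hypothesis as two "annotators" of one continuum.
continuum = Continuum()
continuum.add("reference", Segment(0.0, 4.5), "spk_A")
continuum.add("reference", Segment(4.5, 9.0), "spk_B")
continuum.add("system", Segment(0.2, 4.8), "spk_A")
continuum.add("system", Segment(4.8, 9.1), "spk_B")

dissim = CombinedCategoricalDissimilarity(alpha=1.0, beta=1.0)

# Experimental: estimate chance agreement by sampling random annotations
# from the reference annotator only, instead of from all annotators.
gamma_results = continuum.compute_gamma(
    dissim,
    ground_truth_annotators=["reference"],
)
print(gamma_results.gamma)
```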
Regarding the case where annotators use different sets of labels, we would indeed have to match them first (in the same way DER does it). As a side note, this would still work pretty nicely with the current CombinedCategoricalDissimilarity implementation, using the cat_dissimilarity_matrix argument; see the sketch below.
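For instance, if annotator 1 uses labels {A, B} and annotator 2 uses {1, 2}, and a DER-style matching maps A↔1 and B↔2, matched pairs could be given zero categorical dissimilarity (a sketch; argument names and the internal category ordering should be checked against your installed version):

```python
import numpy as np
from pygamma_agreement import CombinedCategoricalDissimilarity

# Categories in the sorted order assumed to be used internally.
categories = ["1", "2", "A", "B"]

# Zero dissimilarity between matched labels (A<->1, B<->2),
# maximal dissimilarity otherwise.
#                  "1"  "2"  "A"  "B"
matrix = np.array([[0.0, 1.0, 0.0, 1.0],   # "1"
                   [1.0, 0.0, 1.0, 0.0],   # "2"
                   [0.0, 1.0, 0.0, 1.0],   # "A"
                   [1.0, 0.0, 1.0, 0.0]])  # "B"

dissim = CombinedCategoricalDissimilarity(
    categories,
    alpha=1.0,  # weight of the temporal term
    beta=1.0,   # weight of the categorical term
    cat_dissimilarity_matrix=matrix,
)
```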
Regarding alpha and beta, I'm leaving that to @Rachine :thinking:
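For context, though: both weights enter linearly in the combined dissimilarity of Mathet et al. (2015), i.e. d(u, v) = alpha · d_pos(u, v) + beta · d_cat(u, v), so they directly trade off temporal versus label disagreement. A simplified, self-contained sketch (ignoring the delta_empty scaling):

```python
def positional_dissim(u, v):
    # Mathet et al. (2015): squared, length-normalized distance
    # between the endpoints of the two units (start, end, label).
    (start_u, end_u, _), (start_v, end_v, _) = u, v
    num = abs(start_u - start_v) + abs(end_u - end_v)
    den = (end_u - start_u) + (end_v - start_v)
    return (num / den) ** 2

def combined_dissim(u, v, alpha=1.0, beta=1.0):
    # Categorical term: 0/1 here, or a value from a custom matrix.
    cat = 0.0 if u[2] == v[2] else 1.0
    return alpha * positional_dissim(u, v) + beta * cat

# Two overlapping units labeled with the same speaker:
print(combined_dissim((0.0, 4.5, "spk_A"), (0.2, 4.8, "spk_A")))
```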
Hello Hervé! Thank you! This is a question we want to explore and have discussed a lot!
We tried to apply γ as a replacement for IER, and the behaviors were not consistent at all. I think the two frameworks are very similar, but there are differences to take into account: γ has some limitations and needs adaptations.
One of these differences is the asymmetry between the reference and the hypothesized annotations: we do not want the agreement to vary across diarization systems. There is one option to specify this: https://github.com/bootphon/pygamma-agreement/blob/master/pygamma_agreement/continuum.py#L173

Thank you both for your detailed answers.
To summarize my understanding: using this metric for speaker diarization is not that obvious and remains an open research question.
Thinking out loud: maybe its use for combining multiple speaker diarization systems would be something to look at as well (in the same spirit as https://github.com/desh2608/dover-lap/ by @desh2608).