hbredin opened this issue 3 years ago

Nice package 👍

I am wondering whether it would make sense to use γ inter-annotator agreement for evaluating speaker diarization systems (in place of good old diarization error rate, aka DER):

- I understand (maybe incorrectly) that both annotators need to use the same set of speaker labels. How would you handle the case where the two annotators use different sets of labels? Would you need to match them first (like what is already done in DER)?
- How would you choose the (temporal) alpha and (categorical) beta weights?
Thanks sensei!
It would indeed make sense, and we've intensely thought about it! We're just not entirely sure yet...
Regarding the reference/hypothesis problem, we already have a very experimental solution implemented (which we haven't documented yet, because it hasn't really been tested or proven to work correctly): when computing gamma, you can set one or more annotators as "ground truth", which tells pygamma-agreement to only sample random annotations from these ground-truth annotators.
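Something like this (a minimal, untested sketch; since the feature is experimental, double-check the exact keyword name against continuum.py in your installed version):

```python
from pyannote.core import Segment
from pygamma_agreement import Continuum, CombinedCategoricalDissimilarity

# Reference and system hypothesis as two "annotators" of one continuum.
continuum = Continuum()
continuum.add("reference", Segment(0.0, 4.5), "spk_A")
continuum.add("reference", Segment(4.5, 9.0), "spk_B")
continuum.add("system", Segment(0.2, 4.8), "spk_A")
continuum.add("system", Segment(4.8, 9.1), "spk_B")

dissim = CombinedCategoricalDissimilarity(alpha=1.0, beta=1.0)

# Experimental: estimate chance agreement by sampling random annotations
# from the reference annotator only, instead of from all annotators.
gamma_results = continuum.compute_gamma(
    dissim,
    ground_truth_annotators=["reference"],
)
print(gamma_results.gamma)
```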
Regarding the case where annotators use different sets of labels, we would indeed have to match them first (in the same way DER does it). As a side note, this would still work pretty nicely with the current CombinedCategoricalDissimilarity implementation, using the cat_dissimilarity_matrix argument; see the sketch below.
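For instance, if annotator 1 uses labels {A, B} and annotator 2 uses {1, 2}, and a DER-style matching maps A↔1 and B↔2, matched pairs could be given zero categorical dissimilarity (a sketch; argument names and the internal category ordering should be checked against your installed version):

```python
import numpy as np
from pygamma_agreement import CombinedCategoricalDissimilarity

# Categories in the sorted order assumed to be used internally.
categories = ["1", "2", "A", "B"]

# Zero dissimilarity between matched labels (A<->1, B<->2),
# maximal dissimilarity otherwise.
#                  "1"  "2"  "A"  "B"
matrix = np.array([[0.0, 1.0, 0.0, 1.0],   # "1"
                   [1.0, 0.0, 1.0, 0.0],   # "2"
                   [0.0, 1.0, 0.0, 1.0],   # "A"
                   [1.0, 0.0, 1.0, 0.0]])  # "B"

dissim = CombinedCategoricalDissimilarity(
    categories,
    alpha=1.0,  # weight of the temporal term
    beta=1.0,   # weight of the categorical term
    cat_dissimilarity_matrix=matrix,
)
```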
Regarding alpha and beta, I'm leaving that to @Rachine :thinking:
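For context, though: both weights enter linearly in the combined dissimilarity of Mathet et al. (2015), i.e. d(u, v) = alpha · d_pos(u, v) + beta · d_cat(u, v), so they directly trade off temporal versus label disagreement. A simplified, self-contained sketch (ignoring the delta_empty scaling):

```python
def positional_dissim(u, v):
    # Mathet et al. (2015): squared, length-normalized distance
    # between the endpoints of the two units (start, end, label).
    (start_u, end_u, _), (start_v, end_v, _) = u, v
    num = abs(start_u - start_v) + abs(end_u - end_v)
    den = (end_u - start_u) + (end_v - start_v)
    return (num / den) ** 2

def combined_dissim(u, v, alpha=1.0, beta=1.0):
    # Categorical term: 0/1 here, or a value from a custom matrix.
    cat = 0.0 if u[2] == v[2] else 1.0
    return alpha * positional_dissim(u, v) + beta * cat

# Two overlapping units labeled with the same speaker:
print(combined_dissim((0.0, 4.5, "spk_A"), (0.2, 4.8, "spk_A")))
```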
Hello Hervé! Thank you! This is a question we want to explore and have discussed a lot!
We tried to apply γ as a replacement for IER, and the behaviors were not consistent at all. I think the two frameworks are very similar, but there are differences to take into account: γ has some limitations and needs adaptations.
One of these differences is the asymmetry between the reference and the hypothesized annotations: we do not want the agreement to vary across diarization systems. There is one option to specify this: https://github.com/bootphon/pygamma-agreement/blob/master/pygamma_agreement/continuum.py#L173

Thank you both for your detailed answers.
To summarize my understanding: using this metric for speaker diarization is not that obvious and remains an open research question.
Thinking out loud: maybe its use for combining multiple speaker diarization systems would be something to look at as well (in the same spirit as https://github.com/desh2608/dover-lap/ by @desh2608).