desh2608 opened this issue 3 years ago
Are JER/clustering metrics still of interest? I'd be up for adding them if I know the PRs would get accepted.
Hi Neville! Yeah, that would be awesome. JER is top-most on the list, but I can imagine people would be interested in other metrics as well.
(@popcornell and I want to switch from dscore to spyder in CHiME-7 DASR, but it is blocked by JER not being implemented yet.)
Ok, I can add this to the TODO list. I'm in the process of rewriting `dscore` to eliminate the `md-eval` dependency and output more detailed reporting. The initial version is based on `pyannote.metrics`, but between the penalty of Python being an interpreted language and the repeated calls to `uemify`, it's not particularly quick. So it's in my interest to get faster implementations of the various metrics, and I'd rather contribute to an existing project if possible.
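For concreteness, the scoring pattern in question is roughly the sketch below (toy data; `to_annotation` and `recordings` are placeholders of mine, not dscore code). Each per-recording call is where the internal cropping/`uemify` cost is paid.

```python
# Minimal sketch of a pyannote.metrics-based DER scoring loop (toy data).
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def to_annotation(turns):
    """Build a pyannote Annotation from (speaker, start, end) turns."""
    ann = Annotation()
    for speaker, start, end in turns:
        ann[Segment(start, end)] = speaker
    return ann

# One toy recording: reference and hypothesis turns.
recordings = [
    (
        [("A", 0.0, 2.0), ("B", 2.0, 4.0)],        # reference
        [("spk1", 0.0, 2.2), ("spk2", 2.2, 4.0)],  # hypothesis
    ),
]

metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
for ref_turns, hyp_turns in recordings:
    # Each call crops/uemifies reference and hypothesis internally,
    # which is where most of the runtime goes on large test sets.
    metric(to_annotation(ref_turns), to_annotation(hyp_turns))

print(f"DER = {abs(metric):.4f}")  # aggregated over all recordings
```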
Cool! Your contributions would be very welcome. In my benchmarking, I found pyannote.metrics to be an order of magnitude slower than md-eval.pl --- pyannote is a great tool overall, just not suitable for DER evaluation :)
I'm sure spyder would benefit immensely from your expertise. Please use this thread for any questions/discussions once you get around to implementing the metrics.
That sounds about right. When I benchmarked on the DIHARD III eval (full) condition, just the DER computation (omitting IO and building the `Annotation`/`Timeline` instances in memory) averaged over 13 seconds, compared to 3.5 seconds for running `md-eval`. Most of this comes from the call to `IdentificationErrorRate.uemify`, which constructs the equivalent of your `get_eval_regions`. Specifically, this block, which accounts for 10 seconds of that run time.
I've been updating `dscore` off and on for the past week for an LDC internal project and want to finish that work first, but will look into implementing JER in `spyder` afterwards. I think it should be relatively straightforward.
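Roughly, assuming per-speaker speech intervals and the optimal reference-to-system speaker mapping (the Hungarian assignment used for DER) are already available, the core of JER as defined for DIHARD is just an average per-speaker Jaccard error. A sketch, with all names below mine rather than dscore or spyder code (collar and overlap conventions omitted):

```python
# Rough sketch of JER: average over reference speakers of
# 1 - |intersection| / |union| against the mapped system speaker.
from typing import Dict, List, Tuple

Interval = Tuple[float, float]

def total_duration(intervals: List[Interval]) -> float:
    """Total duration covered by a set of intervals (overlaps merged)."""
    total, covered_to = 0.0, float("-inf")
    for start, end in sorted(intervals):
        start = max(start, covered_to)
        if end > start:
            total += end - start
            covered_to = end
    return total

def jer(ref: Dict[str, List[Interval]],
        hyp: Dict[str, List[Interval]],
        mapping: Dict[str, str]) -> float:
    """Jaccard error rate given per-speaker intervals and a ref->sys mapping."""
    per_speaker = []
    for ref_spk, ref_ivs in ref.items():
        sys_spk = mapping.get(ref_spk)
        if sys_spk is None:
            per_speaker.append(1.0)  # no system speaker mapped: 100% error
            continue
        sys_ivs = hyp[sys_spk]
        union = total_duration(ref_ivs + sys_ivs)
        inter = total_duration(ref_ivs) + total_duration(sys_ivs) - union
        per_speaker.append(1.0 - inter / union if union > 0 else 0.0)
    return sum(per_speaker) / len(per_speaker) if per_speaker else 0.0
```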
The following metrics (from pyannote and dscore) may be implemented: