The evaluation metric picked in the current implementation is not really measuring forced alignment quality, but rather the degree of total time covered by the hypothesis timeframes. In other words, the implementation uses "detection" metrics, while what we want is segment-by-segment coverage.
I'd like to re-implement `evaluate.py` with a new set of hypothesis MMIF files run on the non-gold text, using other metrics available in the `pyannote` library.
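To illustrate the distinction, here is a minimal sketch (not the actual `evaluate.py` code) using `pyannote.metrics`, assuming the aligned tokens are stored as segment labels on a `pyannote.core.Annotation`. A detection metric reports a perfect score for an alignment whose words are all wrong, while a label-aware metric such as `IdentificationErrorRate` evaluates each segment against its label:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionErrorRate
from pyannote.metrics.identification import IdentificationErrorRate

# Gold alignment: two words with their time spans.
reference = Annotation()
reference[Segment(0.0, 0.5)] = "hello"
reference[Segment(0.5, 1.0)] = "world"

# Hypothesis covering the exact same total time, but with the words swapped.
hypothesis = Annotation()
hypothesis[Segment(0.0, 0.5)] = "world"
hypothesis[Segment(0.5, 1.0)] = "hello"

# Detection metrics ignore labels: perfect score despite the wrong alignment.
print(DetectionErrorRate()(reference, hypothesis))       # 0.0

# Identification metrics compare labels segment by segment: every second of
# audio carries the wrong word here, so the error rate is 1.0.
print(IdentificationErrorRate()(reference, hypothesis))  # 1.0
```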
Evaluation of `gentle-wrapper` in https://github.com/clamsproject/aapb-evaluations/pull/29 should be re-done accordingly.