[Testing] Evaluation metrics for comparing model performance to known standard

chrisbrickhouse commented 8 months ago

The tests so far are tightly coupled to the implementation rather than the interface. Changes to the internals often cause tests to fail because of small changes to timestamps or chunking (e.g. #5). This slows development, because tests need to be fixed even though nothing is broken. The tests are written this way because we don't want to push code that produces worse transcriptions than the prior version, since that's an obvious regression. The proper way to test for this, though, would be to do something similar to code coverage checks and verify that the new state is no worse than the previous one.

- [ ] Implement word error rate metric (probably using jitsi/jiwer)
- [ ] Implement metric for timestamp difference. First pass might just be the correlation coefficient of the start and end timestamps. Also consider dynamic time warping.
- [ ] Calculate baseline metrics given a hand-corrected transcript/textgrid
- [ ] Create workflow for PRs that compares new code's metrics against baseline
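
A rough sketch of what the first two metrics and the baseline comparison could look like, in plain Python so it runs without dependencies. This is not the planned implementation: the issue proposes jiwer for word error rate, and the hand-rolled `word_error_rate` below only illustrates the quantity being measured. The function names, the metric keys `"wer"` and `"timestamp_r"`, and the `tolerance` parameter are all hypothetical. The correlation also assumes the reference and system intervals are already paired one-to-one, which is exactly the case dynamic time warping would be needed to relax.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def pearson(xs, ys) -> float:
    """Pearson correlation of two equally long sequences of timestamps."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def is_regression(new: dict, baseline: dict, tolerance: float = 0.0) -> bool:
    """True if the new metrics are worse than the baseline.

    Hypothetical metric keys: "wer" (lower is better) and
    "timestamp_r" (higher is better).
    """
    return (
        new["wer"] > baseline["wer"] + tolerance
        or new["timestamp_r"] < baseline["timestamp_r"] - tolerance
    )


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Start times shifted by a constant offset still correlate perfectly,
# which shows why correlation alone cannot detect a systematic offset.
r = pearson([0.0, 1.0, 2.0], [0.1, 1.1, 2.1])
```

A PR workflow along these lines would compute the metrics on the hand-corrected material, load the stored baseline, and fail only when `is_regression` returns true, instead of failing on any change to timestamps or chunking.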

chrisbrickhouse commented 4 months ago

https://github.com/nryant/dscore

https://web.archive.org/web/20170119114252/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf

chrisbrickhouse commented 4 months ago

More details on diarization scoring after further research (see links in prior comment).

The best library I've found for diarization scoring is nryant/dscore.
We will focus on the following evaluation metrics supplied by dscore, which seem the best fit for our expected use case, but all metrics will be supported:
- Diarization Error Rate from the NIST RT-09 evaluation plan, developed for "conference room" type data. This metric is easy to explain to non-computational linguists and works well for the conversational genre we expect fave-asr to be used on.
- H(ref|sys) takes into account performance on overlapping speech segments, which is important for the conversation genre and for researchers looking at turn-taking. Because we expect this to be used in cases where a reference is not known, measuring how much additional information is needed to describe the reference given the system output (i.e., how much the system missed) is more useful than knowing how well the reference describes the system output.
- MI and NMI describe how helpful knowing one diarization is for figuring out the other. In our expected use case, this is really just a check on H(ref|sys), since we expect users to be using sys to obtain a quasi-ref.
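
To make the information-theoretic metrics concrete, here is a minimal sketch of H(ref|sys), MI, and NMI computed from frame-level speaker labels. It is not dscore's code or API. It assumes the two diarizations have already been sampled onto the same frames with exactly one label per frame, so overlapping speech would need extra handling. The function name is hypothetical, NMI is normalized by the geometric mean of the two entropies (one common convention, chosen here as an assumption), and the value returned when that normalizer is zero is also just a choice made for this sketch.

```python
import math
from collections import Counter


def diarization_information_metrics(ref_labels, sys_labels):
    """Return (H(ref|sys), MI, NMI) in bits for two frame-aligned labelings."""
    if len(ref_labels) != len(sys_labels) or not ref_labels:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(ref_labels)
    joint = Counter(zip(ref_labels, sys_labels))  # frames per (ref, sys) pair
    ref_counts = Counter(ref_labels)
    sys_counts = Counter(sys_labels)

    # Marginal entropies of each labeling.
    h_ref = -sum(c / n * math.log2(c / n) for c in ref_counts.values())
    h_sys = -sum(c / n * math.log2(c / n) for c in sys_counts.values())

    # H(ref|sys): information still needed to describe the reference
    # once the system output is known. Zero means sys determines ref.
    h_ref_given_sys = -sum(
        c / n * math.log2(c / sys_counts[s]) for (_, s), c in joint.items()
    )

    # Mutual information: what the two labelings share.
    mi = h_ref - h_ref_given_sys

    # Assumed normalization: geometric mean of the marginal entropies.
    denom = math.sqrt(abs(h_ref * h_sys))
    if denom == 0:
        # Sketch-only convention: two single-speaker labelings agree fully.
        nmi = 1.0 if h_ref == h_sys else 0.0
    else:
        nmi = mi / denom
    return h_ref_given_sys, mi, nmi


# System speaker names differ from the reference, but the partition is the same.
perfect = diarization_information_metrics(
    ["A", "A", "B", "B"], ["x", "x", "y", "y"]
)
# System merges both speakers into one, so the full bit of H(ref) is missing.
merged = diarization_information_metrics(
    ["A", "A", "B", "B"], ["x", "x", "x", "x"]
)
```

In the first call H(ref|sys) is 0 bits because the system output fully determines the reference; in the second it is 1 bit, which is the "how much the system missed" reading described above.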

Forced-Alignment-and-Vowel-Extraction / fave-asr

[Testing] Evaluation metrics for comparing model performance to known standard #12