Open chrisbrickhouse opened 8 months ago
More details on diarization scoring, following further research (see links in prior comment).

The `dscore` metrics below seem the best fits for our expected use case, but all metrics will be supported:

- `H(ref|sys)` takes into account performance on overlapping speech segments, which is important for the conversation genre and for researchers looking at turn-taking. Because we expect this to be used in cases where a reference is not known, measuring how much additional information is needed to describe the reference given the system output (i.e., how much the system missed) is more useful than knowing how well the reference describes the system output.
- `MI` and `NMI` describe how helpful knowing one diarization is for figuring out the other. In our expected use case, this is really just a check on `H(ref|sys)`, since we expect users to be using `sys` to obtain a quasi-`ref`.
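To make the relationship between these quantities concrete, here is a minimal sketch that computes them from two frame-level label sequences. It is not how `dscore` computes them. The function name `entropy_metrics`, the one-label-per-frame input (so overlapping speech is not modelled), and the geometric-mean normalization for NMI are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy_metrics(ref_labels, sys_labels):
    """Return (H(ref|sys), MI, NMI) in bits for two aligned label sequences.

    Hypothetical helper: assumes exactly one speaker label per frame,
    so overlapping speech is NOT handled here.
    """
    if len(ref_labels) != len(sys_labels):
        raise ValueError("ref and sys must cover the same frames")
    n = len(ref_labels)
    if n == 0:
        raise ValueError("need at least one frame")

    def entropy(counts):
        # Shannon entropy of the empirical distribution given by the counts.
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    h_ref = entropy(Counter(ref_labels))
    h_sys = entropy(Counter(sys_labels))
    h_joint = entropy(Counter(zip(ref_labels, sys_labels)))

    # Chain rule: H(ref|sys) = H(ref, sys) - H(sys).
    h_ref_given_sys = h_joint - h_sys
    # Mutual information: MI = H(ref) - H(ref|sys).
    mi = h_ref - h_ref_given_sys
    # Assumed normalization: MI / sqrt(H(ref) * H(sys)).
    # Undefined when either entropy is zero; this sketch returns 0.0 then.
    denom = math.sqrt(h_ref * h_sys)
    nmi = mi / denom if denom > 0 else 0.0
    return h_ref_given_sys, mi, nmi
```

When `sys_labels` partitions the frames exactly as `ref_labels` does (up to renaming speakers), `H(ref|sys)` is zero: nothing more is needed to describe the reference. When the system collapses everything into one speaker, `H(ref|sys)` equals `H(ref)` and MI drops to zero.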
The tests so far are tightly coupled to the implementation rather than the interface. Changes to the internals often cause tests to fail because of small changes to timestamps or chunking (e.g. #5). This slows development, because tests need to be fixed even though nothing is broken. The main reason the tests were written this way is that we don't want to push code that results in worse transcriptions than the prior version, since that's an obvious regression. The proper way to test for this, though, would be to do something similar to code coverage checks: measure the quality of the output and check that the new state is no worse than the previous one.
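One way such a check might look is sketched below. It is not existing project code. It compares a quality score against a stored baseline instead of asserting exact timestamps or chunk boundaries. Word error rate is used only as a stand-in metric, and the names `word_error_rate`, `is_regression`, and the `tolerance` value are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # Single-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = cur
    return prev[-1] / max(len(ref_words), 1)


def is_regression(new_score, baseline_score, tolerance=0.01):
    """True if the new error rate is worse than the baseline by more than tolerance.

    The tolerance is an assumed slack so that small, harmless changes
    (e.g. slightly different chunking) do not fail the check.
    """
    return new_score > baseline_score + tolerance
```

A test built this way would fail only when the score drops below the previous version's by more than the tolerance, so internal refactors that shift timestamps slightly would pass as long as quality holds.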