fgnt / meeteval

MeetEval - A meeting transcription evaluation toolkit
MIT License

Computing time-constrained WER #17

Open desh2608 opened 1 year ago

desh2608 commented 1 year ago

I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:

If reference is STM and hypothesis is CTM, this may correspond to computing the asclite aWER metric, but we also want to support (i) other kinds of systems that may not provide word-level timestamps, and (ii) tighter penalty on segmentation by providing reference CTM.

Additionally, we also want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.

I am looking for suggestions about what would be a good metric (if one exists) for this scenario.

(cc @MartinKocour since we were having related discussions.)

thequilo commented 1 year ago

There is no straightforward answer to your problem, but the following might help to start a discussion.

We classify the WER algorithms with these three properties:

Could you elaborate further on your requirements regarding the properties defined above? Especially whether you have/want to use diarization labels.

If reference is STM and hypothesis is CTM, this may correspond to computing the asclite aWER metric,

Can you clarify what exactly you mean by asclite aWER? Is it the WER that is used in the libri-CSS publication?

From our understanding, the asclite WER from the libri-CSS publication does the following:

Currently we are working on a WER that considers diarization and time stamps (word or segment level). You can find it as tcpWER in this package, but we haven't decided yet which hyperparameters we want to suggest. We plan to publish it for the CHiME workshop.

Additionally, we also want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.

This is not "beyond the scope of this toolkit", but it is beyond our know-how. We think normalization is orthogonal to the actual WER calculation, and for that it might be better to use external tools from people who have more experience in this topic (e.g. language model people). One idea would be to use Kaldi, but we haven't thought about this until now. We are open to suggestions.

We have some more plans, but they are in an early stage and we don't want to talk about those yet in public. If you want, we could schedule a meeting or write in a slack channel to find your desired WER.

[^1]: By diarization we mean that segments of the same speaker get the same label assigned by the system. The WER should then find the best assignment, as is done in cpWER. Without diarization, the estimated label is ignored and the assignment is determined independently between segments/words.

desh2608 commented 1 year ago

Some clarifications:

By asclite WER, I meant exactly what you described. One problem with this metric seems to be that references are "loose" (i.e. STM files).

By "normalization" I meant providing multiple possible references (similar to what is done to compute Bleu scores in MT).

boeddeker commented 1 year ago

By "normalization" I meant providing multiple possible references (similar to what is done to compute Bleu scores in MT).

Could you give an example where multiple possible references are useful? If I remember correctly, the asclite tool mentions that it supports this (I don't know how), but all examples that I can imagine could be achieved via normalization. But I lack experience in this field.

In MT this is different, because translations have more degrees of freedom.

There are a few issues if we allowed a "graph" instead of a sequence of words for the reference:

desh2608 commented 1 year ago

In ASR, SCLITE and ASCLITE handle this through "GLM files". Basically, you provide rules for alternative renderings of words or phrases, such as I'm --> I am. Within the scoring tool, the reference is built as an acyclic directed graph (ADG) with multiple paths for the alternative references. The multi-dimensional Levenshtein distance is then computed over ADGs of reference and hypothesis, instead of over linear chains. I guess this is feasible in their case since we have time-marked segments. Without them, as you mention, the complexity would be very large.
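For word-level rules, the ADG can also be approximated by preprocessing: expand each reference into all alternative surface forms and take the best-scoring variant. A minimal sketch of that expansion idea (the rule dictionary below is hypothetical and does not follow the actual GLM file syntax):

```python
import itertools

# Hypothetical GLM-style alternation rules (illustration only;
# real GLM files use a different syntax).
RULES = {"i'm": ["i'm", "i am"], "gonna": ["gonna", "going to"]}

def expand(reference):
    """Yield every alternative reference string obtained by
    substituting each word with all of its allowed variants."""
    options = [RULES.get(w, [w]) for w in reference.split()]
    for combo in itertools.product(*options):
        yield ' '.join(combo)

variants = list(expand("i'm gonna go"))
# → 4 variants, including "i am going to go"
```

Scoring would then take the minimum edit distance over all variants, which matches the ADG alignment for purely word-level rules but blows up exponentially in the number of rule applications.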

In any case, this is a "desirable" but not "necessary" property to have. My main purpose in creating this issue was basically to get your insights on what kind of metrics would work for the task of long-form ASR and segmentation.

Edit: If you are planning to attend ICASSP, we can have more discussions then :)

boeddeker commented 1 year ago

Thanks for the explanation. Yes, with the timing information, the complexity can be significantly reduced. We will keep this in mind, but I don't know if we will find the time to figure out if a solution with reasonable complexity exists and then have the time to implement it. Partially, it can be solved by preprocessing.

My main purpose in creating this issue was basically to get your insights on what kind of metrics would work for the task of long-form ASR and segmentation.

There are different long-form ASR and segmentation systems and they are differently evaluated.

Let's say you build a "CSS pipeline" [1]. Some people stop before the diarization and want to evaluate "separation + ASR". In this case, you don't know the speaker labels for the segments. In such a situation, you could use asclite or ORC-WER [2], where asclite considers the temporal information while ORC-WER doesn't.

When you build a system that yields a "speaker-attributed transcription" with temporal information, the asclite tool ignores the "speaker-attributed" part of your estimation. For this situation, we implemented a time-constrained Levenshtein distance and replaced the classical Levenshtein distance in cpWER: the Time-Constrained minimum Permutation Word Error Rate (tcpWER).

We provide several options to account for different timing accuracies between reference and hypothesis. With "ctm" estimates, equidistant_intervals and no collar, you can only get a "correct" match or substitution error when words overlap. This might be what you want.
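The core idea of such a time-constrained comparison can be sketched as follows (a minimal illustration, not the actual meeteval implementation): a match or substitution is only allowed when the two words' time intervals overlap, so temporally disjoint words can only be aligned as a deletion plus an insertion.

```python
def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def time_constrained_levenshtein(ref, hyp):
    """Levenshtein distance over (word, start, end) tuples where a
    match/substitution is only allowed for temporally overlapping words."""
    # d[i][j]: distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i, (rw, rs, re) in enumerate(ref, 1):
        for j, (hw, hs, he) in enumerate(hyp, 1):
            cost = min(d[i - 1][j] + 1,  # deletion
                       d[i][j - 1] + 1)  # insertion
            if overlaps((rs, re), (hs, he)):
                cost = min(cost, d[i - 1][j - 1] + (rw != hw))
            d[i][j] = cost
    return d[len(ref)][len(hyp)]

ref = [('hello', 0.0, 0.5), ('world', 0.5, 1.0)]
hyp = [('hello', 0.1, 0.4), ('world', 5.0, 5.5)]  # second word far away in time
time_constrained_levenshtein(ref, hyp)  # → 2 (deletion + insertion)
```

Without the time constraint the distance would be 0; the constraint charges two errors for the correctly recognized but badly placed "world", which is the tighter segmentation penalty discussed above.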

Edit: If you are planning to attend ICASSP, we can have more discussions then :)

I am not there, but Thilo will attend the conference.

[1] https://arxiv.org/pdf/2011.02014.pdf [2] Actually, MIMO-WER and not ORC-WER is what you want to calculate, but ORC-WER is faster.

desh2608 commented 1 year ago

Thanks. For the models we are using now, we don't have speaker attribution. I am actually using the asclite WER at the moment, so it seems we are on the same page about that.

thequilo commented 1 year ago

I'll add a few more comments: