chenguoguo opened this issue 3 years ago
As Kaldi recipe development is converging, it's time to think about how we organize this text normalization as a post-processing step before WER calculation.
The processing is pretty simple, containing:
removal of "AH UM UH ER ERR HM ..." etc from both REF and HYP, to avoid error counting like
REF:AH THIS ...
HYP:UH THIS ...
these words are just meaningless in WER computation, and it's really HARD to keep them consistent between human transcriber and models(say AH UH ER may sound identical, human-annotators/models may yield these things differently from time to time)
removal of "-" hyphen from both REF and HYP, to avoid error counting like:
REF: T-SHIRT
HYP: T SHIRT
hyphen is somehow more frequent than I expected, our training text TN kept hyphen because T-SHIRT is indeed a meaningful word other than T SHIRT. But in testing and evaluation, removing them gives more robust and reasonable WER numbers.
Experiments with the Google API have shown that this processing can make a WER difference of up to 1-2% absolute or even more, so it is necessary for a consistent and fair comparison on spontaneous speech.
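To make the normalization concrete, here is a minimal sketch, assuming Kaldi-style "utt-id word1 word2 ..." lines on stdin and an illustrative filler list (the actual token set is whatever we agree on); this is not the official tool, just the idea:

```bash
#!/usr/bin/env bash
# Minimal normalization sketch (not the official GigaSpeech tool).
# Reads Kaldi-style "utt-id word1 word2 ..." lines on stdin, writes to stdout.
set -euo pipefail

# Illustrative filler list; the agreed-upon set may differ.
fillers='^(AH|UM|UH|ER|ERR|HM)$'

awk -v filler="$fillers" '{
  printf "%s", $1                   # keep the utterance id untouched
  for (i = 2; i <= NF; i++) {
    gsub(/-/, " ", $i)              # T-SHIRT -> "T SHIRT"
    if ($i !~ filler)               # drop filler tokens
      printf " %s", $i
  }
  print ""
}'
```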
Now the problem is how we organize the post-processing.
Options that I can think of for now:
Which one is better? Any preferences or better suggestions? I personally vote for the first solution. @wangyongqing0731 @chenguoguo @sw005320
I prefer option one, that is to provide working scripts for each downstream toolkit. Here is what I have in mind:
- Under each toolkit, we have a script to handle the post-processing, which takes care of the toolkit-specific stuff, e.g., toolkits/kaldi/gigaspeech_asr_post_processing.sh
- The toolkit-specific script internally calls a common script, e.g., utils/asr_post_processing.sh, which does the actual work. This way, if we have to update the post-processing, we only have to update one place. (A sketch of such a wrapper is below.)
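A hypothetical Kaldi wrapper along these lines (file names follow the example paths above; the shared script is assumed to read text on stdin and write normalized text on stdout, like the sketch earlier in this thread):

```bash
#!/usr/bin/env bash
# Hypothetical toolkits/kaldi/gigaspeech_asr_post_processing.sh:
# handle any Kaldi-specific details here, then delegate the actual
# normalization to the shared script, so there is only one place to update.
set -euo pipefail

ref=$1   # Kaldi-style reference text file ("utt-id word1 word2 ...")
hyp=$2   # Kaldi-style hypothesis text file

utils/asr_post_processing.sh < "$ref" > "${ref}.post"
utils/asr_post_processing.sh < "$hyp" > "${hyp}.post"
```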
Sounds good to me.
One thing that I haven't decided yet is whether or not we should provide the scoring tool. If we do, we can make sure that everyone is using the same tool for scoring and the results will be comparable, but it definitely involves more work.
The reason I stick to sclite is that it is made by NIST and has been used in various official ASR benchmarks. sclite also has various analysis tools. As far as I know, Kaldi scoring produces the same result, so it should be no problem.
If we use different toolkits, I recommend at least outputting the total number of words/sentences in the reference, and possibly the sub/del/ins breakdown, e.g.:
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_asr_model_valid.acc.ave/dev | 2043 | 51075 | 92.9 | 4.5 | 2.6 | 2.1 | 9.2 | 65.6 |
decode_asr_asr_model_valid.acc.ave/test | 9627 | 175116 | 90.5 | 7.0 | 2.5 | 6.1 | 15.6 | 69.3 |
As long as the total number of words/sentences is the same, the results are comparable (and we can easily detect problems in the data preparation or normalization when we check it). The sub/del/ins error breakdown can be used to detect DP matching issues in the edit-distance computation, as well as some format errors. For example, in the above case there are significantly more insertion errors on the test set; we may have some alignment or reference issues there, and I actually found some based on this number and already reported them to you.
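For reference, a breakdown like the table above can be produced with an sclite invocation along these lines (file names are placeholders; REF/HYP are assumed to be in trn format, i.e. "word1 word2 ... (utterance-id)" per line):

```bash
# Score with sclite from NIST SCTK; file names here are placeholders.
sclite -r ref.trn trn -h hyp.trn trn -i rm -o all stdout > score_report.txt

# The "Sum/Avg" row of the system summary contains the
# Snt/Wrd/Corr/Sub/Del/Ins/Err/S.Err numbers shown in the table above.
grep 'Sum/Avg' score_report.txt
```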
I just added a simple scoring tool via https://github.com/SpeechColab/GigaSpeech/pull/35; it uses sclite to evaluate REF against HYP.
Before evaluation, the tool applies the very simple text processing that we discussed in this issue, and I think we should keep this processing simple and stable after release.
Besides, the tool lives in utils/ instead of toolkits/xxx/, because sclite is framework independent. Recipe developers and researchers can use it when they really want an apples-to-apples evaluation.
We finally provide a recommended scoring script based on sclite: https://github.com/SpeechColab/GigaSpeech/blob/main/utils/gigaspeech_scoring.py. Researchers may use this tool if they want consistent comparisons across different systems.
We'd better leave this issue open for a while, so people can read the discussion above.
We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.