SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

Adding scoring scripts. #24

Open chenguoguo opened 3 years ago

chenguoguo commented 3 years ago

We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.

dophist commented 3 years ago

As the Kaldi recipe development is converging, it's time to think about how we organize this text normalization as a post-processing step before WER calculation.

The processing is pretty simple, consisting of:

  1. removal of "AH UM UH ER ERR HM ..." etc. from both REF and HYP, to avoid error counting like

    REF: AH THIS ...
    HYP: UH THIS ...

    These filler words are just meaningless in WER computation, and it's really HARD to keep them consistent between human transcribers and models (say AH, UH and ER may sound identical, so human annotators and models may yield them differently from time to time).

  2. removal of the hyphen "-" from both REF and HYP, to avoid error counting like:

    REF: T-SHIRT
    HYP: T SHIRT

    Hyphens are somewhat more frequent than I expected; our training-text normalization kept the hyphen because T-SHIRT is indeed a meaningful word, rather than T SHIRT. But in testing and evaluation, removing hyphens gives more robust and reasonable WER numbers.

Results from the Google API have shown that this processing can make a WER difference of up to 1-2% absolute or even more, so it is necessary for a consistent/fair comparison on spontaneous speech.
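To make the two steps concrete, here is a minimal Python sketch of the normalization described above. It is illustrative only, not the official GigaSpeech script; the filler-word list and the function name are assumptions.

    # Illustrative filler list; the actual list used by GigaSpeech may differ.
    FILLERS = {"AH", "UM", "UH", "ER", "ERR", "HM", "HMM"}

    def normalize_for_scoring(text: str) -> str:
        """Uppercase, split hyphenated words, and drop filler tokens,
        so that REF and HYP are treated consistently before WER scoring."""
        text = text.upper().replace("-", " ")  # T-SHIRT -> T SHIRT
        tokens = [t for t in text.split() if t not in FILLERS]
        return " ".join(tokens)

    # Both sides normalize to "THIS IS MY T SHIRT", so no spurious errors are counted:
    print(normalize_for_scoring("AH THIS IS MY T-SHIRT"))
    print(normalize_for_scoring("UH THIS IS MY T SHIRT"))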

Now the problem is how we organize the post-processing:

options that I can think of for now:

  1. We can provide a different text processing script for each downstream framework, e.g. GIGASPEECH_REPO/toolkits/{kaldi,espnet,...}/asr-text-post-processing.py. In this case, we KNOW the detailed format of each toolkit. With agreement, these can also go into the downstream recipe code directly.
  2. Or we can have a single example Python function/awk command in GIGASPEECH_REPO/util/ that deals with pure text instead of specifically formatted files, and let downstream recipe developers decide to import/refer to it in their code when they feel it is appropriate.

Which one is better? Any preferences or better suggestions? I personally vote for the first solution. @wangyongqing0731 @chenguoguo @sw005320

chenguoguo commented 3 years ago

I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind:

  1. Under each toolkit, we have a script to handle the post-processing, which takes care of the toolkit-specific stuff, e.g., toolkits/kaldi/gigaspeech_asr_post_processing.sh
  2. The toolkit-specific script internally calls a common script, e.g., utils/asr_post_processing.sh, which does the actual work. This way, if we have to update the post-processing, we only have to update one place (a rough sketch of this layout follows below).
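Here is a rough Python sketch of that wrapper idea. The proposal above uses shell scripts; the function names and the handling of the Kaldi text format below are just an illustration, not actual repo files.

    from pathlib import Path

    def common_post_process(text: str) -> str:
        """Shared normalization (see the sketch earlier in this thread):
        remove hyphens and filler words from a plain transcript."""
        fillers = {"AH", "UM", "UH", "ER", "ERR", "HM"}
        tokens = [t for t in text.upper().replace("-", " ").split() if t not in fillers]
        return " ".join(tokens)

    def kaldi_post_process(in_path: Path, out_path: Path) -> None:
        """Toolkit-specific wrapper: Kaldi 'text' files look like
        '<utt-id> <transcript>', so keep the ID, normalize only the
        transcript, and delegate the actual work to the common function."""
        with in_path.open() as fin, out_path.open("w") as fout:
            for line in fin:
                utt_id, _, transcript = line.rstrip("\n").partition(" ")
                fout.write(f"{utt_id} {common_post_process(transcript)}\n")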

One thing that I haven't decided yet is whether or not we should provide the scoring tool. If yes, we can make sure that everyone is using the same tool for scoring, and the results will be comparable. But it definitely involves more work.

sw005320 commented 3 years ago

> I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind:
>
>   1. Under each toolkit, we have a script to handle the post-processing, which takes care of the toolkit-specific stuff, e.g., toolkits/kaldi/gigaspeech_asr_post_processing.sh
>   2. The toolkit-specific script internally calls a common script, e.g., utils/asr_post_processing.sh, which does the actual work. This way, if we have to update the post-processing, we only have to update one place.

Sounds good to me.

> One thing that I haven't decided yet is whether or not we should provide the scoring tool. If yes, we can make sure that everyone is using the same tool for scoring, and the results will be comparable. But it definitely involves more work.

The reason I stick to using sclite is that it is made by NIST and has been used in various official ASR benchmarks. sclite also has various analysis tools. As far as I know, Kaldi scoring produces the same result, so it should be no problem.

If we use a different toolkit, I recommend at least outputting the total number of words/sentences in the reference, and possibly the sub/del/ins breakdown, e.g.:

    dataset                                   Snt   Wrd     Corr  Sub  Del  Ins  Err   S.Err
    decode_asr_asr_model_valid.acc.ave/dev    2043  51075   92.9  4.5  2.6  2.1  9.2   65.6
    decode_asr_asr_model_valid.acc.ave/test   9627  175116  90.5  7.0  2.5  6.1  15.6  69.3

As long as the total number of words/sentences is the same, it is comparable (we can also easily detect if something is wrong in the data preparation or normalization when we check it). The sub/del/ins error breakdown can be used to detect some DP matching issues in the edit distance computation, and some format errors (e.g., in the above case, there are significantly large insertion errors in the test set; we may have some alignment or reference issues, and I actually found them based on this number and already reported them to you).
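For reference, here is a small Python sketch (not sclite) of how a Corr/Sub/Del/Ins breakdown like the one above can be obtained from a Levenshtein alignment; it can serve as a sanity check when comparing numbers from different scoring tools.

    def wer_breakdown(ref: list[str], hyp: list[str]) -> dict[str, int]:
        """Count correct/substituted/deleted/inserted words via edit-distance DP."""
        n, m = len(ref), len(hyp)
        # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = i
        for j in range(1, m + 1):
            dp[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        # Backtrace to attribute each aligned position to cor/sub/del/ins.
        counts = {"cor": 0, "sub": 0, "del": 0, "ins": 0}
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                counts["sub" if ref[i - 1] != hyp[j - 1] else "cor"] += 1
                i, j = i - 1, j - 1
            elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
                counts["del"] += 1
                i -= 1
            else:
                counts["ins"] += 1
                j -= 1
        return counts

    # One substitution and one insertion against a 4-word reference:
    print(wer_breakdown("THIS IS A TEST".split(), "THIS IS A QUICK EXAM".split()))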

dophist commented 3 years ago

I just added a simple scoring tool via https://github.com/SpeechColab/GigaSpeech/pull/35 , it uses sclite to evaluate REF and HYP.

Before evaluation, the tool applies the very simple text processing that we discussed in this issue, and I think we should keep this processing simple and stable after release.

Besides, the tool lives in utils/ instead of toolkits/xxx/, because sclite is framework-independent. Recipe developers/researchers can use it when they really want an apples-to-apples evaluation comparison.

dophist commented 3 years ago

We finally provide a recommended scoring script based on sclite: https://github.com/SpeechColab/GigaSpeech/blob/main/utils/gigaspeech_scoring.py. Researchers may use this tool if they want consistent comparisons across different systems.

We'd better leave this issue open for a while, so people can read the above discussion.