Slyne / ctc_decoder

A CTC decoder for both online and offline ASR models

Add hotwords #11

Closed FieldsMedal closed 1 year ago

FieldsMedal commented 1 year ago

We tested hotwords on speech_asr_aishell1_hotwords_testsets.

  1. Acoustic model: a small Conformer model for AIShell

  2. Hotwords weight: hotwords.tar.gz

  3. Test method: please refer to the README of this repository (TODO)

Latency and CER

| model (FP16)                | Latency (s) | CER   |
|-----------------------------|-------------|-------|
| offline model               | 5.5921      | 13.85 |
| offline model with hotwords | 5.6401      | 12.16 |

offline model: https://github.com/wenet-e2e/wenet/tree/main/runtime/gpu/model_repo

offline model with hotwords (TODO):

Decoding result

| Label | hotwords | pred w/o hotwords | pred w/ hotwords |
|---|---|---|---|
| 以及拥有陈露的女单项目 | 陈露 | 以及拥有陈鹭的女单项目 | 以及拥有陈露的女单项目 |
| 庞清和佟健终于可以放心地考虑退役的事情了 | 庞清, 佟健 | 庞青董建终于可以放心地考虑退役的事情了 | 庞清佟健终于可以放心地考虑退役的事情了 |
| 赵继宏老板电器做厨电已经三十多年了 | 赵继宏 | 赵继红老板电器做厨店已经三十多年了 | 赵继宏老板电器做厨电已经三十多年了 |
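The corrections above (e.g. 陈鹭 → 陈露) come from boosting beams that are heading toward a hotword during CTC prefix beam search. A minimal, illustrative sketch of that idea — not this repository's actual implementation; the `weight` parameter and the suffix-matching scheme are assumptions:

```python
def hotword_bonus(prefix, hotwords, weight=3.0):
    """Score bonus for a partial transcription during beam search.

    Rewards the longest suffix of `prefix` that is a prefix of any
    hotword, so beams heading toward a hotword are kept alive.
    """
    bonus = 0.0
    for hw in hotwords:
        for n in range(len(hw), 0, -1):  # try the longest match first
            if prefix.endswith(hw[:n]):
                bonus = max(bonus, weight * n)
                break
    return bonus
```

With `["陈露"]` as the hotword list, a beam ending in 陈露 gets a larger bonus than one ending in 陈, and a beam ending in 鹭 gets none, which is the direction the table above shows.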
yuekaizhang commented 1 year ago

Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

FieldsMedal commented 1 year ago

> Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Hotwords weight and n-gram file: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/models. Decoding results: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/results.


> Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

Hotwords result on speech_asr_aishell1_hotwords_testsets:

| model (FP16) | Latency (s) | CER | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| offline model w/o hotwords | 5.5921 | 13.85 | 0.27 | 0.99 | 0.43 |
| offline model w/ hotwords | 5.6401 | 12.16 | 0.45 | 0.97 | 0.62 |
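The recall/precision/F1 above can be computed at the level of hotword occurrences. A sketch of one common convention (not necessarily the exact scoring script used for this table): recall = matched occurrences / occurrences in the labels, precision = matched occurrences / occurrences in the predictions.

```python
def hotword_prf(pairs, hotwords):
    """Occurrence-level hotword metrics over (label, prediction) pairs."""
    tp = n_ref = n_hyp = 0
    for label, pred in pairs:
        for hw in hotwords:
            ref, hyp = label.count(hw), pred.count(hw)
            n_ref += ref        # hotword occurrences in references
            n_hyp += hyp        # hotword occurrences in hypotheses
            tp += min(ref, hyp) # matched occurrences
    recall = tp / n_ref if n_ref else 0.0
    precision = tp / n_hyp if n_hyp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return recall, precision, f1
```

This matches the pattern in the table: boosting raises recall sharply (more hotwords actually produced) at a small cost in precision.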


> I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

Hotwords result on the AISHELL-1 test dataset:

| model (FP16) | RTF | CER |
|---|---|---|
| offline model w/o hotwords | 0.00437 | 4.6805 |
| offline model w/ hotwords | 0.00435 | 4.5831 |
| streaming model w/o hotwords | 0.01231 | 5.2777 |
| streaming model w/ hotwords | 0.01142 | 5.1926 |
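RTF (real-time factor) is decoding time divided by audio duration, so values around 0.004 mean decoding roughly 250x faster than real time. A quick sketch of the computation:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded.

    RTF < 1 is faster than real time; e.g. at RTF 0.00437, one hour of
    audio (3600 s) decodes in about 0.00437 * 3600 ~= 15.7 seconds.
    """
    return decode_seconds / audio_seconds
```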

Tested ENV

  • CPU: 40-core Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti

yuekaizhang commented 1 year ago

> Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Hotwords weight and n-gram file: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/models. Decoding results: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/results.

  • The current n-gram order is 4, which only supports hotwords of length <= 4. If you want to configure longer hotwords, you can use a higher-order n-gram, but this will also increase the decoding time.
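That constraint (n-gram order caps the hotword length that can be fully scored) could be checked up front with a small pre-filter. This helper and its name are hypothetical, not part of this repository's API:

```python
def usable_hotwords(hotwords, ngram_order=4):
    """Split hotwords into those fully covered by the n-gram order and the rest.

    With a character-level 4-gram, only hotwords of length <= 4 can be
    fully scored; longer ones need a higher-order model (slower decoding).
    """
    ok = [hw for hw in hotwords if len(hw) <= ngram_order]
    too_long = [hw for hw in hotwords if len(hw) > ngram_order]
    return ok, too_long
```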

> Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

Hotwords result on speech_asr_aishell1_hotwords_testsets:

| model (FP16) | Latency (s) | CER | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| offline model w/o hotwords | 5.5921 | 13.85 | 0.27 | 0.99 | 0.43 |
| offline model w/ hotwords | 5.6401 | 12.16 | 0.45 | 0.97 | 0.62 |

> I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

Hotwords result on the AISHELL-1 test dataset:

| model (FP16) | RTF | CER |
|---|---|---|
| offline model w/o hotwords | 0.00437 | 4.6805 |
| offline model w/ hotwords | 0.00435 | 4.5831 |
| streaming model w/o hotwords | 0.01231 | 5.2777 |
| streaming model w/ hotwords | 0.01142 | 5.1926 |

Tested ENV

  • CPU: 40-core Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti

Many thanks. The results look nice. I was wondering, for both the w/ and w/o hotwords cases, do we use this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa?

Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, while pure attention rescoring without any n-gram is 4.63 & 5.05. I am not sure what the results would look like if you built the arpa from the AISHELL train set. I thought they used this 3-gram arpa here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa.

FieldsMedal commented 1 year ago

> Many thanks. The results look nice. I was wondering, for both the w/ and w/o hotwords cases, do we use this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa?

> Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, while pure attention rescoring without any n-gram is 4.63 & 5.05. I am not sure what the results would look like if you built the arpa from the AISHELL train set. I thought they used this 3-gram arpa here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa.

  1. init_kenlm.arpa is used to initialize the Scorer, because our hotwords boosting depends on Scorer::make_ngram; any language model trained by KenLM is fine. When decoding with hotwords, put the unique hotwords of each recording into batch_hotwords; when decoding without hotwords, put None into batch_hotwords. Whether to add the hotwords score is controlled by this. If the user also wants to add the n-gram score, set use_ngram_score to true.
  2. Our pretrained model is from here: https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.md, trained on the AISHELL dataset. Our results on the AISHELL-1 test dataset were obtained with the FP16 ONNX model, with ctc_weight 0.3 and reverse_weight 0.3. These settings may have some impact.
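The batch_hotwords convention in point 1 can be illustrated with a tiny helper. The name `build_batch_hotwords` and this shape are hypothetical, for illustration only; the repository's actual API may differ:

```python
def build_batch_hotwords(per_utt_hotwords, enable):
    """Build the per-utterance hotword argument described above.

    per_utt_hotwords -- list of hotword lists, one per recording
    enable           -- True: pass each recording's unique hotwords;
                        False: pass None to disable hotword boosting
    """
    if not enable:
        return [None] * len(per_utt_hotwords)
    # Deduplicate per recording; sort only to make the output deterministic.
    return [sorted(set(hws)) for hws in per_utt_hotwords]
```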
FieldsMedal commented 1 year ago

In the latest commit, we renamed batch_hotwords_scorer to hotwords_scorer. If you have free time, please help review this PR.

Slyne commented 1 year ago

@FieldsMedal Thanks! One more question: the output of test_zh.py for "Test hotwords boosting with word-level language models during ctc prefix beam search" is

```
INFO:root:Test hotwords boosting with word-level language models during ctc prefix beam search
INFO:root:('', '一', '换', '一首', '极点晚', '几点啦', '极点', '几点', '', '几', '晚', '极')
```

Not sure if the above result is expected?

Update: Should be fine. It is the user's responsibility to ensure the vocabulary contains space_id.
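That responsibility could be enforced with a quick check before constructing the decoder. This is a sketch under the assumption that the vocabulary is a plain list of token strings; the function name is mine, not the repository's:

```python
def check_space_id(vocabulary, space_token=" "):
    """Return the index of the space token, raising if it is missing.

    Word-level hotword boosting needs a space token in the vocabulary,
    so failing fast here is friendlier than silently odd decodes.
    """
    try:
        return vocabulary.index(space_token)
    except ValueError:
        raise ValueError(
            f"vocabulary must contain the space token {space_token!r} "
            "for word-level hotword boosting")
```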

Slyne commented 1 year ago

Thanks again! Really great feature! @FieldsMedal