Slyne / ctc_decoder

A CTC decoder for both online and offline ASR models

Add hotwords #11

Closed FieldsMedal closed 1 year ago

FieldsMedal commented 1 year ago

We tested hotwords on speech_asr_aishell1_hotwords_testsets.

  1. Acoustic model: a small Conformer model for AIShell

  2. Hotwords weight: hotwords.tar.gz

  3. Test method: please refer to the README of this repository (TODO)

Latency and CER

| model (FP16)                | Latency (s) | CER   |
|-----------------------------|-------------|-------|
| offline model               | 5.5921      | 13.85 |
| offline model with hotwords | 5.6401      | 12.16 |

offline model: https://github.com/wenet-e2e/wenet/tree/main/runtime/gpu/model_repo

offline model with hotwords (TODO):

Decoding result

| Label | hotwords | pred w/o hotwords | pred w/ hotwords |
|---|---|---|---|
| 以及拥有陈露的女单项目 | 陈露 | 以及拥有陈鹭的女单项目 | 以及拥有陈露的女单项目 |
| 庞清和佟健终于可以放心地考虑退役的事情了 | 庞清, 佟健 | 庞青董建终于可以放心地考虑退役的事情了 | 庞清佟健终于可以放心地考虑退役的事情了 |
| 赵继宏老板电器做厨电已经三十多年了 | 赵继宏 | 赵继红老板电器做厨店已经三十多年了 | 赵继宏老板电器做厨电已经三十多年了 |
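The corrections above (e.g. 陈鹭 → 陈露) come from boosting beams that are heading toward a hotword during CTC prefix beam search. A minimal, illustrative sketch of that idea — not this repository's actual implementation; the `weight` parameter and the suffix-matching scheme are assumptions:

```python
def hotword_bonus(prefix, hotwords, weight=3.0):
    """Score bonus for a partial transcription during beam search.

    Rewards the longest suffix of `prefix` that is a prefix of any
    hotword, so beams heading toward a hotword are kept alive.
    """
    bonus = 0.0
    for hw in hotwords:
        for n in range(len(hw), 0, -1):  # try the longest match first
            if prefix.endswith(hw[:n]):
                bonus = max(bonus, weight * n)
                break
    return bonus
```

With `["陈露"]` as the hotword list, a beam ending in 陈露 gets a larger bonus than one ending in 陈, and a beam ending in 鹭 gets none, which is the direction the table above shows.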
yuekaizhang commented 1 year ago

Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

FieldsMedal commented 1 year ago

> Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Hotwords weight and n-gram file: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/models. Decoding results: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/results.


> Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

Hotwords result on speech_asr_aishell1_hotwords_testsets:

| model (FP16) | Latency (s) | CER | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| offline model w/o hotwords | 5.5921 | 13.85 | 0.27 | 0.99 | 0.43 |
| offline model w/ hotwords | 5.6401 | 12.16 | 0.45 | 0.97 | 0.62 |
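The recall/precision/F1 above can be computed at the level of hotword occurrences. A sketch of one common convention (not necessarily the exact scoring script used for this table): recall = matched occurrences / occurrences in the labels, precision = matched occurrences / occurrences in the predictions.

```python
def hotword_prf(pairs, hotwords):
    """Occurrence-level hotword metrics over (label, prediction) pairs."""
    tp = n_ref = n_hyp = 0
    for label, pred in pairs:
        for hw in hotwords:
            ref, hyp = label.count(hw), pred.count(hw)
            n_ref += ref        # hotword occurrences in references
            n_hyp += hyp        # hotword occurrences in hypotheses
            tp += min(ref, hyp) # matched occurrences
    recall = tp / n_ref if n_ref else 0.0
    precision = tp / n_hyp if n_hyp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return recall, precision, f1
```

This matches the pattern in the table: boosting raises recall sharply (more hotwords actually produced) at a small cost in precision.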


> I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

Hotwords result on the AISHELL-1 test dataset:

| model (FP16) | RTF | CER |
|---|---|---|
| offline model w/o hotwords | 0.00437 | 4.6805 |
| offline model w/ hotwords | 0.00435 | 4.5831 |
| streaming model w/o hotwords | 0.01231 | 5.2777 |
| streaming model w/ hotwords | 0.01142 | 5.1926 |
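RTF (real-time factor) is decoding time divided by audio duration, so values around 0.004 mean decoding roughly 250x faster than real time. A quick sketch of the computation:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded.

    RTF < 1 is faster than real time; e.g. at RTF 0.00437, one hour of
    audio (3600 s) decodes in about 0.00437 * 3600 ~= 15.7 seconds.
    """
    return decode_seconds / audio_seconds
```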

Tested ENV

  • CPU: 40-core Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti

yuekaizhang commented 1 year ago

> Thanks. Would you mind uploading the decoding results w/ and w/o hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weight, n-gram file, decoding results, and other essentials would be a good choice.)

Hotwords weight and n-gram file: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/models. Decoding results: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/results.

  • The current n-gram order is 4, which only supports hotwords of length <= 4. If you want to configure longer hotwords, you can use a higher-order n-gram, but this will also increase the decoding time.
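That constraint (n-gram order caps the hotword length that can be fully scored) could be checked up front with a small pre-filter. This helper and its name are hypothetical, not part of this repository's API:

```python
def usable_hotwords(hotwords, ngram_order=4):
    """Split hotwords into those fully covered by the n-gram order and the rest.

    With a character-level 4-gram, only hotwords of length <= 4 can be
    fully scored; longer ones need a higher-order model (slower decoding).
    """
    ok = [hw for hw in hotwords if len(hw) <= ngram_order]
    too_long = [hw for hw in hotwords if len(hw) > ngram_order]
    return ok, too_long
```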

> Also, here https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

Hotwords result on speech_asr_aishell1_hotwords_testsets:

| model (FP16) | Latency (s) | CER | Recall | Precision | F1-score |
|---|---|---|---|---|---|
| offline model w/o hotwords | 5.5921 | 13.85 | 0.27 | 0.99 | 0.43 |
| offline model w/ hotwords | 5.6401 | 12.16 | 0.45 | 0.97 | 0.62 |

> I am also interested in the general test-set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

Hotwords result on the AISHELL-1 test dataset:

| model (FP16) | RTF | CER |
|---|---|---|
| offline model w/o hotwords | 0.00437 | 4.6805 |
| offline model w/ hotwords | 0.00435 | 4.5831 |
| streaming model w/o hotwords | 0.01231 | 5.2777 |
| streaming model w/ hotwords | 0.01142 | 5.1926 |

Tested ENV

  • CPU: 40-core Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti

Many thanks. The results look nice. I was wondering, for both the w/ and w/o hotwords cases, do we use this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa?

Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, while pure attention rescoring without any n-gram is 4.63 & 5.05. I am not sure what the results would look like if you built the arpa from the AISHELL train set. I thought they used this 3-gram arpa here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa.

FieldsMedal commented 1 year ago

> Many thanks. The results look nice. I was wondering, for both the w/ and w/o hotwords cases, do we use this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa?

> Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, while pure attention rescoring without any n-gram is 4.63 & 5.05. I am not sure what the results would look like if you built the arpa from the AISHELL train set. I thought they used this 3-gram arpa here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa.

  1. init_kenlm.arpa is used to initialize the Scorer, because our hotwords boosting depends on Scorer::make_ngram; any language model trained by KenLM is fine. When decoding with hotwords, put the unique hotwords of each recording into batch_hotwords; when decoding without hotwords, put None into batch_hotwords. Whether to add the hotwords score is controlled by this. If the user also wants to add the n-gram score, set use_ngram_score to true.
  2. Our pretrained model is from here: https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.md, trained on the AISHELL dataset. Our results on the AISHELL-1 test dataset were obtained with the FP16 ONNX model, with ctc_weight 0.3 and reverse_weight 0.3. These settings may have some impact.
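The batch_hotwords convention in point 1 can be illustrated with a tiny helper. The name `build_batch_hotwords` and this shape are hypothetical, for illustration only; the repository's actual API may differ:

```python
def build_batch_hotwords(per_utt_hotwords, enable):
    """Build the per-utterance hotword argument described above.

    per_utt_hotwords -- list of hotword lists, one per recording
    enable           -- True: pass each recording's unique hotwords;
                        False: pass None to disable hotword boosting
    """
    if not enable:
        return [None] * len(per_utt_hotwords)
    # Deduplicate per recording; sort only to make the output deterministic.
    return [sorted(set(hws)) for hws in per_utt_hotwords]
```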
FieldsMedal commented 1 year ago

In the latest commit, we renamed batch_hotwords_scorer to hotwords_scorer. If you have free time, please help review this PR.

Slyne commented 1 year ago

@FieldsMedal Thanks! One more question: the output of test_zh.py for "Test hotwords boosting with word-level language models during ctc prefix beam search" is

```
INFO:root:Test hotwords boosting with word-level language models during ctc prefix beam search
INFO:root:('', '一', '换', '一首', '极点晚', '几点啦', '极点', '几点', '', '几', '晚', '极')
```

Not sure if the above result is expected?

Update: Should be fine. It is the user's responsibility to ensure the vocabulary contains space_id.
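That responsibility could be enforced with a quick check before constructing the decoder. This is a sketch under the assumption that the vocabulary is a plain list of token strings; the function name is mine, not the repository's:

```python
def check_space_id(vocabulary, space_token=" "):
    """Return the index of the space token, raising if it is missing.

    Word-level hotword boosting needs a space token in the vocabulary,
    so failing fast here is friendlier than silently odd decodes.
    """
    try:
        return vocabulary.index(space_token)
    except ValueError:
        raise ValueError(
            f"vocabulary must contain the space token {space_token!r} "
            "for word-level hotword boosting")
```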

Slyne commented 1 year ago

Thanks again! Really great feature! @FieldsMedal