This PR fixes repeated language model scoring. When hotwords_scorer->is_character_based and ext_scorer->is_character_based() is false, the language model and hot word scores are calculated repeatedly. In fact, if the language model is word based, the scorer should only be called when space_id is detected. After the change, we tested every combination of scorer settings on the dataset.
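The intended calling rule can be sketched as follows. This is a minimal illustration, not the decoder's actual code: `should_call_scorer` and `count_scorer_calls` are hypothetical helpers, and the token ids are made up (only space_id = 45 comes from the settings used in the tests).

```python
def should_call_scorer(is_character_based: bool, token_id: int, space_id: int) -> bool:
    """Decide whether a scorer should be consulted for a newly appended token.

    A character-based scorer is consulted on every token; a word-based
    scorer is consulted only when a space closes the current word.
    """
    return is_character_based or token_id == space_id


def count_scorer_calls(token_ids, is_character_based, space_id=45):
    """Count how many times the scorer would be called over a token sequence."""
    return sum(should_call_scorer(is_character_based, t, space_id) for t in token_ids)


# Made-up token ids; 45 is the space_id from the test settings.
tokens = [12, 7, 45, 30, 45]
char_calls = count_scorer_calls(tokens, is_character_based=True)
word_calls = count_scorer_calls(tokens, is_character_based=False)
print(char_calls, word_calls)
```

With a word-based scorer the score is applied once per completed word, so applying it on every character would count the same word several times.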
First audio
Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (cutoff_prob is raised so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4, hot_words = {'换一': -3.40282e+38, '首歌': -100, '换首歌': 3.40282e+38}.
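For reference, the settings above can be written as plain Python data. This is only a sketch: the dictionary name `decoder_settings` is hypothetical and does not mirror the decoder's real call signature. The extreme weights of ±3.40282e+38 appear to correspond to the largest finite 32-bit float, which the snippet reconstructs for comparison.

```python
import struct

# Largest finite IEEE 754 single-precision value (bit pattern 0x7F7FFFFF).
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]

# Hypothetical container for the test settings; key names follow the PR text.
decoder_settings = {
    "beam_size": 10,
    "num_processes": 1,
    "blank_id": 0,
    "space_id": 45,
    "cutoff_prob": 1.0,  # raised so that spaces are generated
    "alpha": 0.5,
    "beta": 0.5,
    "window_length": 4,
}

# Hot word weights: one word strongly suppressed, one mildly penalized,
# one strongly boosted.
hot_words = {
    "换一": -FLOAT32_MAX,
    "首歌": -100.0,
    "换首歌": FLOAT32_MAX,
}

print(f"{FLOAT32_MAX:.5e}")
```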
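| No. | Model | Hot words is_character_based | Language model is_character_based | Decoding result (best path) |
| --- | --- | --- | --- | --- |
| 1 | Neither | * | * | 换一首歌 |
| 2 | Hot words | TRUE | * | `换首歌a<unk>` |
| 3 | Hot words | FALSE | * | `换首歌<space>A<space>爱'爱<unk>` |
| 4 | Language model | * | TRUE | 换一首歌 |
| 5 | Language model | * | FALSE | 换一首 |
| 6 | Hot words + language model | TRUE | TRUE | `换换首歌<unk>` |
| 7 | Hot words + language model | TRUE | FALSE | 一首 |
| 8 | Hot words + language model | FALSE | TRUE | `换首歌<space>A<space>爱'爱<unk>` |
| 9 | Hot words + language model | FALSE | FALSE | 换一首 |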
In No. 7 and No. 9 the hot words did not take effect. When the language model's is_character_based is false, any word generated between two spaces must be a 1-gram or a prefix of a 1-gram. The hot word '换首歌' is not among the 1-grams.
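The constraint described here can be illustrated with a small check. This is a sketch under stated assumptions: `is_word_or_prefix` is a hypothetical helper, and the unigram set below is made up for illustration, not taken from the language model used in the tests.

```python
def is_word_or_prefix(candidate: str, unigrams: set) -> bool:
    """Return True if candidate equals a unigram or is a prefix of one."""
    return any(word.startswith(candidate) for word in unigrams)


# Made-up vocabulary; the real 1-gram list comes from the language model.
unigrams = {"换", "一首", "首歌"}

allowed_prefix = is_word_or_prefix("一", unigrams)       # prefix of "一首"
allowed_word = is_word_or_prefix("首歌", unigrams)       # exact 1-gram
hot_word_allowed = is_word_or_prefix("换首歌", unigrams)  # not in the vocabulary
print(allowed_prefix, allowed_word, hot_word_allowed)
```

Under such a constraint, a hot word that is neither a 1-gram nor a prefix of one is pruned before its boost can take effect, which is consistent with rows 7 and 9.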
Second audio
Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (cutoff_prob is raised so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4, hot_words = {'极点': 550}. Set the space to <space> before compiling ctc_decoder.
Decoded output fragments for the second audio:
`<space><space><space><space>`
`<space>极点<space>`
`极点<space><space><space><space>`
`<space>是<space>是<space>是<space>`