Slyne / ctc_decoder

A CTC decoder for both online and offline ASR models

Fix language model repeated scoring #12

Open FieldsMedal opened 1 year ago

FieldsMedal commented 1 year ago

This PR fixes repeated language model scoring. When hotwords_scorer->is_character_based and ext_scorer->is_character_based() is false, the language model and hot word scores are calculated repeatedly. In fact, if the language model is word based, it only calls the scorer when space_id is detected. After the modification, we tested all combinations on the dataset.
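The gating described above can be sketched roughly as follows. This is an illustration only, not the code changed in this PR: `should_score` and `count_scorer_calls` are hypothetical helpers, the token ids other than space_id = 45 are made up, and blank handling is left out. The idea it shows is that each scorer is gated by its own is_character_based flag, so a word-based scorer is queried once per completed word while a character-based one is queried on every token.

```python
def should_score(is_character_based: bool, token_id: int, space_id: int) -> bool:
    """Return True when a scorer should be queried after appending token_id.

    Assumption for this sketch: a character-based scorer is queried on every
    emitted token, a word-based scorer only when space_id closes a word.
    """
    return is_character_based or token_id == space_id


def count_scorer_calls(token_ids, space_id, lm_char_based, hw_char_based):
    """Count queries per scorer along one path, gating each scorer independently."""
    calls = {"lm": 0, "hotwords": 0}
    for token_id in token_ids:
        if should_score(lm_char_based, token_id, space_id):
            calls["lm"] += 1
        if should_score(hw_char_based, token_id, space_id):
            calls["hotwords"] += 1
    return calls


# Hypothetical path: two 2-character words, each followed by space_id = 45.
path = [7, 8, 45, 9, 10, 45]
calls = count_scorer_calls(path, space_id=45, lm_char_based=False, hw_char_based=True)
print(calls)
```

With a word-based language model and character-based hot words, the language model is queried only at the two spaces, while the hot word scorer is queried on all six tokens.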

First audio

Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (cutoff_prob is increased so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4. hot_words = {'换一': -3.40282e+38, '首歌': -100, '换首歌': 3.40282e+38}

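| No. | Model | Hot words is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | None used | * | * | 换一首歌 |
| 2 | Hot words | TRUE | * | 换首歌a<unk> |
| 3 | Hot words | FALSE | * | 换首歌<space>A<space>爱'爱<unk> |
| 4 | Language model | * | TRUE | 换一首歌 |
| 5 | Language model | * | FALSE | 换一首 |
| 6 | Hot words + language model | TRUE | TRUE | 换换首歌<unk> |
| 8 | Hot words + language model | FALSE | TRUE | 换首歌<space>A<space>爱'爱<unk> |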

In No. 7 and No. 9 the hot words did not take effect. When the language model's is_character_based is false, a word generated between two spaces must be in the 1-grams or be a prefix of a 1-gram. The hot word '换首歌' is not in the 1-grams.
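The constraint above can be illustrated with a small check. This is only a sketch: `can_extend_to_unigram` is a hypothetical helper, and the `unigrams` set is made up for the example, not taken from the language model used in the tests.

```python
def can_extend_to_unigram(word: str, unigrams: set) -> bool:
    """Return True if word is a 1-gram or a prefix of at least one 1-gram."""
    # str.startswith also covers the exact-match case.
    return any(unigram.startswith(word) for unigram in unigrams)


# Hypothetical vocabulary, invented for illustration only.
unigrams = {"换一首", "歌", "几点"}

print(can_extend_to_unigram("换一", unigrams))    # prefix of a 1-gram
print(can_extend_to_unigram("换首歌", unigrams))  # neither a 1-gram nor a prefix
```

Under this assumed vocabulary, '换首歌' can never be completed between two spaces, so a hot word score attached to it would have no effect.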

Second audio

Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (cutoff_prob is increased so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4. hot_words = {'极点': 550}. The space symbol was set to <space> before compiling ctc_decoder.

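| No. | Model | Hot words is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | None used | * | * | 几点了 |
| 2 | Hot words | TRUE | * | 极点极点点了 |
| 3 | Hot words | FALSE | * | 极点<space><space><space><space> |
| 4 | Language model | * | TRUE | 几点啦 |
| 5 | Language model | * | FALSE | 几点啦 |
| 6 | Hot words + language model | TRUE | TRUE | 极点极点极点啦 |
| 7 | Hot words + language model | TRUE | FALSE | 极点<space>极点<space>极点 |
| 8 | Hot words + language model | FALSE | TRUE | 极点<space><space><space><space> |
| 9 | Hot words + language model | FALSE | FALSE | 极点<space>是<space>是<space>是<space> |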