vinh22032000 closed this 10 months ago
It has been a while, so my memory is blurry. But the point is that the ASR model was trained on LibriSpeech, so the LibriSpeech lexicon should be used. This code seems to be removing the suffixes; _B and _E seem to mean the beginning and the end of a word.
-Yuan
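For readers following along, here is a minimal sketch of stripping such markers from a phone sequence. It assumes the suffixes follow the Kaldi-style word-position convention (_B for word-begin, _I for word-internal, _E for word-end, _S for a single-phone word). The helper name `strip_position_suffix` and the sample phone string are hypothetical and do not come from the tutorial.

```python
# Hypothetical helper for illustration; not part of the tutorial code.
# Assumes Kaldi-style word-position markers on each phone.
POSITION_SUFFIXES = ("_B", "_I", "_E", "_S")


def strip_position_suffix(phone: str) -> str:
    """Return the base phone, dropping a trailing position marker if present."""
    for suffix in POSITION_SUFFIXES:
        if phone.endswith(suffix):
            return phone[: -len(suffix)]
    return phone  # phones without a marker (e.g. SIL) pass through unchanged


# Illustrative position-tagged pronunciation of a four-phone word.
tagged = "HH_B AH0_I L_I OW1_E".split()
base = [strip_position_suffix(p) for p in tagged]
print(" ".join(base))  # HH AH0 L OW1
```

Going the other way (adding the markers) is what makes a plain lexicon match a model whose phone set is position-dependent.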
Thank you.
Hi Yuan, while doing the inference steps of your tutorial, I found your suggestion to replace the content of lexicon.txt in speechocean762 with librispeech-lexicon.txt, along with this example cleaning code:
    with open("librispeech-lexicon.txt", 'r') as f:
        lexicon_raw = f.read()
    rows = lexicon_raw.splitlines()
    clean_rows = [row.split() for row in rows]
    lexicon_dict_l = dict()
    for row in clean_rows:
        c_row = row.copy()
        key = c_row.pop(0)
        if len(c_row) == 1:
            c_row[0] = c_row[0] + '_S'
        if len(c_row) >= 2:
            c_row[0] = c_row[0] + '_B'
            c_row[-1] = c_row[-1] + '_E'
        if len(c_row) > 2:
            for i in range(1, len(c_row) - 1):
                c_row[i] = c_row[i] + '_I'
        val = " ".join(c_row)
        lexicon_dict_l[key] = val
    lexicon_dict_l
Can you please explain why we need to clean the lexicon, and what the suffixes _S, _B, and _E mean? Thank you.
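To make the question concrete, here is a sketch of what the quoted snippet appears to do to each lexicon entry, rewritten as two small functions and run on a few in-memory lines instead of the real file. As written, the snippet appends a marker to every phone: _S for a one-phone word, _B on the first phone, _E on the last, and _I on any phone in between. The function names and the sample lines are hypothetical; the lines only imitate the "WORD PHONE PHONE ..." layout of the lexicon. Unlike the original, this version skips blank or phone-less lines, which is an assumption about the desired behaviour (the original would raise an IndexError on a blank line).

```python
# Hypothetical re-sketch of the quoted snippet; names are not from the tutorial.

def tag_positions(phones):
    """Append a word-position marker to each phone of one pronunciation."""
    if len(phones) == 1:
        return [phones[0] + "_S"]          # single-phone word
    tagged = [p + "_I" for p in phones]    # default: word-internal
    tagged[0] = phones[0] + "_B"           # first phone: word-begin
    tagged[-1] = phones[-1] + "_E"         # last phone: word-end
    return tagged


def build_lexicon(lines):
    """Map each word to its position-tagged pronunciation string."""
    lexicon = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:                # assumption: skip blank/malformed lines
            continue
        word, phones = fields[0], fields[1:]
        # As in the quoted snippet, a later entry for the same word
        # overwrites an earlier one.
        lexicon[word] = " ".join(tag_positions(phones))
    return lexicon


# Illustrative lines in the lexicon's layout (not read from the real file).
sample = ["A AH0", "HELLO HH AH0 L OW1", "", "TO T UW1"]
lexicon = build_lexicon(sample)
for word, pron in lexicon.items():
    print(word, pron)
# A AH0_S
# HELLO HH_B AH0_I L_I OW1_E
# TO T_B UW1_E
```

Because a dict keeps one value per key, words with several pronunciations in the lexicon end up with only the last one listed, both here and in the quoted snippet.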