Open ymzlygw opened 3 years ago
I see the english_characters, but what about Japanese? And to get the japanese_characters, should token_type be 'char' or 'bpe'? ENGLISH_CHARACTERS = [a-z],
@ymzlygw I think for Japanese, Korean, and Chinese we should use subwords instead of characters. If you can define a vocabulary that contains all the characters of the language, as in English, then you can use character mode. As far as I know, characters in those languages are combinations of "some characters in the alphabet", so the set is large and defining a character vocabulary file would be a lot of work.
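To get a feel for how large a character vocabulary would be, one option is to collect the unique characters from the training transcripts. The following is only a minimal sketch: the function name `build_char_vocab`, the `min_count` parameter, and the sample transcripts are all hypothetical and not part of the library.

```python
from collections import Counter


def build_char_vocab(transcripts, min_count=1):
    """Collect the unique non-whitespace characters seen in the transcripts.

    Hypothetical helper for illustration only; it is not a library function.
    Characters seen fewer than `min_count` times are dropped.
    """
    counts = Counter(
        ch
        for line in transcripts
        for ch in line.strip()
        if not ch.isspace()
    )
    # Sort so that the resulting vocabulary order is reproducible.
    return sorted(ch for ch, n in counts.items() if n >= min_count)


# Made-up sample transcripts (assumption: one utterance per string).
transcripts = ["こんにちは世界", "日本語の音声認識", "世界の言語"]

vocab = build_char_vocab(transcripts)
print(len(vocab), vocab)
```

On a real corpus the count can run into the thousands once kanji are included, which is the practical reason subwords may be easier to manage than a hand-written character file.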
Hi, I tried to train a Chinese model and the results were not good. I followed the same Conformer steps as for English. Do you have any suggestions on how I could properly train a Chinese model? Thanks!
Hi, my question is this: for English, if I understand correctly, the model outputs the index of a character directly, so it can map between characters and index sequences. For Japanese, what is the output of the model, and how do I create a map between indices and Japanese kanji?
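For what it's worth, the index mapping itself does not depend on the language: each entry in the vocabulary, whether a Latin letter or a kanji, is assigned an integer. Here is a rough sketch of that idea, assuming index 0 is reserved for a blank token. The helper names (`make_index_maps`, `encode`, `decode`) and the tiny vocabulary are made up for illustration and are not the library's API.

```python
def make_index_maps(vocab, blank="<blank>"):
    """Build index<->token lookups, reserving index 0 for a blank token.

    Reserving index 0 for blank is an assumption made for this sketch.
    """
    index_to_token = [blank] + list(vocab)
    token_to_index = {tok: i for i, tok in enumerate(index_to_token)}
    return index_to_token, token_to_index


def encode(text, token_to_index):
    """Map each character of `text` to its integer index."""
    return [token_to_index[ch] for ch in text]


def decode(indices, index_to_token, blank_index=0):
    """Map indices back to characters, skipping the blank index."""
    return "".join(index_to_token[i] for i in indices if i != blank_index)


# Hypothetical five-character vocabulary, for illustration only.
vocab = ["日", "本", "語", "漢", "字"]
index_to_token, token_to_index = make_index_maps(vocab)

ids = encode("日本語", token_to_index)
print(ids)
print(decode(ids, index_to_token))
print(decode([0, 4, 5, 0], index_to_token))
```

With subwords the same lookup applies, except that each index stands for a multi-character piece rather than a single character.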