TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

What should I do if I want to train a Japanese model? #219

Open ymzlygw opened 3 years ago

ymzlygw commented 3 years ago

Hi, my question is this: for English, if I understand correctly, the model directly outputs character indices, so each index can be mapped back to a character in the sequence. For Japanese, what is the output of the model, and how do I create the map between indices and Japanese kanji?

ymzlygw commented 3 years ago

I see the english_characters, but what about Japanese? And to get the japanese_characters, should the token_type be 'char' or 'bpe'? ENGLISH_CHARACTERS = [a-z],

nglehuy commented 3 years ago

@ymzlygw I think for Japanese, Korean, and Chinese we should use subwords instead of characters. If you can define a vocabulary that contains all the characters of the language, as in English, then you can use character mode. As far as I know, those languages have characters that are a combination of "some characters in the alphabet", so I think defining a character vocabulary file would be quite a lot of work for you.
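
To illustrate the character-mode option, here is a minimal sketch of how a character vocabulary and the index-to-character map could be built from training transcripts. This is plain Python, not TensorFlowASR's own code: the helper names (`build_char_vocab`, `make_maps`, `encode`, `decode`), the sample transcripts, and the choice to reserve index 0 for a blank token are all assumptions made for illustration.

```python
from collections import Counter


def build_char_vocab(transcripts, min_count=1):
    """Collect every character seen in the transcripts (hypothetical helper)."""
    counts = Counter(ch for line in transcripts for ch in line.strip())
    # Most frequent first, ties broken by code point, so the order is deterministic.
    return sorted(
        (c for c, n in counts.items() if n >= min_count),
        key=lambda c: (-counts[c], c),
    )


def make_maps(chars):
    """Build both lookup tables; index 0 is left free for a blank token (assumption)."""
    char_to_index = {c: i + 1 for i, c in enumerate(chars)}
    index_to_char = {i: c for c, i in char_to_index.items()}
    return char_to_index, index_to_char


def encode(text, char_to_index):
    """Map text to indices, skipping characters missing from the vocabulary."""
    return [char_to_index[c] for c in text if c in char_to_index]


def decode(indices, index_to_char):
    """Map indices back to text, ignoring unknown indices such as the blank."""
    return "".join(index_to_char.get(i, "") for i in indices)


# Toy transcripts standing in for a real Japanese training set.
transcripts = ["今日は天気がいい", "明日は雨です"]
chars = build_char_vocab(transcripts)
char_to_index, index_to_char = make_maps(chars)

ids = encode("今日は雨", char_to_index)
text = decode(ids, index_to_char)
```

On a real corpus the resulting list can run to several thousand kanji, which is the main reason subwords may be the more practical choice here.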

psyma commented 2 years ago

Hi, I tried to train a Chinese model and the results do not seem good. I followed the Conformer steps the same way as for English. Could you suggest how I could properly train a Chinese model? Thanks!