Training a dataset with non-Latin characters

deepgram / kur

Training a dataset with non-Latin characters #98

Closed: cgrozev closed this issue 5 years ago

cgrozev commented 6 years ago

Hi. I decided to use Kur to train a model on a Russian-language corpus (about 8k transcribed input utterances). I had to increase the vocab parameter, but other than that, training started without problems. However, I wonder whether I need to make further changes to account for the Cyrillic alphabet. I just ran a test evaluation after one epoch, and the gibberish I got was in Latin characters.

Any ideas on what changes are required?

Regards,
Christo

scottstephenson commented 5 years ago

You'll need to change the vocab file so that it lists the Cyrillic characters, and feed it training data whose transcripts use exactly the characters in that vocab. Then it'll work just fine.
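
As a rough illustration of that advice, here is a minimal Python sketch that derives a character vocabulary from a set of transcripts and checks other transcripts against it. It does not call Kur at all. The helper names (`normalize`, `build_vocab`, `out_of_vocab`), the sample sentences, and the idea of serializing the vocab as a JSON list of characters are assumptions made for illustration; check the Kur documentation for the vocab format it actually expects.

```python
import json
import unicodedata


def normalize(text):
    # Hypothetical normalization: NFC form plus lowercasing, so that the
    # same letter always maps to the same code point.
    return unicodedata.normalize("NFC", text).lower()


def build_vocab(transcripts):
    # Collect every distinct character that appears in the transcripts.
    chars = set()
    for text in transcripts:
        chars.update(normalize(text))
    return sorted(chars)


def out_of_vocab(transcripts, vocab):
    # Return the characters that occur in the transcripts but not in vocab.
    allowed = set(vocab)
    return sorted(
        {ch for text in transcripts for ch in normalize(text) if ch not in allowed}
    )


# Made-up sample data standing in for the real training transcripts.
transcripts = ["Привет мир", "как дела"]
vocab = build_vocab(transcripts)

# Assumed serialization: a JSON list of characters, kept as readable UTF-8.
vocab_json = json.dumps(vocab, ensure_ascii=False)

# Latin text is flagged because none of its letters are in the Cyrillic vocab.
leftovers = out_of_vocab(["hello"], vocab)
```

If a check like `out_of_vocab` returns anything for the training or evaluation transcripts, either the vocab or the data needs adjusting. The vocab size setting mentioned in the question would presumably need to match `len(vocab)`.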