NervanaSystems / deepspeech

DeepSpeech neon implementation
Apache License 2.0

error in loader: alphabet does not consist of unique chars #32

anonymous-u closed this issue 7 years ago

anonymous-u commented 7 years ago

I am trying to train my own model, so I need to change the alphabet in the train.py file. When I replace the alphabet with Russian Cyrillic characters, I get the following error:

Traceback (most recent call last):
  File "deepspeech/speech/train.py", line 133, in <module>
    train = DataLoader(backend=be, config=train_cfg_dict)
  File "home/ubuntu/neon/.venv2/local/lib/python2.7/site-packages/aeon/dataloader.py", line 66, in __init__
    self.loader = self._start(json.dumps(config), backend)
  File "/home/ubuntu/neon/.venv2/local/lib/python2.7/site-packages/aeon/dataloader.py", line 123, in _start
    self._raise_loader_error()
  File "/home/ubuntu/neon/.venv2/local/lib/python2.7/site-packages/aeon/dataloader.py", line 106, in _raise_loader_error
    self.loaderlib.get_errormessage()
aeon.dataloader.LoaderRuntimeError: error in loader: alphabet does not consist of unique chars 'АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХЦЧШЩЪЬЫЭЮЯ

When I print the config variable in the _start() function of aeon/dataloader.py, I get:

{"macrobatch_size": 1, "manifest_filename": "/home/ubuntu/myfolder/train2.csv", "transcription": {"alphabet": "'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0416\u0417\u0418\u0419\u041a\u041b\u041c\u041d\u041e\u04e8\u041f\u0420\u0421\u0422\u0423\u04ae\u0424\u0425\u0426\u0427\u0428\u0429\u042a\u042c\u042b\u042d\u042e\u042f ", "pack_for_ctc": true, "max_length": 1300}, "audio": {"max_duration": "30 seconds", "frame_stride": ".01 seconds", "num_filters": 13, "sample_freq_hz": 16000, "frame_length": ".025 seconds", "feature_type": "mfsc"}, "type": "audio,transcription", "minibatch_size": 1}

I think it doesn't recognize the \u escape sequences. What should I do?
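A quick check can help narrow this down. The \u escapes themselves are ordinary JSON: json.dumps escapes non-ASCII characters by default, and they decode back to the same characters. The sketch below (Python 3) also compares uniqueness per code point with uniqueness per UTF-8 byte. It rests on an assumption that this thread does not confirm: that the loader compares the alphabet byte by byte rather than character by character. The helper name `duplicates` is made up for illustration.

```python
import json
from collections import Counter

# Alphabet copied from the printed config: apostrophe, letters, trailing space.
ALPHABET = (
    "'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0416\u0417\u0418\u0419"
    "\u041a\u041b\u041c\u041d\u041e\u04e8\u041f\u0420\u0421\u0422\u0423"
    "\u04ae\u0424\u0425\u0426\u0427\u0428\u0429\u042a\u042c\u042b\u042d"
    "\u042e\u042f "
)


def duplicates(items):
    """Return the sorted items that occur more than once (hypothetical helper)."""
    return sorted(k for k, n in Counter(items).items() if n > 1)


# json.dumps uses ensure_ascii=True by default, which produces the \u escapes;
# parsing the JSON again yields the original string.
round_tripped = json.loads(json.dumps(ALPHABET))

# Uniqueness seen as Unicode code points.
code_point_dups = duplicates(ALPHABET)

# Uniqueness seen as raw UTF-8 bytes (ASSUMPTION: this is how the loader
# might look at the alphabet; not confirmed in this thread).
byte_dups = duplicates(ALPHABET.encode("utf-8"))

print("round trip intact:", round_tripped == ALPHABET)
print("duplicate code points:", code_point_dups)
print("duplicate UTF-8 bytes:", [hex(b) for b in byte_dups])
```

Each Cyrillic letter here takes two bytes in UTF-8, and most of them share the lead byte 0xD0. The alphabet is therefore unique as characters but not as bytes, which would be consistent with the error message if the byte-wise assumption holds.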

pankaj2701 commented 7 years ago

I faced the same issue while training a model for Hindi, where I am trying to do phoneme classification. My workaround was to map each character to a character in the English alphabet and convert all the training transcripts according to that map. Finally, when decoding, I convert the results back to the original character set.
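A minimal sketch of that mapping workaround, in Python 3. Everything here is illustrative: the names `build_maps`, `encode`, `decode` and `POOL` are hypothetical, and the choice of uppercase ASCII letters and digits as placeholders is an assumption, not something taken from the deepspeech or aeon code.

```python
import string

# Alphabet from the original report: apostrophe, letters, trailing space.
ALPHABET = (
    "'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0416\u0417\u0418\u0419"
    "\u041a\u041b\u041c\u041d\u041e\u04e8\u041f\u0420\u0421\u0422\u0423"
    "\u04ae\u0424\u0425\u0426\u0427\u0428\u0429\u042a\u042c\u042b\u042d"
    "\u042e\u042f "
)

# Pool of single-byte placeholder characters (an arbitrary choice).
POOL = string.ascii_uppercase + string.digits


def build_maps(alphabet, pool=POOL):
    """Build a one-to-one map from alphabet characters to ASCII characters."""
    free = [c for c in pool if c not in alphabet]
    forward = {}
    for ch in alphabet:
        if ord(ch) < 128:
            forward[ch] = ch  # ASCII characters map to themselves
        elif free:
            forward[ch] = free.pop(0)
        else:
            raise ValueError("placeholder pool is too small")
    backward = {v: k for k, v in forward.items()}
    if len(backward) != len(forward):
        raise ValueError("mapping is not one-to-one")
    return forward, backward


def encode(text, forward):
    """Convert a transcript to placeholders; raises KeyError on unknown chars."""
    return "".join(forward[ch] for ch in text)


def decode(text, backward):
    """Convert decoder output back to the original character set."""
    return "".join(backward[ch] for ch in text)


forward, backward = build_maps(ALPHABET)
ascii_alphabet = encode(ALPHABET, forward)  # ASCII-only stand-in alphabet

sample = "\u041f\u0420\u0418\u0412\u0415\u0422 \u041c\u0418\u0420"
restored = decode(encode(sample, forward), backward)
print(restored == sample)
```

With this approach, the transcripts in the manifest would be rewritten with `encode`, the ASCII-only `ascii_alphabet` would take the place of the original alphabet in the config, and `decode` would be applied to the model output. Transcripts must contain only characters from the alphabet (for example, uppercased beforehand), since `encode` raises KeyError on anything else.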

tyler-nervana commented 7 years ago

Closing since this issue is a duplicate of https://github.com/NervanaSystems/aeon/issues/51.