deepgram / kur

Descriptive Deep Learning

error when using uppercase transcripts #35

Closed michaelcapizzi closed 7 years ago

michaelcapizzi commented 7 years ago

This is perhaps less of an issue and more of a "heads up" to others.

I have transcripts in all uppercase letters, and this seems to have caused the following error:

InvalidArgumentError (see above for traceback): Labels length is zero in batch 0
         [[Node: CTCLoss = CTCLoss[ctc_merge_repeated=true, preprocess_collapse_repeated=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Log/_213, ToInt64/_215, GatherNd, Squeeze_2/_217)]]

So it appears that unless I'm missing a configuration setting somewhere, all transcripts must be lowercase.
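
The failure mode can be illustrated with a minimal sketch (the vocabulary and encoder below are hypothetical, not Kur's actual code): if the vocabulary is lowercase and characters missing from it are dropped, an all-uppercase transcript encodes to an empty label sequence, which is exactly what the CTC loss rejects.

```python
# Illustrative sketch only; this is not Kur's actual implementation.
vocab = ["a", "b", "c"]                       # hypothetical lowercase vocabulary
char_to_index = {c: i for i, c in enumerate(vocab)}

def encode(transcript):
    # Characters absent from the vocabulary are silently dropped.
    return [char_to_index[c] for c in transcript if c in char_to_index]

print(encode("abc"))  # [0, 1, 2] -- lowercase matches the vocabulary
print(encode("ABC"))  # []        -- every character is unknown, giving a
                      #              zero-length label, which CTC loss rejects
```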

scottstephenson commented 7 years ago

Are you by chance using an old vocab file (vocab.json)?

The speech data supplier should infer your vocabulary when it loads your dataset. When it does this, it creates a vocab.json in the current directory. I'm not able to test right this second, but maybe @ajsyp can get back with a more certain response.
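
For context, vocabulary inference along these lines might look like the following sketch (illustrative only; Kur's actual data supplier may differ):

```python
import json

# Hypothetical training transcripts.
transcripts = ["hello world", "deep learning"]

# Infer a character-level vocabulary from the training set...
vocab = sorted({c for t in transcripts for c in t})

# ...and persist it to vocab.json in the current working directory,
# as described above.
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
```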

michaelcapizzi commented 7 years ago

It's quite possible, @scottstephenson. I had a similar issue with a lingering norm.yml file after running your example.

Where would that vocab.yml file be?

scottstephenson commented 7 years ago

Sorry, I had a typo that I fixed in the original reply. It's actually vocab.json, and it would be produced in the current working directory where you run $ kur train speech.yml. Let me test a bit before I say too much more :)

ajsyp commented 7 years ago

No, the vocab.json file can be manually created, in which case it should be a JSON list of strings, where each string is a word in the vocabulary (or letter, as the case may be). But by default, as in the example speech Kurfile, it is simply inferred, in-memory, from the training set. I believe that Kur casts the vocabulary to lowercase, but it should be casting the transcripts as well. Since it is not, this is a bug. I will fix it shortly.
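
A manually created vocab.json, as described, would just be a JSON list such as ["a", "b", "c"]. The asymmetry being described can be sketched as follows (an illustration of the bug's shape, not the actual patch): if the vocabulary is lowercased when it is built, the transcripts must be lowercased the same way before lookup.

```python
# Hypothetical sketch of the fix; not Kur's actual code.
def build_vocab(transcripts):
    # The vocabulary is cast to lowercase when it is built.
    return sorted({c for t in transcripts for c in t.lower()})

def encode(transcript, char_to_index):
    # The fix: cast the transcript to lowercase the same way, so that
    # uppercase input still maps onto the lowercase vocabulary.
    return [char_to_index[c] for c in transcript.lower()
            if c in char_to_index]

vocab = build_vocab(["HELLO"])
char_to_index = {c: i for i, c in enumerate(vocab)}
print(encode("HELLO", char_to_index))  # [1, 0, 2, 2, 3] -- non-empty labels
                                       # once both sides agree on case
```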

ajsyp commented 7 years ago

This has been fixed in 9102b54.