Closed michaelcapizzi closed 7 years ago
Are you using an old vocab file (vocab.json) by chance?
The speech data supplier should infer your vocabulary when it loads your dataset. When it does this, it creates a vocab.json in the current directory. I'm not able to test right this second, but maybe @ajsyp can get back with a more certain response.
It's quite possible, @scottstephenson. I had a similar issue with a lingering norm.yml file after running your example.
Where would that vocab.yml file be?
Sorry, I had a typo that I fixed in the original reply. It's actually vocab.json, and it would be produced in the current working directory where you run $ kur train speech.yml. Let me test a bit before I say too much more :)
No, the vocab.json file can be manually created, in which case it should be a JSON list of strings, where each string is a word in the vocabulary (or a letter, as the case may be). But by default, as in the example speech Kurfile, the vocabulary is simply inferred, in memory, from the training set. I believe that Kur casts the vocabulary to lowercase, but it should be casting the transcripts to lowercase as well. Since it is not, this is a bug. I will fix it shortly.
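For anyone who wants to create the vocab.json by hand in the meantime, here is a minimal sketch of the format described above: a JSON list of strings, one per letter. The function names build_vocab and write_vocab are hypothetical helpers for illustration, not part of Kur, and the lowercasing simply mirrors what the vocabulary is believed to go through.

```python
import json


def build_vocab(transcripts):
    """Return a sorted list of the distinct lowercase characters in the transcripts."""
    chars = set()
    for text in transcripts:
        # Lowercase first so "A" and "a" map to a single vocabulary entry.
        chars.update(text.lower())
    return sorted(chars)


def write_vocab(transcripts, path="vocab.json"):
    """Write the vocabulary as a JSON list of strings (hypothetical helper)."""
    vocab = build_vocab(transcripts)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    return vocab
```

Calling write_vocab(["HELLO", "hi"]) would write a five-letter list to vocab.json in the current directory. For a word-level vocabulary, the same idea applies with text.lower().split() in place of iterating over characters.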
This has been fixed in 9102b54.
This is perhaps less of an issue than a "heads up" to others.
I have transcripts with all uppercase letters, but this seems to have caused the following error:
So it appears that unless I'm missing a configuration setting somewhere, all transcripts must be lowercase.
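As a possible workaround, the transcripts can be lowercased before training. This is only a sketch: it assumes the transcripts live in a JSON-lines file with one JSON object per line and the transcript stored under a "text" key. Both the layout and the field name are assumptions here, so adjust them to match your own dataset. The function lowercase_transcripts is a hypothetical helper, not a Kur API.

```python
import json


def lowercase_transcripts(lines, field="text"):
    """Lowercase the transcript field of each JSON-lines entry.

    `field` is an assumed key name; change it to match your data.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if isinstance(entry.get(field), str):
            entry[field] = entry[field].lower()
        out.append(json.dumps(entry))
    return out
```

The returned strings can be written back out one per line to a new file, leaving the original transcripts untouched.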