deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Truth Data generation Error #103

Closed: alamnasim closed this issue 5 years ago

alamnasim commented 5 years ago

I am feeding my own data to Kur, but the truth data it generates is incorrect: the sentences contain meaningless words that look like strings of random characters.

One more problem: when I run Kur on the lsdc data (with the provider section deleted from train) on a single GPU, each epoch runs at 04:46<00:00, 8.37samples/s. Is this speed OK, or is it not using the GPU?

scottstephenson commented 5 years ago

Did you delete your vocab file and have it automatically regenerated?

What kind of GPU do you have? 8 samples/s is fairly fast. That's a ~50x speedup compared to realtime (seconds of audio per second).
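For anyone wanting to reproduce the realtime arithmetic, here is a minimal sketch. The function name `realtime_factor` is made up for illustration, and the 6-second mean utterance length is an assumption (the thread never states it); with that value, the reported 8.37 samples/s works out to roughly 50x realtime.

```python
def realtime_factor(samples_per_second: float, mean_utterance_seconds: float) -> float:
    """Return seconds of audio processed per wall-clock second."""
    return samples_per_second * mean_utterance_seconds


# 8.37 samples/s is the rate reported in this thread. The 6.0-second mean
# utterance length is an ASSUMED value, chosen only for illustration.
speedup = realtime_factor(8.37, 6.0)
print(f"about {speedup:.0f}x realtime")
```

If your own utterances are shorter or longer on average, the same samples/s figure translates to a proportionally smaller or larger speedup.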

alamnasim commented 5 years ago

Thanks for the quick reply. I am using an Nvidia Tesla P100 GPU. I am not able to locate the vocab file. Are you talking about the alphabet file named vocab.py, or the two .txt books, Pride and Prejudice and Shakespeare?

scottstephenson commented 5 years ago

Is this for the speech example or the language model example?

If it's for the speech example, then there is a vocab.json file that is automatically generated the first time you train. If you change anything (such as which input data you use), you'll need to regenerate that vocab file (it's as simple as deleting it and letting it be regenerated on the next 'train').
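One rough way to tell whether an existing vocab file is stale for new data is to compare characters. The sketch below rests on two assumptions that are not confirmed in this thread: that vocab.json holds a JSON list of single-character strings, and that each line of train.jsonl is a JSON object with a "text" field. The helper names are hypothetical and are not part of Kur.

```python
import json


def transcript_characters(jsonl_path, text_field="text"):
    """Collect every character used in the transcripts of a JSONL file."""
    chars = set()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            chars.update(entry[text_field])  # "text" field name is an assumption
    return chars


def missing_from_vocab(jsonl_path, vocab_path, text_field="text"):
    """Return transcript characters that are absent from the vocab file."""
    with open(vocab_path, encoding="utf-8") as handle:
        vocab = set(json.load(handle))  # assumed: a JSON list of characters
    return sorted(transcript_characters(jsonl_path, text_field) - vocab)
```

Calling `missing_from_vocab("train.jsonl", "vocab.json")` and getting a non-empty list would suggest the vocab was built from different data and should be regenerated. This compares raw characters only; any normalization Kur applies to transcripts is not modeled here.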

alamnasim commented 5 years ago

I am working on the speech example (speech recognition). Yesterday I ran the code with my own dataset and got this result: ( Truth: " sgd o ' b b ' b o ' b o ' g'h m' gh "), while my truths are proper sentences (mixed Hindi-English written in Roman script). It should fetch the proper sentences (text) given in train.jsonl as the truth, so I cannot figure out why this happens.

You told me to delete vocab.json, but I am unable to locate this file anywhere in the kur directory. Could you please give me the exact path of this file?

scottstephenson commented 5 years ago

See #35 and #93

The vocab.json gets dumped in the current working directory when running kur train speech.yml.
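As a small sketch of the cleanup step, the helper below deletes vocab.json from a given directory so that it gets regenerated on the next training run. The function name `remove_stale_vocab` is hypothetical, and it assumes you point it at the directory from which training was launched.

```python
from pathlib import Path


def remove_stale_vocab(workdir=".", filename="vocab.json"):
    """Delete the vocab file in workdir; return True if a file was removed."""
    path = Path(workdir) / filename
    if path.is_file():
        path.unlink()
        return True
    return False  # nothing to delete; the file is not in this directory
```

Running `remove_stale_vocab()` from the directory where training was started, and then training again, should produce a fresh vocab file. A return value of False is a hint that training was launched from a different directory.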