deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Speech Recognition Seems To Overfit #6

Closed bharris47 closed 7 years ago

bharris47 commented 7 years ago

Hi, I don't know if this is an issue with the framework, but I didn't know where else to ask. I have been training the speech recognition example (speech.yml) for about 80 epochs on a Titan X inside a tensorflow-gpu-py3-based Docker image. The training loss has gone way down, but my validation loss is still very high, and the sample predictions it prints are gibberish.

Example output:

Epoch 80/inf, loss=5.845: 100%|##########| 2432/2432 [07:04<00:00,  7.80samples/s]
Validating, loss=766.122:  94%|#########4| 256/271 [00:21<00:01, 13.70samples/s]
Prediction: "thg asi bw a hnta e tb incotnetk rndegibnrtrlan ty bmna ekftett trelaob"
Truth: "and what inquired missus macpherson has mary ann given you her love"
  1. Is this sort of behavior expected this early in training?
  2. How long would you expect to have to train this model before it starts producing reasonable results?
scottstephenson commented 7 years ago

You're right that it looks like overfitting (which isn't surprising on this dataset). However, it looks like it learned common spacing patterns, characters, and typical vowel-to-consonant ratios. That's a sign that things are on a good trajectory.
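That vowel-to-consonant observation is easy to sanity-check on the sample output above. A minimal sketch (the helper name is mine, nothing here is part of kur; 'y' is arbitrarily counted as a consonant):

```python
def vowel_consonant_ratio(text):
    """Ratio of vowels to consonants, ignoring spaces and punctuation."""
    letters = [c for c in text.lower() if c.isalpha()]
    vowels = sum(c in "aeiou" for c in letters)
    consonants = len(letters) - vowels
    return vowels / consonants if consonants else 0.0

# The prediction/truth pair from the issue report above.
pred = "thg asi bw a hnta e tb incotnetk rndegibnrtrlan ty bmna ekftett trelaob"
truth = "and what inquired missus macpherson has mary ann given you her love"

print(round(vowel_consonant_ratio(pred), 2))   # 0.4
print(round(vowel_consonant_ratio(truth), 2))  # 0.6
```

Uniformly random letters would sit near 5/21 ≈ 0.24, so even this gibberish is already biased toward English-like letter statistics.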

Output from previous epochs and your pip freeze will help us debug this. Wanna send those?

Some things to try in the meantime while we sort out the example:

bharris47 commented 7 years ago

Good point about learning basic language constructs.

Here's the pip freeze and the stdout from training:

https://drive.google.com/open?id=0B7XAwHAx4HjRdV9yUllVV1NhQ2s https://drive.google.com/open?id=0B7XAwHAx4HjRRHJ3MmRKNGxrTEU

Thanks for the tips, I'll give them a try.

Semi-unrelated note: I tried using the LSTM type for RNN, but all the predictions are " ". I can file a separate issue for that if needed.

scottstephenson commented 7 years ago

It looks to me like your environment and results are ok.

One thing to keep in mind is that the speech recognition example is meant to show a working/training model, but is still pretty data limited. We're working on releasing a dataset that has considerably more data so people can get better results.

Feeding more data to the speech rec model takes more compute and bandwidth than a lot of people have access to, though, so we chose to keep the initial release dataset small for ease of use.

Nevertheless, the model is ready, and just needs data :). We'll keep you updated on the dataset release.

We'll get ~10x more data out in the next couple days.

scottstephenson commented 7 years ago

@bharris47 We just put up some bigger datasets. Can you point your train url and checksum at one of these?

URLS:
10 hour (0.9GB):  http://kur.deepgram.com/data/lsc100-10p-train.tar.gz
20 hour (1.8GB):  http://kur.deepgram.com/data/lsc100-20p-train.tar.gz
50 hour (4.5GB):  http://kur.deepgram.com/data/lsc100-50p-train.tar.gz
100 hour (8.9GB): http://kur.deepgram.com/data/lsc100-100p-train.tar.gz

CHECKSUMS:
10 hour:  46354f284588fec8facd2fc6abee6ba9d020e21bdcb26081d3990e96e720d8a6
20 hour:  e8075b10d3750e6532d10cfb2c00aa8047518b1046708fdcab14e4d1f56c499d
50 hour:  14a740c0ea1097b60e3d0df924f2e0175009a997cb9c9b55e27b28fe280afdc0
100 hour: cad3d2aa735d50d4ddb051fd8455f2dd7625ba0bb1c7dd1528da171a10f4fe86

The 10 hour set is double the size of the default dataset for the stock speech example. You can keep going up in scale, and by the 50 hour mark you're bound to start seeing pretty good output.

Note: The above datasets are fractions of the 100 hour dataset (100%=100p, 50%=50p, ...) from LibriSpeech. They should not be concatenated, since 100p contains 50p, which contains 20p, which contains 10p.
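For anyone grabbing these multi-GB tarballs, it's worth verifying the download against its checksum before training on it. A minimal sketch (the function names are mine, not part of kur; the digests are copied from the list above):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a multi-GB tarball never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected digests, copied from the checksum list above.
CHECKSUMS = {
    "lsc100-10p-train.tar.gz":  "46354f284588fec8facd2fc6abee6ba9d020e21bdcb26081d3990e96e720d8a6",
    "lsc100-20p-train.tar.gz":  "e8075b10d3750e6532d10cfb2c00aa8047518b1046708fdcab14e4d1f56c499d",
    "lsc100-50p-train.tar.gz":  "14a740c0ea1097b60e3d0df924f2e0175009a997cb9c9b55e27b28fe280afdc0",
    "lsc100-100p-train.tar.gz": "cad3d2aa735d50d4ddb051fd8455f2dd7625ba0bb1c7dd1528da171a10f4fe86",
}

def verify(path):
    """Return True if the file at `path` matches its published checksum."""
    name = path.rsplit("/", 1)[-1]
    ok = sha256_of(path) == CHECKSUMS[name]
    print(f"{name}: {'OK' if ok else 'CHECKSUM MISMATCH -- re-download'}")
    return ok
```

After downloading, `verify("lsc100-50p-train.tar.gz")` should print `OK`; a mismatch usually means a truncated download.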

bharris47 commented 7 years ago

Thanks a lot @scottstephenson, can confirm I am starting to see some intelligible predictions!

Epoch 6/inf, loss=214.989: 100%|#########9| 28416/28539 [25:19<00:06, 18.65samples/s]
[INFO 2017-01-23 19:21:05,447 kur.model.executor:338] Training loss: 214.989
Validating, loss=247.101:  94%|#########4| 256/271 [00:13<00:00, 17.32samples/s]
[INFO 2017-01-23 19:21:19,196 kur.model.executor:172] Validation loss: 247.101
Prediction: " itd mighthe jest is wowl bea s onto ese es whet d eid so on am arteien is the par wich h ime saeit tipla ne cq"
Truth: "it might just as well be some one else's wedding so unimportant is the part which i am set to play in it"
ajsyp commented 7 years ago

Glad to hear! Feel free to use our data format as a template for adding even more data! It's just a simple tarball.
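Since the format is just a tarball, packaging your own additional data can be as simple as archiving a directory. A minimal sketch (the function is mine; the internal layout expected by kur is an assumption here, so inspect one of the official tarballs with `tar -tzf lsc100-10p-train.tar.gz | head` first and mirror what you see):

```python
import tarfile
from pathlib import Path

def bundle_dataset(src_dir, out_path):
    """Pack every file under src_dir into a gzipped tarball, keeping paths relative.

    Assumes audio files and their transcripts live together under src_dir,
    matching whatever layout the official tarballs use.
    """
    src = Path(src_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        for f in sorted(src.rglob("*")):
            if f.is_file():
                tar.add(f, arcname=str(f.relative_to(src)))
    return out_path
```

Point the resulting tarball's path (and its sha256 digest) at the same url/checksum settings discussed above.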