deepgram / kur

Descriptive Deep Learning
Apache License 2.0

loss is NaN for speech example #21

Closed xinq2016 closed 7 years ago

xinq2016 commented 7 years ago

Has anyone else hit the same issue? The train/valid loss becomes NaN after a few iterations.

liaoweiguo commented 7 years ago

Penalty too high? Try again, maybe.

xinq2016 commented 7 years ago

What penalty do you mean exactly? The optimizer of the GRU net is configured as follows:

  optimizer:
    name: sgd
    nesterov: yes
    learning_rate: 2e-4
    momentum: 0.9
    clip:
      norm: 100
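For intuition, clip: norm: 100 corresponds to gradient clipping by norm: if a gradient's L2 norm exceeds the threshold, it is rescaled down to that norm. A minimal NumPy sketch of the idea (illustration only, not Kur's or Keras's actual implementation):

    import numpy as np

    def clip_by_norm(grad, max_norm=100.0):
        """Rescale grad so its L2 norm does not exceed max_norm."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    # Example: a gradient with norm 500 is scaled down to norm 100.
    g = np.full(25, 100.0)                      # ||g|| = 500
    print(np.linalg.norm(clip_by_norm(g)))      # ~100.0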

ajsyp commented 7 years ago

Things to try:

ajsyp commented 7 years ago

Also, are you using the provided speech.yml Kurfile, as-is? And when you say that it becomes NaN after a few iterations, you're talking about a few batches during the very first epoch?

xinq2016 commented 7 years ago

I actually use the ba-dls-deepspeech architecture provided by Baidu, with a 1500-hour Mandarin training set.

But I checked speech.yml against ba-dls-deepspeech, and they are the same except for the Keras backend: ba-dls-deepspeech uses Theano, while Kur uses TensorFlow. The architecture we use is shown below: [architecture diagram]

The NaN occurs during the very first epoch, in which the training set is sorted by length.

The learning rate is 2e-4, the output dimension is 5742 (including the blank label), and the max duration is 15.0 seconds.

I did some experiments with the learning rate:

  1. Changing the learning rate from 2e-4 to 8e-5: over the first 10000 iterations (52620 iterations per epoch), the training loss is shown below. The training loss rises with the duration of the utterances. Does that mean training is making no improvement? [training loss plot]
  2. With a learning rate of 1e-5, the training loss is shown below (although 1e-5 seems too small for training): [training loss plot] It seems the small learning rate avoids the NaN.

Can anyone give some tips on choosing the learning rate for speech recognition training with CTC?

Many thanks

ajsyp commented 7 years ago

Yes, ba-dls uses Theano. Are you using Theano in your Kurfile? Could you post the Kurfile?

ajsyp commented 7 years ago

Also, the loss curves are normal for a "Sortagrad" epoch: if sortagrad: duration is in your Kurfile, then during the first epoch, all training samples are presented to the network in the order of increasing length. Since longer utterances are harder to learn, the training loss tends to increase during the first epoch. For subsequent epochs, data will be shuffled per usual, and loss should start to go down again.

Sortagrad is useful as a form of regularization, or as a type of curriculum learning, to help ensure that the network learns smoothly during its first epoch and doesn't hit numerical instabilities.
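A minimal sketch of the sortagrad idea (not Kur's actual data-provider code; samples here is assumed to be a list of (duration, sample) pairs):

    import random

    def batches_for_epoch(samples, epoch, batch_size=32):
        """Sortagrad: sort by duration on the first epoch, shuffle afterwards."""
        ordered = list(samples)
        if epoch == 0:
            # Curriculum: present short (easier) utterances first.
            ordered.sort(key=lambda pair: pair[0])
        else:
            random.shuffle(ordered)
        for i in range(0, len(ordered), batch_size):
            yield [sample for _, sample in ordered[i:i + batch_size]]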

liaoweiguo commented 7 years ago

Not so good: training loss 30, validation loss 242.

Epoch 18/inf, loss=33.752: 100%|██▉| 28528/28539
Validating, loss=224.911: 94%|█████████▍| 256/271
Prediction: "purt they motions of heairien as komros worer fior the monment thouh was tove thictry onunly"
Truth: "but the emotions of harry and his comrades were for the moment those of victory only"

Epoch 19/inf, loss=30.674: 100%|██▉| 28528/28539
[INFO 2017-03-01 11:58:39,741 kur.model.executor:390] Training loss: 30.674
Validating, loss=242.753: 94%|█████████▍| 256/271
[INFO 2017-03-01 11:59:29,875 kur.model.executor:175] Validation loss: 242.753
Prediction: "qlim ob bat ing satltf ul an the martuy of paurtigra that an lon insheri ent were rase of help ploss an the vouxtt to slavholden's toos to autteresxtryrmanation efro thi semin guor the garr es of ouasaryos and whes ta m bon as thi lost chaiuns the the got ic reaces wit ot sosa is on the is tyer mrorel the saime stearin yet holson dusan pen under which the hestrn had the restort ti mih"
Truth: "climate bad example and the luxury of power degraded them in one century into a race of helpless and debauched slave holders doomed to utter extermination before the semi gothic armies of belisarius and with them vanished the last chance that the gothic races would exercise on the eastern world the same stern yet wholesome discipline under which the western had been restored to life"

scottstephenson commented 7 years ago

@liaoweiguo I'd say the result after epoch 18 isn't too bad. What training and validation datasets are you using?

liaoweiguo commented 7 years ago

@scottstephenson 9G lsc100-100p-train

xinq2016 commented 7 years ago

@ajsyp

Sorry, I have not used Kur yet. I cannot download the training data set used in the example because of network restrictions, so the data format is unknown to me. Could you tell me the format of the data set?

For a quick check of CTC performance against the hybrid system (chain model) from Kaldi, I chose ba-dls with the same network architecture as Kur instead. I believe ba-dls and Kur have the same performance.

To avoid the NaN training loss on long utterances (possibly caused by Theano, especially in the first sorted epoch), I now limit the first sorted epoch to utterances of at most 500 frames (about the first 20% of the training set) with a learning rate of 8e-5. After that first epoch, I train on random samples without the length limit, starting from a learning rate of 8e-5 and decaying per epoch to a final learning rate of 1e-5. The optimizer is SGD with momentum 0.9 and clipnorm 100:

    optimizer = SGD(nesterov=True, lr=learning_rate, momentum=0.9, clipnorm=100)
    updates = optimizer.get_updates(trainable_vars, [], ctc_cost)
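As an illustration of the decay schedule described above, here is a sketch of one possible per-epoch decay from 8e-5 down to 1e-5 (the geometric form and the n_epochs value are assumptions, not the exact schedule used):

    def learning_rate_for_epoch(epoch, n_epochs=20, lr_start=8e-5, lr_end=1e-5):
        """Geometric per-epoch decay from lr_start down to lr_end."""
        if epoch >= n_epochs - 1:
            return lr_end
        decay = (lr_end / lr_start) ** (1.0 / (n_epochs - 1))
        return lr_start * decay ** epoch

    # epoch 0 -> 8e-5, epoch 19 -> 1e-5, roughly halving every 6-7 epochs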

The training and validation loss curves so far are shown below. I don't know yet whether the loss will go down; it hasn't so far (still in epoch 1).

[training/validation loss plot]

ajsyp commented 7 years ago

@liaoweiguo -- I agree with @scottstephenson: your model seems to be training just fine. Given that you're on epoch 18, I assume you aren't suffering from the "loss is NaN" problem that this issue is about? If you feel that something else isn't working, feel free to open another issue.

ajsyp commented 7 years ago

@xinq2016: I'm a little confused. You said, "I did not use Kur yet." If you are not using Kur, then you should ask your more general design questions on our Gitter, rather than opening a GitHub issue on a non-Kur topic. If you are using Kur, then those plots suggest that you are training past the first epoch, rather than getting NaN loss in the first iterations.

So is there still an open issue here?

xinq2016 commented 7 years ago

@ajsyp Sorry for the confusion. I cannot get the data set used in the Kur example. I use ba-dls, but no one has replied to the NaN issue on the ba-dls GitHub. Maybe this issue should be closed now. Thank you for your advice.

ajsyp commented 7 years ago

The data format is very similar to ba-dls. Your Kurfile should have a section that looks like this:

  - speech_recognition:
      path: PATH

Now, you should have a folder at PATH that contains two items: an audio directory and a dataset.jsonl file. The dataset.jsonl file should contain lines, each line of which is a JSON blob of the form:

{ "text": "my transcript goes here", "duration_s": DURATION, "uuid": UUID }

The duration should be the audio duration in seconds, and the UUID can be any identifier (not necessarily a strict UUID). For each entry in the JSONL file, there should be a corresponding UUID.wav file in the audio directory. Also, your data doesn't necessarily need to be WAV format. We've tested on WAV, FLAC, and MP3.
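As an illustration of that layout, here is a small validation sketch (it assumes .wav files and the field names described above; adapt it if your audio is FLAC or MP3):

    import json
    import os

    def check_dataset(path):
        """Verify each dataset.jsonl entry has a matching audio file under path/audio."""
        audio_dir = os.path.join(path, 'audio')
        with open(os.path.join(path, 'dataset.jsonl')) as f:
            for line in f:
                entry = json.loads(line)
                wav = os.path.join(audio_dir, entry['uuid'] + '.wav')
                if not os.path.isfile(wav):
                    print('Missing audio for:', entry['uuid'])
                elif entry['duration_s'] <= 0:
                    print('Bad duration for:', entry['uuid'])

    check_dataset('PATH')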

If you think there is something wrong with the server that is hosting the speech datasets, please open a new issue.

xinq2016 commented 7 years ago

@ajsyp Many thanks, I will try Kur to train my corpus.