Closed xinq2016 closed 7 years ago
penalty too much? try again, maybe
What penalty exactly do you mean? The optimizer of the GRU net is configured as follows:

```yaml
optimizer:
  name: sgd
  nesterov: yes
  learning_rate: 2e-4
  momentum: 0.9
  clip:
    norm: 100
```
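For reference, `clip: norm: 100` rescales a gradient whenever its L2 norm exceeds the threshold, which is the usual defense against the exploding gradients that produce NaN losses. A minimal numpy sketch of the idea (Kur/Keras internals may differ; this is illustrative only):

```python
import numpy as np

def clip_by_norm(grad, max_norm=100.0):
    """Rescale a gradient so its L2 norm never exceeds max_norm.

    Gradients already under the threshold pass through unchanged;
    larger ones keep their direction but are shrunk to max_norm.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# A gradient with L2 norm 500 gets rescaled to norm 100:
g = np.full(25, 100.0)          # norm = sqrt(25 * 100^2) = 500
clipped = clip_by_norm(g, 100.0)
print(np.linalg.norm(clipped))  # ~100.0
```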
Things to try:

```yaml
train:
  data:
    - speech_recognition:
        max_duration: 5
        # other parameters ...
```
Also, are you using the provided `speech.yml` Kurfile, as-is? And when you say that it becomes NaN after a few iterations, you're talking about a few batches during the very first epoch?
Actually, I use the ba-dls-deepspeech architecture provided by Baidu, with a 1500-hour Mandarin training set.
I checked speech.yml against ba-dls-deepspeech and they are the same, except for the Keras backend: ba-dls-deepspeech uses Theano, while Kur uses TensorFlow. The architecture we use is shown below:
The NaN occurs during the very first epoch, in which the training set is sorted by length.
The learning rate is 2e-4, the output dimension is 5742 (including the blank label), and the max duration is 15.0 seconds.
I did some experiments with the learning rate:
Can anyone give some tips on choosing the learning rate for speech recognition training with CTC?
Many thanks.
Yes, ba-dls uses Theano. Are you using Theano in your Kurfile? Could you post the Kurfile?
Also, the loss curves are normal for a "Sortagrad" epoch: if `sortagrad: duration` is in your Kurfile, then during the first epoch all training samples are presented to the network in order of increasing length. Since longer utterances are harder to learn, the training loss tends to increase during the first epoch. For subsequent epochs, data will be shuffled per usual, and loss should start to go down again.
Sortagrad is useful as a form of regularization, or as a type of curriculum learning: it helps ensure that the network learns smoothly during its first epoch and doesn't hit numerical instabilities.
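The Sortagrad schedule above can be sketched in a few lines. This is an illustrative sketch, not Kur's actual implementation; the `duration_s` field name is borrowed from the dataset format discussed later in this thread:

```python
import random

def epoch_order(samples, epoch):
    """Return the training samples for one epoch under Sortagrad.

    Epoch 0: sorted shortest-to-longest, so the network sees easy
    (short) utterances first. Later epochs: shuffled as usual.
    """
    if epoch == 0:
        return sorted(samples, key=lambda s: s["duration_s"])
    out = list(samples)
    random.shuffle(out)
    return out

data = [{"uuid": "a", "duration_s": 9.0},
        {"uuid": "b", "duration_s": 2.5},
        {"uuid": "c", "duration_s": 5.1}]
print([s["uuid"] for s in epoch_order(data, epoch=0)])  # ['b', 'c', 'a']
```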
Not so good: training loss 30, validation loss 242.
```
Epoch 18/inf, loss=33.752: 100%|██▉| 28528/28539
Validating, loss=224.911:  94%|█████████▍| 256/271
Prediction: "purt they motions of heairien as komros worer fior the monment thouh was tove thictry onunly"
Truth: "but the emotions of harry and his comrades were for the moment those of victory only"

Epoch 19/inf, loss=30.674: 100%|██▉| 28528/28539
[INFO 2017-03-01 11:58:39,741 kur.model.executor:390] Training loss: 30.674
Validating, loss=242.753:  94%|█████████▍| 256/271
[INFO 2017-03-01 11:59:29,875 kur.model.executor:175] Validation loss: 242.753
Prediction: "qlim ob bat ing satltf ul an the martuy of paurtigra that an lon insheri ent were rase of help ploss an the vouxtt to slavholden's toos to autteresxtryrmanation efro thi semin guor the garr es of ouasaryos and whes ta m bon as thi lost chaiuns the the got ic reaces wit ot sosa is on the is tyer mrorel the saime stearin yet holson dusan pen under which the hestrn had the restort ti mih"
Truth: "climate bad example and the luxury of power degraded them in one century into a race of helpless and debauched slave holders doomed to utter extermination before the semi gothic armies of belisarius and with them vanished the last chance that the gothic races would exercise on the eastern world the same stern yet wholesome discipline under which the western had been restored to life"
```
@liaoweiguo I'd say the result after epoch 18 isn't too bad. What training and validation datasets are you using?
@scottstephenson 9G lsc100-100p-train
@ajsyp
Sorry, I have not used Kur yet. I cannot download the dataset used by the training example because of internet restrictions, so the data format is unknown to me. Could you tell me the format of the dataset?
For a quick verification of CTC performance, compared against the hybrid system (chain model) from Kaldi, I chose ba-dls with the same network architecture as Kur. I believe ba-dls and Kur have the same performance.
To avoid the NaN training loss on long utterances (possibly caused by Theano, especially in the first sorted epoch), I now limit the first sorted epoch to utterances of at most 500 frames (about the first 20% of the training set) with learning rate 8e-5. After the very first epoch, I train on random samples without the length limit, with an initial learning rate of 8e-5 decaying to a final learning rate of 1e-5, one decay step per epoch. I use SGD as the optimizer, with momentum fixed at 0.9 and clipnorm 100:

```python
optimizer = SGD(nesterov=True, lr=learning_rate, momentum=0.9, clipnorm=100)
updates = optimizer.get_updates(trainable_vars, [], ctc_cost)
```

The training and validation loss curves so far are shown below. I don't know yet whether the loss will keep going down; it hasn't so far (still in epoch 1).
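The per-epoch decay from 8e-5 down to 1e-5 mentioned above could be realized in several ways; one plausible reading is exponential decay over a fixed number of epochs. A minimal sketch (the 20-epoch horizon is a placeholder assumption, not from the thread; linear decay would be another valid reading):

```python
def lr_for_epoch(epoch, lr_start=8e-5, lr_end=1e-5, n_epochs=20):
    """Exponentially decay the learning rate once per epoch,
    from lr_start at epoch 0 down to lr_end at the final epoch."""
    if epoch >= n_epochs - 1:
        return lr_end
    ratio = (lr_end / lr_start) ** (epoch / (n_epochs - 1))
    return lr_start * ratio

print(lr_for_epoch(0))    # 8e-05
print(lr_for_epoch(19))   # 1e-05
```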
@liaoweiguo -- I agree with @scottstephenson: your model seems to be training just fine. Given that you're on epoch 18, I assume you aren't suffering from the "loss is NaN" problem that this issue is about? If you feel that something else isn't working, feel free to open another issue.
@xinq2016: I'm a little confused. You said, "I did not use Kur yet." If you are not using Kur, then you should ask your more general design questions on our Gitter, rather than opening a GitHub issue on a non-Kur topic. If you are using Kur, then those plots suggest that you are training past the first epoch, rather than getting NaN loss in the first iterations.
So is there still an open issue here?
@ajsyp Sorry for the confusion. I cannot get the dataset for the Kur example. I use ba-dls, but nobody has replied to the NaN issue on the ba-dls GitHub so far. Maybe this issue should be closed now. Thank you for your advice.
The data format is very similar to ba-dls. Your Kurfile should have a section that looks like this:

```yaml
- speech_recognition:
    path: PATH
```

Now, you should have a folder at PATH that contains two items: an `audio` directory and a `dataset.jsonl` file. The `dataset.jsonl` file should contain lines, each of which is a JSON blob of the form:

```json
{ "text": "my transcript goes here", "duration_s": DURATION, "uuid": UUID }
```

The duration should be the audio duration in seconds, and the UUID can be any identifier (not necessarily a strict UUID). For each entry in the JSONL file, there should be a corresponding `UUID.wav` file in the `audio` directory. Also, your data doesn't necessarily need to be WAV format. We've tested on WAV, FLAC, and MP3.
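As a sanity check before training, the layout above can be validated with a short script. This is a hypothetical helper, not part of Kur; it only checks that every JSONL entry has a matching `.wav` file:

```python
import json
import os

def check_dataset(root):
    """Verify the dataset layout: each line of dataset.jsonl is a JSON
    blob, and each entry's uuid has a matching .wav in audio/.
    Returns the list of uuids whose audio file is missing."""
    missing = []
    with open(os.path.join(root, "dataset.jsonl")) as f:
        for line in f:
            entry = json.loads(line)
            wav = os.path.join(root, "audio", entry["uuid"] + ".wav")
            if not os.path.exists(wav):
                missing.append(entry["uuid"])
    return missing
```

Entries pointing at FLAC or MP3 files would need the extension adjusted accordingly.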
If you think there is something wrong with the server that is hosting the speech datasets, please open a new issue.
@ajsyp Many thanks, I will try Kur to train my corpus.
Has anyone else hit the same issue? Train/valid loss becomes NaN after a few iterations.