I've been trying to figure out the best way to incorporate this. The network currently on the main branch is what Baidu used on 8k hours of data, but it should generalise to the small corpus, though not as well as it could with some parameter tweaks.
I think specific network parameters to achieve the lowest WER/CER on librispeech and AN4 would be nice :)
The results you're getting are crazy! Is this with the latest master? Doing this on the main branch levelled off at a WER of around 16 for me.
Yes, this was the latest version from master, checked out yesterday. After setting everything up, I ran it with the default params, and got something similar to the posted report. I didn't really do anything except tweak those params a bit. This is the output of 'Train.lua'. I also ran 'Test.lua', but that also output pretty much the same numbers (WER rounded up to 4).
Is it possible I did something wrong? I also double-checked that Test.lua is indeed referencing the test set and not anything in train.
Anyway this seemed to be the sweet spot. Lowering things more brought the error rate back up. These params were also quite fast - around 15 mins to run through 25 epochs on an ec2 g2 machine.
Strange, I just ran th Train.lua -hiddenSize 750 -nbOfHiddenLayers 6
and this was my output graph:
Anything I'm missing? Thanks for your help on this one :)
It must be on my side, this is my first foray into this. I followed the tutorials for warp-ctc installation (with cuda), and then your wiki tutorial, including the AN4 data download and prep.
My complete Train.lua looks like this:
local Network = require 'Network'
-- Options can be overridden on the command line at run time.
local cmd = torch.CmdLine()
cmd:option('-loadModel', false, 'Load previously saved model')
cmd:option('-saveModel', true, 'Save model after training/testing')
cmd:option('-modelName', 'DeepSpeechModel', 'Name of class containing architecture')
cmd:option('-nGPU', 1, 'Number of GPUs, set -1 to use CPU')
cmd:option('-trainingSetLMDBPath', './prepare_datasets/an4_lmdb/train/', 'Path to LMDB training dataset')
cmd:option('-validationSetLMDBPath', './prepare_datasets/an4_lmdb/test/', 'Path to LMDB test dataset')
cmd:option('-logsTrainPath', './logs/TrainingLoss/', ' Path to save Training logs')
cmd:option('-logsValidationPath', './logs/ValidationScores/', ' Path to save Validation logs')
cmd:option('-saveModelInTraining', true, 'save model periodically through training')
cmd:option('-modelTrainingPath', './models/', ' Path to save periodic training models')
cmd:option('-saveModelIterations', 10, 'When to save model through training')
cmd:option('-modelPath', 'deepspeech.t7', 'Path of final model to save/load')
cmd:option('-dictionaryPath', './dictionary', ' File containing the dictionary to use')
cmd:option('-epochs', 25, 'Number of epochs for training')
cmd:option('-learningRate', 3e-4, ' Training learning rate')
cmd:option('-learningRateAnnealing', 1.1, 'Factor to anneal lr every epoch')
cmd:option('-maxNorm', 400, 'Max norm used to normalize gradients')
cmd:option('-momentum', 0.90, 'Momentum for SGD')
cmd:option('-batchSize', 20, 'Batch size in training')
cmd:option('-validationBatchSize', 20, 'Batch size for validation')
cmd:option('-hiddenSize', 750, 'RNN hidden sizes')
cmd:option('-nbOfHiddenLayers', 6, 'Number of rnn layers')
local opt = cmd:parse(arg)
--Parameters for the stochastic gradient descent (using the optim library).
local optimParams = {
    learningRate = opt.learningRate,
    learningRateAnnealing = opt.learningRateAnnealing,
    momentum = opt.momentum,
    dampening = 0,
    nesterov = true
}
--Create and train the network based on the parameters and training data.
Network:init(opt)
Network:trainNetwork(opt.epochs, optimParams)
--Creates the loss plot.
Network:createLossGraph()
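(I left the torch.CmdLine handling as-is, so these defaults could equally have been passed on the command line instead of edited in the file, e.g. th Train.lua -hiddenSize 750 -nbOfHiddenLayers 6 as in your run above.)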
I can also share the logs and the model that generated this result if it would be useful. I'm curious now what is going on, and where I am making a mistake.
Wow, thanks for this! There is a huge bug in the data pre-processing (the test set is the training set), which is why the WER/CER is so low. Fixing now.
Could you pull the main branch and re-process the an4 dataset? It should now be fixed!
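If you want to sanity-check the split after re-processing, one quick (rough) way is to confirm that the train and test utterance lists are disjoint before they go into LMDB. Here is a minimal sketch in plain Lua; the transcript paths below are just the stock AN4 layout and are an assumption on my part, not necessarily what the prep script reads:

-- Quick disjointness check between the AN4 train and test transcripts.
-- NOTE: these paths follow the stock AN4 layout and are assumptions here;
-- adjust them to wherever your prepared file lists actually live.
local function readIds(path)
    local ids = {}
    for line in io.lines(path) do
        -- CMU-style transcript lines end with the utterance id in parentheses,
        -- e.g. "... (an406-fcaw-b)"; fall back to the raw line otherwise.
        local id = line:match('%((.-)%)%s*$') or line
        ids[id] = true
    end
    return ids
end

local train = readIds('an4/etc/an4_train.transcription')
local test  = readIds('an4/etc/an4_test.transcription')

local overlap = 0
for id in pairs(test) do
    if train[id] then overlap = overlap + 1 end
end
print(('utterances in both train and test: %d'):format(overlap))  -- expect 0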
Ah ok! I was really excited! I guess I should have double-checked all the prep too. I will pull it and rerun.
That did it! Now I get something similar to your replication.
Awesome! I still believe that a smaller net would do better on AN4, I'll keep trying different things to get the WER/CER down that may stray from DS2!
Cheers, and thanks for the quick response. This is really an impressive reference. I'm familiar with the ideas and other toolkits and had only touched Torch once or twice before, but I was able to get going in an hour or so. I'm sure I'll keep playing with it quite a bit as well. It would be really neat to find a way to get it near the Kaldi baseline, even if not with the exact same approach as DS2.
Thanks so much for your kind words, glad that the project is useful for you! Definitely, thanks for bringing those baseline numbers up, there is still a lot of improvement that can be done on the model, hopefully we get closer with time :)
This is slightly off topic but I wasn't sure where to ask: if I want to change the 'dictionary', for instance to use phonemes or sub-word units of some other form, it should be sufficient to write a custom Mapper.lua function which breaks the transcription up into the dictionary units, and reconstructs the words from the output IDs, correct?
So if I create dictionary items 'thi' and 'stle', then Mapper.lua should know how to break 'thistle' into these component parts, and put them back together (where appropriate) when it receives hypotheses in the form of mapped ID sequences?
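To make the idea concrete, something like the sketch below is what I have in mind. The unit inventory, names, and interface here are made up purely for illustration; they are not the actual Mapper.lua API:

-- Greedy longest-match split into sub-word units, plus the inverse mapping.
-- Everything below is hypothetical, just to illustrate the question.
local units = { 'thi', 'stle' }
local unit2id, id2unit = {}, {}
for i, u in ipairs(units) do unit2id[u] = i; id2unit[i] = u end

-- Break a word into the longest matching units, left to right.
local function encodeWord(word)
    local ids, pos = {}, 1
    while pos <= #word do
        local best
        for u in pairs(unit2id) do
            if word:sub(pos, pos + #u - 1) == u and (not best or #u > #best) then
                best = u
            end
        end
        assert(best, 'no unit matches "' .. word:sub(pos) .. '"')
        table.insert(ids, unit2id[best])
        pos = pos + #best
    end
    return ids
end

-- Rebuild the word from a sequence of unit ids.
local function decodeWord(ids)
    local parts = {}
    for _, id in ipairs(ids) do table.insert(parts, id2unit[id]) end
    return table.concat(parts)
end

print(table.concat(encodeWord('thistle'), ' '))   -- "1 2"
print(decodeWord(encodeWord('thistle')))          -- "thistle"

Word boundaries would presumably need their own unit (a space symbol or similar) so that decoding knows where to put spaces back when reconstructing full hypotheses.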
Have you seen the phoneme branch? There are currently issues with convergence on it, but hopefully it will give you an idea of how to implement your own dictionary!
Hi, Thanks for sharing this wonderful reference!
I have been playing a bit with the AN4 example, and I had a question/comment: the AN4 corpus is extremely small as far as speech corpora go, and I was wondering if maybe the default parameters are a bit over-specified?
Turning them down a bit to:
'-hiddenSize', 750
'-nbOfHiddenLayers', 6
I end up with the following loss graph with final error rates:
These final results varied somewhat over 15 or so different trials, from a lower bound of WER: 2.5, CER: 0.95, to an upper bound of WER: 6.78, CER: 2.23, but were consistently a bit lower than the current baseline.
Do you think this is still a consequence of noise/seed or does this hypothesis make sense?
Interestingly, while we cannot directly compare them, it is really cool to see that these values are well inside the same ballpark as those currently reported in the Kaldi AN4 reference results: