Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

LSTM decorated with Sequencer null error after some batches of video data #298

Closed felixzfx closed 8 years ago

felixzfx commented 8 years ago

Hi guys,

I modified the recurrent-language-model example in rnn/examples/recurrent-language-model.lua to handle video data. In the example, the FastLSTM module is decorated with a Sequencer module. The input of my experiments is seqLength x batchSize x featureSize, where seqLength is set to 16 frames, batchSize is set to 5, and featureSize = channels x height x width of the images. All code runs in CPU mode.
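
The relevant part of my setup boils down to something like this minimal sketch (the sizes below are illustrative only; my real featureSize is 3 x 192 x 208):

```lua
-- Minimal sketch: a Sequencer-decorated FastLSTM fed with
-- seqLength x batchSize x featureSize input (illustrative sizes).
require 'rnn'

local seqLength, batchSize, featureSize, hiddenSize = 16, 5, 3*32*32, 256

local model = nn.Sequential()
   :add(nn.SplitTable(1))  -- 3D tensor -> table of seqLength tensors of size batchSize x featureSize
   :add(nn.Sequencer(nn.FastLSTM(featureSize, hiddenSize)))

local input = torch.rand(seqLength, batchSize, featureSize)
local output = model:forward(input)  -- table of seqLength tensors, each batchSize x hiddenSize
print(#output, output[1]:size())
```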

Every time, the program ran partway through the first epoch and then failed on some batch of data. The error is **null**. Stack trace:

```
#0  file:/home/fanxiang/workspace/myLSTMActionRecog/src/Linear.lua [67]
#1  file:/home/fanxiang/torch/install/share/lua/5.1/nngraph/gmodule.lua [333]
#2  file:/home/fanxiang/torch/install/share/lua/5.1/nngraph/gmodule.lua [368]
#3  file:/home/fanxiang/workspace/myLSTMActionRecog/src/LSTM.lua [167]
#4  file:/ [-1]
#5  file:/home/fanxiang/torch/install/share/lua/5.1/nn/Container.lua [63]
#6  file:/home/fanxiang/torch/install/share/lua/5.1/nn/Sequential.lua [44]
#7  file:/home/fanxiang/torch/install/share/lua/5.1/rnn/Recursor.lua [25]
#8  file:/home/fanxiang/torch/install/share/lua/5.1/rnn/Sequencer.lua [59]
#9  file:/ [-1]
#10 file:/home/fanxiang/torch/install/share/lua/5.1/nn/Container.lua [63]
#11 file:/home/fanxiang/torch/install/share/lua/5.1/nn/Sequential.lua [44]
#12 file:/home/fanxiang/workspace/myLSTMActionRecog/src/testLstmModule.lua [264]
```

The locations where the error occurs are not always the same, but they are very close to each other. Almost every time, the error occurs in the nn.Linear module. I marked the locations where the error occurred in nn/Linear.lua:

```lua
function Linear:updateOutput(input)
   if input:dim() == 1 then
      self.output:resize(self.weight:size(1))
      if self.bias then self.output:copy(self.bias) else self.output:zero() end
      self.output:addmv(1, self.weight, input)
   elseif input:dim() == 2 then
      local nframe = input:size(1)
      local nElement = self.output:nElement()
      self.output:resize(nframe, self.weight:size(1))
      if self.output:nElement() ~= nElement then
         self.output:zero()
      end
      updateAddBuffer(self, input)
      self.output:addmm(0, self.output, 1, input, self.weight:t())
      if self.bias then self.output:addr(1, self.addBuffer, self.bias) end -- Sometimes Here
   else
      error('input must be vector or matrix')
   end

   return self.output
end

function Linear:updateGradInput(input, gradOutput)
   if self.gradInput then
      local nElement = self.gradInput:nElement()
      self.gradInput:resizeAs(input)
      if self.gradInput:nElement() ~= nElement then
         self.gradInput:zero()
      end
      if input:dim() == 1 then
         self.gradInput:addmv(0, 1, self.weight:t(), gradOutput)
      elseif input:dim() == 2 then
         self.gradInput:addmm(0, 1, gradOutput, self.weight) -- sometimes here
      end

      return self.gradInput
   end
end

function Linear:accGradParameters(input, gradOutput, scale)
   scale = scale or 1
   if input:dim() == 1 then
      self.gradWeight:addr(scale, gradOutput, input)
      if self.bias then self.gradBias:add(scale, gradOutput) end
   elseif input:dim() == 2 then
      self.gradWeight:addmm(scale, gradOutput:t(), input)
      if self.bias then
         -- and sometimes here
         -- update the size of addBuffer if the input is not the same size as the one we had in last updateGradInput
         updateAddBuffer(self, input)
         self.gradBias:addmv(scale, gradOutput:t(), self.addBuffer)
      end
   end
end
```
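
Just to rule out nn.Linear itself, a standalone check along these lines (arbitrary sizes, nothing to do with my real data) exercises the same batched addmm/addr/addmv paths marked above:

```lua
-- Standalone sanity check of nn.Linear on batched (2D) input;
-- batch and feature sizes are arbitrary, for illustration only.
require 'nn'

local batchSize, inSize, outSize = 5, 10, 3
local linear = nn.Linear(inSize, outSize)

local input = torch.rand(batchSize, inSize)
local output = linear:forward(input)                  -- 2D branch of updateOutput (addmm + addr)
local gradOutput = torch.rand(batchSize, outSize)
local gradInput = linear:backward(input, gradOutput)  -- updateGradInput + accGradParameters (addmm, addmv)

print(output:size(), gradInput:size())
```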

I often run into nil errors, but I rarely see this null error. To find the cause, I wrote some code to simulate loading the video frame data, and I get the same error as with the real video data. Here is the full code:

```lua
-- testing the lstm model based on nn.FastLSTM but using simulated data
require 'paths'
require 'rnn'
version = 2

--[[ command line arguments ]]--
cmd = torch.CmdLine()
cmd:text()
cmd:text('Train a Language Model on PennTreeBank dataset using RNN or LSTM or GRU')
cmd:text('Example:')
cmd:text('th recurrent-language-model.lua --cuda --device 2 --progress --cutoff 4 --seqlen 10')
cmd:text("th recurrent-language-model.lua --progress --cuda --lstm --seqlen 20 --hiddensize '{200,200}' --batchsize 20 --startlr 1 --cutoff 5 --maxepoch 13 --schedule '{[5]=0.5,[6]=0.25,[7]=0.125,[8]=0.0625,[9]=0.03125,[10]=0.015625,[11]=0.0078125,[12]=0.00390625}'")
cmd:text("th recurrent-language-model.lua --progress --cuda --lstm --seqlen 35 --uniform 0.04 --hiddensize '{1500,1500}' --batchsize 20 --startlr 1 --cutoff 10 --maxepoch 50 --schedule '{[15]=0.87,[16]=0.76,[17]=0.66,[18]=0.54,[19]=0.43,[20]=0.32,[21]=0.21,[22]=0.10}' -dropout 0.65")
cmd:text('Options:')
-- dataloader
cmd:option('--numClasses', 3) -- necessary
cmd:option('--scaledHeight', 192) -- video frame height
cmd:option('--scaledWidth', 208) -- video frame width
cmd:option('--numChannels', 3) -- num of channels
-- training
cmd:option('--startlr', 0.0001, 'learning rate at t=0')
cmd:option('--minlr', 0.000001, 'minimum learning rate')
cmd:option('--saturate', 50, 'epoch at which linear decayed LR will reach minlr')
cmd:option('--schedule', '', 'learning rate schedule. e.g. {[5] = 0.004, [6] = 0.001}')
cmd:option('--momentum', 0.9, 'momentum')
cmd:option('--maxnormout', -1, 'max l2-norm of each layer\'s output neuron weights')
cmd:option('--cutoff', -1, 'max l2-norm of concatenation of all gradParam tensors')
cmd:option('--batchSize', 5, 'number of examples per batch')
cmd:option('--cuda', false, 'use CUDA')
cmd:option('--device', 1, 'sets the device (GPU) to use')
cmd:option('--maxepoch', 10, 'maximum number of epochs to run') -- 1000 by default
cmd:option('--earlystop', 50, 'maximum number of epochs to wait to find a better local minima for early-stopping')
cmd:option('--progress', false, 'print progress bar')
cmd:option('--silent', false, 'don\'t print anything to stdout')
cmd:option('--uniform', 0.1, 'initialize parameters using uniform distribution between -uniform and uniform. -1 means default initialization')
-- rnn layer
cmd:option('--lstm', true, 'use Long Short Term Memory (nn.LSTM instead of nn.Recurrent)')
cmd:option('--bn', false, 'use batch normalization. Only supported with --lstm')
cmd:option('--gru', false, 'use Gated Recurrent Units (nn.GRU instead of nn.Recurrent)')
cmd:option('--seqLength', 16, 'sequence length : back-propagate through time (BPTT) for this many time-steps')
cmd:option('--inputsize', -1, 'size of lookup table embeddings. -1 defaults to hiddensize[1]')
cmd:option('--hiddensize', '{256}', 'number of hidden units used at output of each recurrent layer. When more than one is specified, RNN/LSTMs/GRUs are stacked')
cmd:option('--dropout', 0, 'apply dropout with this probability after each rnn layer. dropout <= 0 disables it.')
-- data
cmd:option('--batchsize', 5, 'number of examples per batch')
cmd:option('--trainsize', -1, 'number of train examples seen between each epoch')
cmd:option('--validsize', -1, 'number of valid examples used for early stopping and cross-validation')
cmd:option('--savepath', paths.concat('/home/path-to-save-the-model', 'rnnlm'), 'path to directory where experiment log (includes model) will be saved')
cmd:option('--id', 'simulateVideoProcessing', 'id string of this experiment (used to name output file) (defaults to a unique id)')

cmd:text()
local opt = cmd:parse(arg or {})
opt.hiddensize = loadstring(" return "..opt.hiddensize)()
opt.schedule = loadstring(" return "..opt.schedule)()
opt.inputsize = opt.numChannels*opt.scaledHeight*opt.scaledWidth --opt.inputsize == -1 and opt.hiddensize[1] or opt.inputsize
if not opt.silent then
   print(opt)
end

if opt.cuda then
   require 'cunn'
   cutorch.setDevice(opt.device)
end

--[[ lstm model based on FastLSTM and Sequencer ]]--
local lm = nn.Sequential()
lm:add(nn.View(-1, opt.batchSize, opt.numChannels*opt.scaledHeight*opt.scaledWidth))
lm:add(nn.SplitTable(1)) -- tensor to table of tensors

-- rnn layers
local stepmodule = nn.Sequential() -- applied at each time-step
local inputsize = opt.inputsize
for i,hiddensize in ipairs(opt.hiddensize) do
   local rnn
   if opt.gru then -- Gated Recurrent Units
      rnn = nn.GRU(inputsize, hiddensize, nil, opt.dropout/2)
   elseif opt.lstm then -- Long Short Term Memory units
      require 'nngraph'
      nn.FastLSTM.usenngraph = true -- faster
      nn.FastLSTM.bn = opt.bn
      rnn = nn.FastLSTM(inputsize, hiddensize)
   else -- simple recurrent neural network
      local rm = nn.Sequential() -- input is {x[t], h[t-1]}
         :add(nn.ParallelTable()
            :add(i==1 and nn.Identity() or nn.Linear(inputsize, hiddensize)) -- input layer
            :add(nn.Linear(hiddensize, hiddensize))) -- recurrent layer
         :add(nn.CAddTable()) -- merge
         :add(nn.Sigmoid()) -- transfer
      rnn = nn.Recurrence(rm, hiddensize, 1)
   end
   stepmodule:add(rnn)
   if opt.dropout > 0 then
      stepmodule:add(nn.Dropout(opt.dropout))
   end
   inputsize = hiddensize
end

-- output layer
stepmodule:add(nn.Linear(inputsize, opt.numClasses))
stepmodule:add(nn.LogSoftMax())

-- encapsulate stepmodule into a Sequencer
lm:add(nn.Sequencer(stepmodule))

-- remember previous state between batches
lm:remember((opt.lstm or opt.gru) and 'both' or 'eval')

if not opt.silent then
   print"Language Model:"
   print(lm)
end

if opt.uniform > 0 then
   for k,param in ipairs(lm:parameters()) do
      param:uniform(-opt.uniform, opt.uniform)
   end
end

--[[ simulated data ]]--
local simulated_Data = {}
if paths.filep('tdata.t7') then
   simulated_Data.data = torch.load('tdata.t7')
else
   simulated_Data.data = 20*torch.rand(opt.seqLength, opt.batchSize, opt.numChannels, opt.scaledHeight, opt.scaledWidth)
   torch.save('tdata.t7', simulated_Data.data)
end
simulated_Data.tindex = 1
local maxBatchTrain = 175

-- simulate training data
function getTrainData(opt)
   local batch = {}
   if simulated_Data.tindex <= maxBatchTrain then
      batch.data = simulated_Data.data + (simulated_Data.tindex % 20)*0.01
      local labels = torch.Tensor(opt.seqLength*opt.batchSize):fill(0)
      for i=1,labels:size(1) do
         labels[i] = (simulated_Data.tindex + i*i) % opt.numClasses + 1
      end
      batch.labels = labels:clone():view(opt.seqLength, opt.batchSize)
      simulated_Data.tindex = simulated_Data.tindex + 1
      return batch
   else
      simulated_Data.tindex = 1
      return nil
   end
end

local maxBatchVal = 30
simulated_Data.vindex = 1

-- simulate validation data
function getValData(opt)
   local batch = {}
   if simulated_Data.vindex <= maxBatchVal then
      batch.data = simulated_Data.data + (simulated_Data.vindex % 10)*0.01
      local labels = torch.Tensor(opt.seqLength*opt.batchSize):fill(0)
      for i=1,labels:size(1) do
         labels[i] = (simulated_Data.vindex + i*i) % opt.numClasses + 1
      end
      batch.labels = labels:clone():view(opt.seqLength, opt.batchSize)
      simulated_Data.vindex = simulated_Data.vindex + 1
      return batch
   else
      simulated_Data.vindex = 1
      return nil
   end
end
--[[ simulated data end ]]--

--[[ loss function ]]--
local crit = nn.ClassNLLCriterion()
-- target is also seqlen x batchsize.
local targetmodule = nn.SplitTable(1)
if opt.cuda then
   targetmodule = nn.Sequential()
      :add(nn.Convert())
      :add(targetmodule)
end
local criterion = nn.SequencerCriterion(crit)

--[[ CUDA ]]--
if opt.cuda then
   lm:cuda()
   criterion:cuda()
   targetmodule:cuda()
end

--[[ experiment log ]]--
-- is saved to file every time a new validation minima is found
local xplog = {}
xplog.opt = opt -- save all hyper-parameters and such
xplog.dataset = 'simulated'
xplog.model = nn.Serial(lm)
xplog.model:mediumSerial()
xplog.criterion = criterion
xplog.targetmodule = targetmodule
-- keep a log of NLL for each epoch
xplog.trainppl = {}
xplog.valppl = {}
-- will be used for early-stopping
xplog.minvalppl = 99999999
xplog.epoch = 0
local ntrial = 0
paths.mkdir(opt.savepath)

local epoch = 1
opt.lr = opt.startlr
opt.trainsize = 875 -- simulate 175 batches with batch size 5
opt.validsize = 150 -- simulate 30 batches with size 5
local params, gradParams = lm:getParameters()
while opt.maxepoch <= 0 or epoch <= opt.maxepoch do
   print("")
   print("Epoch #"..epoch.." :")

   -- 1. training
   local a = torch.Timer()
   lm:training()
   local sumErr = 0

   local batch = getTrainData(opt)
   local i = 1
   while batch ~= nil do
      print(string.format('%d\'th batch data min is %.08f max is %.08f', i, torch.min(batch.data), torch.max(batch.data)))
      local targets = targetmodule:forward(batch.labels)
      local inputs = batch.data
      -- forward
      local outputs = lm:forward(inputs)
      --print(outputs:dim())
      local err = criterion:forward(outputs, targets)
      sumErr = sumErr + err
      -- backward
      local gradOutputs = criterion:backward(outputs, targets)
      lm:zeroGradParameters()
      lm:backward(inputs, gradOutputs)
      -- update
      if opt.cutoff > 0 then
         local norm = lm:gradParamClip(opt.cutoff) -- affects gradParams
         opt.meanNorm = opt.meanNorm and (opt.meanNorm*0.9 + norm*0.1) or norm
      end
      lm:updateGradParameters(opt.momentum) -- affects gradParams -- from dpnn
      lm:updateParameters(opt.lr) -- affects params
      lm:maxParamNorm(opt.maxnormout) -- affects params

      if opt.progress then
         xlua.progress(math.min(i + opt.seqLength, opt.trainsize), opt.trainsize)
      end
      -- debug
      print(string.format('max of params is %.10f, min of params is %.10f, max grad is %.10f min grad is %.10f',
         torch.max(params), torch.min(params), torch.max(gradParams), torch.min(gradParams)))

      if i % 100 == 0 then
         collectgarbage()
      end

      batch = getTrainData(opt)
      i = i + 1
   end

   -- learning rate decay
   if opt.schedule then
      opt.lr = opt.schedule[epoch] or opt.lr
   else
      opt.lr = opt.lr + (opt.minlr - opt.startlr)/opt.saturate
   end
   opt.lr = math.max(opt.minlr, opt.lr)

   if not opt.silent then
      print("learning rate", opt.lr)
      if opt.meanNorm then
         print("mean gradParam norm", opt.meanNorm)
      end
   end

   if cutorch then cutorch.synchronize() end
   local speed = a:time().real/opt.trainsize
   print(string.format("Speed : %f sec/batch ", speed))

   local ppl = torch.exp(sumErr/opt.trainsize)
   print("Training PPL : "..ppl)
   xplog.trainppl[epoch] = ppl

   -- 2. cross-validation
   lm:evaluate()
   local sumErr = 0
   local vBatch = getValData(opt)
   while vBatch ~= nil do
      local targets = targetmodule:forward(vBatch.labels)
      local inputs = vBatch.data
      local outputs = lm:forward(inputs)
      local err = criterion:forward(outputs, targets)
      sumErr = sumErr + err

      vBatch = getValData(opt)
   end

   local ppl = torch.exp(sumErr/opt.validsize)
   -- Note :
   -- Perplexity = exp( sum ( NLL ) / #w)
   -- Bits Per Word = log2(Perplexity)
   -- Bits per Char = BPW * (#w / #c)
   print("Validation PPL : "..ppl)

   xplog.valppl[epoch] = ppl
   ntrial = ntrial + 1

   -- early-stopping
   if ppl < xplog.minvalppl then
      -- save best version of model
      xplog.minvalppl = ppl
      xplog.epoch = epoch
      local filename = paths.concat(opt.savepath, opt.id..'.t7')
      print("Found new minima. Saving to "..filename)
      torch.save(filename, xplog)
      ntrial = 0
   elseif ntrial >= opt.earlystop then
      print("No new minima found after "..ntrial.." epochs.")
      print("Stopping experiment.")
      break
   end

   collectgarbage()
   epoch = epoch + 1
end

print("Evaluate model using : ")
print("th scripts/evaluate-rnnlm.lua --xplogpath "..paths.concat(opt.savepath, opt.id..'.t7')..(opt.cuda and ' --cuda' or ''))
```

I have spent quite a long time trying to fix this but still failed. Can someone help me figure out this problem? Thanks a lot.

felixzfx commented 8 years ago

Sorry guys, this error may have been caused by my Eclipse LDT tool. I switched to ZeroBrane Studio and the error is gone. My god!!

nicholas-leonard commented 8 years ago

@felixzfx Wow that was quite the error :)