Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Feedback request on "Simple LSTM" code / tutorial #92

Open hsheil opened 8 years ago

hsheil commented 8 years ago

Hi

I'm trying to document the simplest LSTM example possible. By "simplest", I mean the fewest lines of code combined with minimal use of libraries that hide away the details. Once this simple case is working well, the plan is to add in libraries like dp to show what each library provides, e.g. dp.Experiment, and to do this in a progressive way so that the reader can roll forwards / backwards between versions of the code to build understanding. So:

The code in the gist below works (loss reduces as the model trains), but with some caveats which I've documented as inline comments, numbered I0 through I6. I4 is the biggest issue I'd like to resolve.

I'd appreciate any feedback on the specific items in the code (esp. I4) and also any other comments on the code! When this code is good enough, the plan is to publish it as a tutorial of sorts on the rnn package with worked examples that progressively become more complex / advanced - feeding into other issue requests in this repo for more examples. Thanks in advance!

Here's the gist: https://gist.github.com/hsheil/54c0e83d4666db5081df

nicholas-leonard commented 8 years ago

Hi @hsheil. I like where this is going. I am reproducing your code here :

require 'rnn'

function build_network(inputSize, hiddenSize, outputSize)
   -- I1: add in a dropout layer
   rnn = nn.Sequential() 
   :add(nn.Sequencer(nn.Linear(inputSize, hiddenSize))) 
   :add(nn.Sequencer(nn.LSTM(hiddenSize, hiddenSize)))
   :add(nn.Sequencer(nn.LSTM(hiddenSize, hiddenSize))) 
   :add(nn.Sequencer(nn.Linear(hiddenSize, outputSize))) 
   :add(nn.Sequencer(nn.LogSoftMax()))
   -- I1: Adding this line makes the loss oscillate a lot more during training, when according to 
   -- http://arxiv.org/abs/1409.2329 this should *help* model performance 
  -- A1: initialization often depends on each dataset. 
   --rnn:getParameters():uniform(-0.1, 0.1)
   return rnn
end

-- Keep the input layer small so the model trains / converges quickly while training
local inputSize = 10
-- Most models seem to use 512 LSTM units in the hidden layers, so let's stick with this
local hiddenSize = 512
-- We want the network to classify the inputs using a one-hot representation of the outputs
local outputSize = 3

local rnn = build_network(inputSize, hiddenSize, outputSize)

--artificially small batchSize again for easy training
-- this can be the number of sequences to train on
local batchSize=5
-- the dataset size is the length of each of the batchSize sequences. 
local dsSize=20
-- number of classes
local nClass = 10

inputs = {}
targets = {}

-- Build up our inputs and targets
-- I2, add code so that if --cuda supplied, these become CudaTensors
-- using the opt.XXX and 'require cunn'
-- I3 - replace this random data set with something more meaningful / learnable
-- and with a realistic testing and validation set
for i = 1, dsSize do
   table.insert(inputs, torch.randn(batchSize,inputSize))
   table.insert(targets, torch.LongTensor(batchSize):random(1,nClass))
end

-- Decorate the regular nn Criterion with a SequencerCriterion as this simplifies training quite a bit
seqC = nn.SequencerCriterion(nn.ClassNLLCriterion())

local count = 0
local numEpochs=100
local start = torch.tic()

--Now let's train our network on the small, fake dataset we generated earlier
while numEpochs ~= 0 do
   rnn:training()
   count = count + 1
   out = rnn:forward(inputs) -- you are feeding batchSize sequences each of length dsSize steps
   err = seqC:forward(out, targets)
   gradOut = seqC:backward(out, targets)
   rnn:backward(inputs, gradOut)
   local currT = torch.toc(start)
   print('loss', err .. ' in ', currT .. ' s')
   --TODO, make this configurable / reduce over time as the model converges
   rnn:updateParameters(0.05)
   -- I5: Are these steps necessary? Seem to make no difference to convergence if called or not
   -- Perhaps they are being called by 
   rnn:zeroGradParameters()
   --   rnn:forget() -- don't need this as Sequencer handles it directly.
   start = torch.tic()
   -- I6: Make this configurable based on the convergence, so we keep going for bigger, more complex models until they are trained
   -- to an acceptable accuracy
   -- Also add in code to save out the model file to disk for evaluation / usage externally periodically
   numEpochs = numEpochs - 1
end

So I modified a couple of things. In its current (above) form, each epoch the rnn sees the entirety of the dataset. For a real dataset, you would need to add another inner loop where you split the batchSize x dsSize data into chunks of smaller batches of sequences: batchSize x seqLength where seqLength << dsSize.
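
To make that concrete, here is a minimal sketch of such an inner loop (my illustration, not part of the original comment; it reuses the inputs/targets tables above and assumes a seqLength of 5). Note that the Sequencer resets its hidden state between forward calls by default, so you would call rnn:remember() if the chunks should be treated as one continuous sequence:

-- split the dsSize-step sequence into chunks of seqLength steps, one update per chunk
local seqLength = 5 -- seqLength << dsSize
for offset = 1, dsSize - seqLength + 1, seqLength do
   local batchInputs, batchTargets = {}, {}
   for step = 0, seqLength - 1 do
      table.insert(batchInputs, inputs[offset + step])
      table.insert(batchTargets, targets[offset + step])
   end
   local out = rnn:forward(batchInputs)
   local err = seqC:forward(out, batchTargets)
   print('chunk loss', err)
   local gradOut = seqC:backward(out, batchTargets)
   rnn:backward(batchInputs, gradOut)
   rnn:updateParameters(0.05)
   rnn:zeroGradParameters()
end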

jundengdeng commented 8 years ago

Thanks for presenting a good RNN example. One thing is not clear to me. What is the form of the input data to RNNs? I think it should be seqLength x batchSize x inputSize - is that right?

jrich9999 commented 8 years ago

This is great, thank you. To make it complete, it would be nice to see validation and test sections added to this simple example.


hsheil commented 8 years ago

Hi @nicholas-leonard

Thanks for the reply and code. Apologies for the late reply - too much Christmas turkey and pudding!

I ran your example but it failed with "invalid arguments: DoubleTensor number DoubleTensor LongTensor expected arguments: DoubleTensor~1D [DoubleTensor~1D] [double] DoubleTensor~2D DoubleTensor~1D | DoubleTensor~1D double [DoubleTensor~1D] double DoubleTensor~2D DoubleTensor~1D stack traceback:"

Looking at line 45, I think you meant to populate the targets table instead of inputs twice? I changed it to targets, but then I get a different error: "lua/5.1/nn/ClassNLLCriterion.lua:46: Assertion `cur_target >= 0 && cur_target < n_classes' failed." Printing out the targets table, the ranges look fine, but I remember getting this error before when there was a mismatch between my target classes and output cells, so when I change line 33 so that nClass == outputSize == 3, it all works and trains nicely!

I'm documenting these two issues here for other readers, as the biggest thing slowing me down is (probably obvious) errors from not wiring things together correctly. I think a Validator class would be a big help: you would pass models + desired data into it and it would pass judgement on whether they will work or not (and, over time, give helpful hints on how to get them to work). Let me know if you agree with my changes.

For the interested reader, the following network works (make sure you get the latest torch and "luarocks install" all the latest rocks so you don't get weird behaviour from version mismatches - torch changes fast).

I'll keep working on the tutorial and update progress here..

require 'rnn'

function build_network(inputSize, hiddenSize, outputSize)
   -- I1: add in a dropout layer
   rnn = nn.Sequential() 
   :add(nn.Sequencer(nn.Linear(inputSize, hiddenSize))) 
   :add(nn.Sequencer(nn.LSTM(hiddenSize, hiddenSize)))
   :add(nn.Sequencer(nn.LSTM(hiddenSize, hiddenSize))) 
   :add(nn.Sequencer(nn.Linear(hiddenSize, outputSize))) 
   :add(nn.Sequencer(nn.LogSoftMax()))
   -- I1: Adding this line makes the loss oscillate a lot more during training, when according to 
   -- http://arxiv.org/abs/1409.2329 this should *help* model performance 
  -- A1: initialization often depends on each dataset. 
   --rnn:getParameters():uniform(-0.1, 0.1)
   return rnn
end

-- Keep the input layer small so the model trains / converges quickly while training
local inputSize = 10
-- Most models seem to use 512 LSTM units in the hidden layers, so let's stick with this
local hiddenSize = 512
-- We want the network to classify the inputs using a one-hot representation of the outputs
local outputSize = 3

local rnn = build_network(inputSize, hiddenSize, outputSize)

--artificially small batchSize again for easy training
-- this can be the number of sequences to train on
local batchSize=5
-- the dataset size is the length of each of the batchSize sequences. 
local dsSize=20
-- number of classes, needs to be the same as outputSize above
-- or we get the dreaded "ClassNLLCriterion.lua:46: Assertion `cur_target >= 0 && cur_target < n_classes' failed. "
local nClass = 3

inputs = {}
targets = {}

-- Build up our inputs and targets
-- I2, add code so that if --cuda supplied, these become CudaTensors
-- using the opt.XXX and 'require cunn'
-- I3 - replace this random data set with something more meaningful / learnable
-- and with a realistic testing and validation set
for i = 1, dsSize do
   table.insert(inputs, torch.randn(batchSize,inputSize))
   -- populate both tables to get ready for training
   table.insert(targets, torch.LongTensor(batchSize):random(1,nClass))
end

for key,value in pairs(targets) do print(value) end

-- Decorate the regular nn Criterion with a SequencerCriterion as this simplifies training quite a bit
seqC = nn.SequencerCriterion(nn.ClassNLLCriterion())

local count = 0
local numEpochs=100
local start = torch.tic()

--Now let's train our network on the small, fake dataset we generated earlier
while numEpochs ~= 0 do
   rnn:training()
   count = count + 1
   out = rnn:forward(inputs) -- you are feeding batchSize sequences each of length dsSize steps
   err = seqC:forward(out, targets)
   gradOut = seqC:backward(out, targets)
   rnn:backward(inputs, gradOut)
   local currT = torch.toc(start)
   print('loss', err .. ' in ', currT .. ' s')
   --TODO, make this configurable / reduce over time as the model converges
   rnn:updateParameters(0.05)
   -- I5: Are these steps necessary? Seem to make no difference to convergence if called or not
   -- Perhaps they are being called by 
   rnn:zeroGradParameters()
   --   rnn:forget() -- don't need this as Sequencer handles it directly.
   start = torch.tic()
   -- I6: Make this configurable based on the convergence, so we keep going for bigger, more complex models until they are trained
   -- to an acceptable accuracy
   -- Also add in code to save out the model file to disk for evaluation / usage externally periodically
   numEpochs = numEpochs - 1
end
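
For the I2 item above, here is a minimal sketch of what the --cuda branch might look like (my illustration only - it assumes cutorch/cunn are installed and that an opt.cuda flag has been parsed, neither of which is in the original script):

-- hypothetical --cuda branch: move the model, criterion and data to the GPU
if opt.cuda then
   require 'cunn'
   rnn:cuda()
   seqC:cuda()
   for i = 1, dsSize do
      inputs[i] = inputs[i]:cuda()
      targets[i] = targets[i]:cuda()
   end
end
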
jrich9999 commented 8 years ago

I used Humphrey's and Nick's code above with my EEG data to see what would happen. Here is a slightly modified version of the code:

I'm trying to get it to break out of training based on the validation data set. With an MLP the validation error starts high, but with the LSTM the validation error is very low at the beginning and ramps up as the generalization error increases. I have included two graphs, one going up to epoch 300 and one that stops around epoch 188 due to validation. The moving-mean calculation looks for the average of the last 20 epochs exceeding a threshold of the average of the last 120 epochs times 1.1.

I also removed one of the LSTM layers, since I was getting large spikes every 50-60 epochs.

Do I have something wrong here? Why is the evaluation error so low in the beginning?

Any comments are appreciated! - Thanks.. John

print(" - Training Classifier") while epoch < maxEpochs do

   rnn:remember(both)

   rnn:training()
   epoch = epoch + 1

   if(opt.debug) then
     print('     Epoch: ',epoch)
   end

   local start = torch.tic()

   --rnn:dropout(0.5)
   out = rnn:forward(inputTrn) -- feeding batchSize sequences each of

length dsSize steps errTrn = seqC:forward(out, targetTrn) gradOut = seqC:backward(out, targetTrn) rnn:backward(inputTrn, gradOut) local trnDuration = torch.toc(start) --TODO, make this configurable / reduce over time as the model converges

   rnn:updateParameters(learningRate) -- BPTT occurs
   -- I5: Are these steps necessary? Seem to make no difference to

convergence if called or not -- Perhaps they are being called by rnn:zeroGradParameters() --rnn:forget()

   -- Do evaluation
   rnn:evaluate()
   out2 = rnn:forward(inputVal) -- feeding batchSize sequences each of

length dsSize steps errVal = seqC:forward(out2, targetVal)

   if(opt.useValidation) then
   avg120,mvavg120 = mv_avg(errVal,mvavg120)
   avg20,mvavg20 = mv_avg(errVal,mvavg20)
   if(epoch > minEpochs) then
     if(avg20 > avg120*opt.delta20to120) then
       print("   - Moving Avg Validation Break at Epoch:",

epoch,'',avg20,'',avg120*opt.delta20to120) break end end end

   if(errTrn < 10 and errTrn > 8) then
     print(     epoch, 'TrnErr:', errTrn, ' ValErr:', errVal, '

TrnTime:', trnDuration ) end

   -- I6: Make this configurable based on the convergence, so we keep

going for bigger, more complex models until they are trained -- to an acceptable accuracy -- Also add in code to save out the model file to disk for evaluation / usage externally periodically epoc_cnt[epoch] = epoch errorTrn[epoch] = errTrn errorVal[epoch] = errVal if(epoch % 10 == 0) then print(' - epoch:', epoch) gfile:write(logfile,',',epoch,',',errTrn,',',errVal,'\n') rfile:write(logfile,',',epoch,',',errTrn,',',errVal,'\n') end end print(' - Last Epoch:', epoch)
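
The mv_avg helper called above isn't shown anywhere in the thread; purely as an illustration (this is a guess, not John's actual code), it could be a fixed-length moving window whose length is carried by the state table, e.g. mvavg20 = {maxLen = 20}:

function mv_avg(value, window)
   -- append the newest value and drop the oldest once the window is full
   table.insert(window, value)
   if #window > window.maxLen then
      table.remove(window, 1)
   end
   local sum = 0
   for _, v in ipairs(window) do sum = sum + v end
   return sum / #window, window
end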


nicholas-leonard commented 8 years ago

@jundeng86 yup : seqLength x batchSize x inputSize. The first dimension indexes a table, the remainder, a tensor. Such that there are seqLength tensors of size batchSize x inputSize.
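
To illustrate that layout (my sketch, not part of the original reply), the dataset is simply a Lua table with seqLength entries, each entry being a batchSize x inputSize tensor:

-- seqLength x batchSize x inputSize, i.e. a table of seqLength tensors
local seqLength, batchSize, inputSize = 4, 5, 10
local input = {}
for t = 1, seqLength do
   input[t] = torch.randn(batchSize, inputSize)
end
-- input[t][b] is the inputSize-dim feature vector of sequence b at time-step t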

@hsheil You fixed my example : the second inputs should have indeed been targets, and outputSize = nClass. Your tutorial is looking really good. I like that it doesn't have any dp.Experiment and such. It is easy to understand.

@jrich9999 Your code is copy pasted here to provide better syntax highlighting :

print(" - Training Classifier")
  while epoch < maxEpochs do

       rnn:remember(both)

       rnn:training()
       epoch = epoch + 1

       if(opt.debug) then
         print('     Epoch: ',epoch)
       end

       local start = torch.tic()

       --rnn:dropout(0.5)
       out = rnn:forward(inputTrn) -- feeding batchSize sequences each of length dsSize steps
       errTrn = seqC:forward(out, targetTrn)
       gradOut = seqC:backward(out, targetTrn)
       rnn:backward(inputTrn, gradOut)
       local trnDuration = torch.toc(start)
       --TODO, make this configurable / reduce over time as the model converges

       rnn:updateParameters(learningRate) -- BPTT occurs
       -- I5: Are these steps necessary? Seem to make no difference to convergence if called or not
       -- Perhaps they are being called by
       rnn:zeroGradParameters()
       --rnn:forget()

       -- Do evaluation
       rnn:evaluate()
       out2 = rnn:forward(inputVal) -- feeding batchSize sequences each of length dsSize steps
       errVal = seqC:forward(out2, targetVal)

       if(opt.useValidation) then
         avg120,mvavg120 = mv_avg(errVal,mvavg120)
         avg20,mvavg20 = mv_avg(errVal,mvavg20)
         if(epoch > minEpochs) then
           if(avg20 > avg120*opt.delta20to120) then
             print("   - Moving Avg Validation Break at Epoch:", epoch,'',avg20,'',avg120*opt.delta20to120)
             break
           end
         end
       end

       if(errTrn < 10 and errTrn > 8) then
         print(epoch, 'TrnErr:', errTrn, ' ValErr:', errVal, ' TrnTime:', trnDuration)
       end

       -- I6: Make this configurable based on the convergence, so we keep going for bigger, more complex models
       -- until they are trained to an acceptable accuracy
       -- Also add in code to save out the model file to disk for evaluation / usage externally periodically
       epoc_cnt[epoch] = epoch
       errorTrn[epoch] = errTrn
       errorVal[epoch] = errVal
       if(epoch % 10 == 0) then
         print('   - epoch:', epoch)
         gfile:write(logfile,',',epoch,',',errTrn,',',errVal,'\n')
         rfile:write(logfile,',',epoch,',',errTrn,',',errVal,'\n')
       end
  end
  print('   - Last Epoch:', epoch)

Usually, for cross-validation (early-stopping), we train the model on the entire training set and then evaluate it on the entire validation set. Because the dataset of this example has only one batch, this is also what is happening here. However, I think we should modify the example so that the dataset has more than one batch per epoch. In this way, the training and validation loops can be added within the loop over epochs.
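
A rough outline of that structure (my sketch only; trainBatches, validBatches, maxEpochs and learningRate are hypothetical names standing in for real data loading and options):

for epoch = 1, maxEpochs do
   rnn:training()
   for _, batch in ipairs(trainBatches) do -- batch.input is a table of seqLength tensors
      local out = rnn:forward(batch.input)
      seqC:forward(out, batch.target)
      rnn:zeroGradParameters()
      rnn:backward(batch.input, seqC:backward(out, batch.target))
      rnn:updateParameters(learningRate)
   end

   rnn:evaluate()
   local valErr = 0
   for _, batch in ipairs(validBatches) do
      local out = rnn:forward(batch.input)
      valErr = valErr + seqC:forward(out, batch.target)
   end
   valErr = valErr / #validBatches
   -- early-stopping decision based on valErr goes here
end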

jrich9999 commented 8 years ago

@hsheil https://github.com/hsheil @jundeng86 https://github.com/jundeng86 @nicholas-leonard I am attempting to use real data and I'm confused about how to get it into the correct batch form for Humphrey's/Nick's LSTM example. I think it's important for others to understand how the data is fed into the example. I am rather thick-headed, so bear with me, please... I'll walk through my assumptions below - please comment where I am off base. Thank you so much!

So your comment: seqLength x batchSize x inputSize. The first dimension indexes a table, the remainder, a tensor. Such that there are seqLength tensors of size batchSize x inputSize.

For my case, the EEG data I have is :

If I want the LSTM to learn on 1 sensor "S" for all patients "numPatients", each with their own class label "CL":

I want to set up a Network with: local rnn = build_network(inputSize, hiddenSize, outputSize) -- inputSize=6, hiddenLayerSize=42, outputSize=2 (nClassLabels)

If I want my LSTM to use Batch Learning:

I am guessing that:

  1. The batchSize x inputSize should equal the hiddenLayerSize?
  2. The seqLength is simply "numPatients"

-- Setting up the data in Torch for your LSTM example for "7 BATCHES of 6 SAMPLES EACH"
dataLength = batchSize * inputSize
for i = 1, numPatients do
   local inp_tmp = torch.DoubleTensor(batchSize, inputSize)
   local s = inp_tmp:storage()
   for j = 1, dataLength do
     s[j] = sdata2[trnIndex[i]][j]
   end
   table.insert(inputTrn, inp_tmp) -- Appends Table "numPatients" times with each Tensor "batchSize x inputSize"

   local tar_tmp = torch.LongTensor(batchSize)
   tar_tmp:fill(sclass[trnIndex[i]])
   table.insert(targetTrn, tar_tmp) -- Appends Table "numPatients" times with a batchSize (row) by 1 (col) Tensor full of the class label "CL"
end


nicholas-leonard commented 8 years ago

@jrich9999 Nice concrete example with the EEG data. How would I build the seqLength x batchSize x inputSize batch? Because your sample has 42 time-steps, your seqLength = 42. So your input table will have 42 elements.

The batchSize is arbitrary. Something like 8,16,32 should work well. Larger batchSize means more parallelization on GPU, but slower per-example convergence. There is a sweet spot. Don't look too hard for it.

The inputSize is the dimensionality of your input sensor data. So if you are using one input sensor, and that sensor outputs a vector of 6 dimensions at each time-step, then inputSize = 6. On the other hand, if you have 19 EEG sensors, each outputting a scalar value at each time-step, then inputSize = 19.

As for the hiddenLayerSize, this determines how much modeling capacity you allocate to the network. So higher means that you can model more complex functions, but it also means the network is more prone to overfitting the training data. This is a hyper-parameter which you will need to play with. Trying values of 32, 64, 128, 256, ..., you should choose the hiddenLayerSize that gives the best performance on the validation set.

Your outputSize is good.

Okay, so you could organize your data as an input tensor of size seqLength x numPatients x inputSize and a target tensor of size numPatients. To get a batch, you can use input:narrow(2, n, batchSize) and target:narrow(1, n, batchSize).
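
A quick sketch of that batching (my illustration, with made-up sizes):

-- input is seqLength x numPatients x inputSize, target is numPatients
local seqLength, numPatients, inputSize, batchSize = 42, 70, 6, 8
local input = torch.randn(seqLength, numPatients, inputSize)
local target = torch.LongTensor(numPatients):random(1, 2)

local n = 1 -- index of the first example in this batch
local batchInput = input:narrow(2, n, batchSize)   -- seqLength x batchSize x inputSize
local batchTarget = target:narrow(1, n, batchSize) -- batchSize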

rracinskij commented 8 years ago

Thank you for a great example. One issue is a bit confusing - what should the output tensor ideally look like? I changed the dataset to three sequences, [0.1, 0.2 ... 1.0], [1.0, 1.1 ... 2.0], [-1, -0.9 ... -0.1], and labeled them as [1,2,3]. The output tensor after 100 epochs looks something like:

-0.0830 -2.8453 -3.8407
-3.7361 -0.0313 -4.9712
-3.2264 -3.5956 -0.0695
-0.0597 -3.2954 -3.8717
-3.3373 -0.0452 -4.7577

So it is near zero at the target label and negative otherwise. Does it look correct? Thank you.

hughperkins commented 8 years ago

@rracinskij: remember the output is the log of the 'real' output. If you take the exp of your values, e.g. of '-0.0830 -2.8453 -3.8407', you get:

0.92 0.06 0.02

... which I imagine is more in line with your expectations?
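
In code form, a trivial check:

-- LogSoftMax outputs log-probabilities; exp() recovers the probabilities
local logProbs = torch.Tensor({-0.0830, -2.8453, -3.8407})
print(torch.exp(logProbs)) -- roughly 0.92, 0.06, 0.02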

rracinskij commented 8 years ago

@hughperkins: Indeed it is :) Thanks a lot!

kmul00 commented 8 years ago

With respect to the above examples, I have a question.

Let's say I have a dataset of size seq_length X data_size X feature_size, where data_size is my number of training examples and data_size >> batch_size. For clarification, let's say:

seq_length = 10
data_size = 50,000
feature_size = 200
batch_size = 32
train_data = torch.randn(seq_length, data_size, feature_size)
train_target = torch.randn(seq_length, data_size, feature_size)

Now, is the following code correct for training the LSTM model on this training set?

for i = 1, num_epochs do

  rnn:training()
  inputs = torch.Tensor(seq_length, batch_size, feature_size)
  targets = torch.Tensor(seq_length, batch_size, feature_size)

  for j = 1, seq_length do
    for k= 1, data_size, batch_size do
      inputs[{{ j }}] = train_data[{{ j }, { k, k+batch_size - 1 }}]
      targets[{{ j }}] = train_target[{{ j }, { k, k+batch_size - 1 }}]
    end
  end

  out = rnn:forward(inputs)
  err = seqC:forward(out, targets)
  print('loss is ', err)
  gradOut = seqC:backward(out, targets)
  rnn:backward(inputs, gradOut)
  rnn:updateParameters(0.05)
  rnn:zeroGradParameters()
end

If yes, then can someone point out where exactly we need to use rnn:backwardThroughTime() and rnn:forget()?

Thanks in advance.

nicholas-leonard commented 8 years ago

@hsheil Depends what your rnn looks like. Can you print it here?

hsheil commented 8 years ago

Hi @nicholas-leonard I think that question came from someone else in the thread, it wasn't me.

hsheil commented 8 years ago

Hi @nicholas-leonard Quick update: I'm making reasonable progress on the tutorial + code. It's going to be a three-parter now:

  1. Section 1 - setting out the groundwork and building the base / core code for future posts.
  2. Section 2 - Moving to use a real-world data set and to train on the GPU.
  3. Section 3 - Performance tuning - how to improve the performance of the model by applying tried and tested optimisations / tricks of the trade when (a) engineering features / input layers for neural networks and (b) tuning the model itself.

Post 1 should be ready tomorrow evening some time. Would be great if you could review it for technical accuracy. I'm using the RecSys 2015 challenge data set - it will be interesting to compare LSTM vs Vowpal Wabbit performance (very early / un-tuned VW performance is documented here: http://humphreysheil.com/blog/a-quick-run-through-vowpal-wabbit).

Let me know what you think.

hsheil commented 8 years ago

Hi @nicholas-leonard

Part one (code and commentary) is now ready for review:

https://github.com/hsheil/rnn-examples https://github.com/hsheil/rnn-examples/blob/master/example_part1.lua https://github.com/hsheil/rnn-examples/blob/master/part1.md

I've left the code here for now until it's signed off.

I'm happy-ish with the code now in that I think it is naive but correct. It is well-behaved in minimising loss even when I scale the dsSize up to 20,000 or so.

I still don't think I fully understand the seqLength so feel free to critique how I'm using seqLength+batchSize to chunk / index the full epoch - this will make more sense when the example moves to a real data set.

My intuition is that there is no point in me constructing a validation / test set using the torch.randn(batchSize,inputSize) trick as performance will be bad, so I'm putting that code into part two with the real data set.

All feedback appreciated.

nicholas-leonard commented 8 years ago

@hsheil This looks awesome! I like the detailed analysis in part1.md. I submitted a PR with small fixes : https://github.com/hsheil/rnn-examples/pull/1 . Can't wait to see it evolve to the real dataset you want to use. Let me know if you need more help. Once your post is ready I will definitely link it prominently on our README.md. Also, make sure you update dpnn and rnn as a major bug was fixed.

jrich9999 commented 8 years ago

@hsheil Nice!!! Exactly what I needed to move forward. I've been away for a bit, delayed holidays, work, etc. I haven't had much time to resolve getting my validation/test set to work with my data set. But I'm back now, Thanks, this really is good stuff!

rracinskij commented 8 years ago

@hsheil Thank you for a helpful example. Could you please explain why you create batchInputs and batchTargets of size 8 (batchSize + seqLength - 2)?

hsheil commented 8 years ago

@nicholas-leonard Cool, will do. Thanks for the PR, digesting it now :) The CUDA code is done and I'm seeing a 10x speed-up over the CPU on the fake dataset, which is a nice illustration. If you like, the posts and code can all go into this repo as a tutorial of sorts - I set up rnn-examples so I could push and pull easily between my dev and CUDA machines while I'm coding.

hsheil commented 8 years ago

@rracinskij the batchSize and seqLength are currently set in proportion to dsSize - this code will be tightened up in the next iteration to not require that precondition (and to add in the separate validation and testing sets). That for loop condition ensures that we present all of dsSize to the network for training - you can verify that by adding in a print() on line 99 to see that each loop gets the right chunk of dsSize for the range [offset, offset+i].

nicholas-leonard commented 8 years ago

@hsheil No need to include in rnn. I like that this is its own separate repository that I don't need to maintain! I hope you still intend to add your real-world e-commerce dataset to the example. That would be awesome.

hsheil commented 8 years ago

@nicholas-leonard Hi, working on it. I was just tuning the parallel Vowpal Wabbit impl as I need to compare LSTM to a good baseline (VW) as part of my research path. ETA on the next instalment is Sunday night UK time :)

kmul00 commented 8 years ago

Going through @hsheil 's example, I have a few questions.

  1. According to the example, I have dsSize examples, each made up of seqLength events, and each event is of length inputSize (i.e. my feature length). While preparing the toy dataset, my inputs size is dsSize X batchSize X inputSize (here). Shouldn't it be dsSize X seqLength X inputSize?
  2. Why is my training for loop (here) stepping by batchSize + seqLength at each iteration? From what I understand, inputs[i] will give me the ith batchSize X inputSize chunk of data (since in the example my inputs is dsSize X batchSize X inputSize). So why a step width of batchSize + seqLength?
  3. By choosing inputs[offset+i] (here), where offset is initialized to 1 and i is initialized to 2, it always starts from inputs[3]. Thus inputs[1] and inputs[2] never get selected. Is that right?
  4. This one was asked already by someone, but I couldn't understand the reply. Why create batchInputs and batchTargets of size 8 (batchSize + seqLength - 2)?

Overall, I am still finding it a bit difficult to grasp how the input should be presented for batchwise training on a GPU. Maybe there is a flaw in my basic understanding.

My understanding is that, for batch training, the input should be presented as TimeStep X BatchLength X FeatureLength, i.e. my first batch should contain my first timestep's data. Going through that, my LSTM will update its states, and then in the second batch my second timestep's data will be presented. Likewise, after my LSTM has seen all the TimeStep data for BatchSize, it will output BatchSize outputs (for simplicity, let's consider it a many-to-one LSTM, i.e. each sequence has only one associated label).

But it is hard to verify it from the example proposed above.

Any help regarding what the input format should be like, will be highly appreciated.

Thanks in advance.

hsheil commented 8 years ago

Hi @nicholas-leonard and the other folks who posted code on this issue: part two is ready (a real-world dataset) - it addresses a faux-pas in part one whereby I was session-oriented and not sequence step-oriented in creating batches for the network. All feedback appreciated!

Write-up: https://github.com/hsheil/rnn-examples/blob/master/lstm-2.md

Code: https://github.com/hsheil/rnn-examples/tree/master/part2

jrich9999 commented 8 years ago

Hi Humphrey @hsheil ,

Great work! Really looks good. Love what you are researching. I am guessing that I may not have the latest Torch files to run your latest example 2. Since I am still learning about Torch, is there an easy way to update it? I run xxx/torch/update.sh, which git pulls from master, but I'm not sure that is all I need to do. Did you pull in specific Torch fixes that are outside of master? Do I need to recompile anything, since I will attempt to use CUDA? Sorry for some of these basic questions...

Whew, example 2 is using a lot of data.. :)

Thanks, John

/home/john-1404-64/torch/install/bin/luajit: .../john-1404-64/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: attempt to call field 'Sigmoid_updateOutput' (a nil value)
stack traceback:
  .../john-1404-64/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  ...n-1404-64/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  ...n-1404-64/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  ...n-1404-64/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  /home/john-1404-64/torch/install/share/lua/5.1/rnn/LSTM.lua:162: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
  ...ohn-1404-64/torch/install/share/lua/5.1/rnn/Recursor.lua:24: in function 'updateOutput'
  ...hn-1404-64/torch/install/share/lua/5.1/rnn/Sequencer.lua:47: in function 'forward'
  main.lua:155: in main chunk
  [C]: in function 'dofile'
  ...4-64/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670

nicholas-leonard commented 8 years ago

@hsheil Not sure I understand what the inputs and outputs are for this section : https://github.com/hsheil/rnn-examples/blob/master/lstm-2.md#feature-design . Maybe you could clarify where you get 194 from that table.

hsheil commented 8 years ago

Hi @nicholas-leonard sure thing. The 194 is made up of 12 (months) + 31 (days) + 7 (day of week) + 24 (hour of day) + 60 (minute of hour) + 60 (second of minute) = 194, all encoded in OneHot format (a rough sketch of this encoding follows the list below). A couple of comments on this:

  1. This is a core input design decision that is "up for grabs", i.e. the alternative is to let the flow of events themselves in each sequence represent the time. I'll code this approach too and see which one performs better.
  2. I'm passing in time, category ID and some basic aggregate scores to LSTM in the current code - the big one to add is the item IDs and pricing information. There are ~53k unique item IDs so I may well need to investigate clustering / some form of information compression here, but as per item 1 on this list, I wanted to code both approaches (present all 53k, present a compressed version of same) as I'd like to understand which works better with LSTM and why specifically.
  3. Using a bit vector doesn't seem to improve accuracy over actually using a smaller input layer and using a range of integers, i.e. just 6 input units as follows: {1-12,1-31,1-7,1-24,1-60,1-60}. That's interesting as I expected the bit vector to make it easier for LSTM to learn but it doesn't seem to..
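
As a rough illustration of the 194-wide encoding described above (my own sketch, not the actual rnn-examples feature code):

-- one-hot encode (month, day, day-of-week, hour, minute, second)
-- into a single 12 + 31 + 7 + 24 + 60 + 60 = 194 dimensional vector
local sizes = {12, 31, 7, 24, 60, 60}
function encodeTime(month, day, dow, hour, min, sec)
   local values = {month, day, dow, hour + 1, min + 1, sec + 1} -- shift 0-based fields to 1-based
   local v = torch.zeros(194)
   local offset = 0
   for i, size in ipairs(sizes) do
      v[offset + values[i]] = 1
      offset = offset + size
   end
   return v
end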

Hope this all makes sense. I'm going to push evaluator.lua soon which will give a like for like comparison between LSTM and Vowpal Wabbit which will be interesting.

Let me know your thoughts on the first point!

nicholas-leonard commented 8 years ago

@hsheil Sounds good. Could you add that breakdown of the 194 to the docs for clarity? Thanks.

Also, would love to see more images of stuff in tutorial : learning curves, etc. Maybe some concrete examples as what gets predicted, e.g. given A, B, and C, users are most likely to purchase/click on D where A,B,C and D are interesting and human-relatable. Of course, you will need to build your evaluate script first.

hsheil commented 8 years ago

Hi @nicholas-leonard ok, will do. The code has evolved a lot since the original approach - implementing MaskZero was a really important step in improving the training phase and I need to update the docs to reflect this. I've been heads-down on another project but plan to work on this code and docs again this weekend so will ping you when it's ready for another look.

hsheil commented 8 years ago

@nicholas-leonard PS the evaluate script is done and pushed (evaluator.lua in the part2 sub-directory) - it exposed some problems in the original impl that I've been fixing, hence the flurry of related commits.

nicholas-leonard commented 8 years ago

@hsheil Yeah, it's looking good. Can't wait to see the final tutorial and code.

hsheil commented 8 years ago

@nicholas-leonard Apologies for the delay in this Nicholas. I've been working on the code on and off and my supervisor also wants me to build a side-by-side impl using TensorFlow :)

The last push I did (https://github.com/hsheil/rnn-examples/commit/cb30a484d4b5577c346b9908811452beaa2bfd97) has a lot of improvements, resulting in an F1 score of 0.990 when tuned using Spearmint (and on the validation set to boot). It turns out that calling model:remember('both') resulted in a very significant perf improvement. I think there's a glitch in the docs on this that I'll submit a small PR for.
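
For readers who haven't used it, a minimal sketch of what remember('both') does (assuming the usual Sequencer-wrapped setup from earlier in this thread):

require 'rnn'
-- remember('both') tells the Sequencer to keep the LSTM hidden state across
-- forward calls in both training and evaluation mode, instead of resetting it each call
local seq = nn.Sequencer(nn.LSTM(10, 10))
seq:remember('both')
-- ... feed consecutive chunks of the same long sequence to seq:forward ...
seq:forget() -- reset the state explicitly when a genuinely new sequence starts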

Next I'm going to revamp the part2 MD file to reflect all of the changes and then move onto documenting part3 (effect of using Spearmint to tune hyper params - already coded just not documented) - probably on the flight to GTC!

Once that's done, I'd actually like to plug dp back in and see what benefit it brings - if the reader comes on the journey through parts 1, 2 and 3, then dp should be pretty accessible at the end of part 3.

H

nicholas-leonard commented 8 years ago

@hsheil Let me know what you think about TensorFlow w.r.t. Torch. Good to see the project is advancing. I know these things can take time :) I just merged your PR. Was a valid point. I will see you at GTC. Also, I recommend not using dp as I am trying to move away from it, focusing instead on rnn, dpnn, dataload. Can't wait to see the final readme with all the parts listed.

hsheil commented 8 years ago

@nicholas-leonard Ok, good to know RE: dp, it was dp.Experiment that looked most interesting to try and leverage / use. Look forward to catching up!

beldaz commented 6 years ago

@hsheil I stumbled upon this thread while looking for help on a similar problem. However, I see that your writeup repository is no longer visible. Any chance you'd make it available, or point to a new location?

hsheil commented 6 years ago

Hi @beldaz - I moved over to PyTorch quite a while ago, so I stopped working on that repo. The code I wrote is very specific to the dataset I was using, while I think Torch just needed (at the time) some simpler LSTM examples to go with the docs - but I think that was added in the examples repo; @nicholas-leonard et al did quite a bit of work on that. I thought about closing this issue but figured the thread might be useful for some.