cazala / synaptic

architecture-free neural network library for node.js and the browser
http://caza.la/synaptic

Output/Hidden to Hidden gatings in LSTM-RNN #30

Closed. Sleepwalking closed this issue 9 years ago

Sleepwalking commented 9 years ago

According to Felix Gers' dissertation [1], the gates on memory cells have connections not only from the input layer but also from the memory cells themselves. However, Architect currently only projects connections from the input layer to the input/forget/output gates for LSTMs (peepholes aside),

      inputLayer.project(inputGate);
      inputLayer.project(forgetGate);
      inputLayer.project(outputGate);

which means that the neural network remembers/forgets information based only on its current inputs, which could be disastrous for tasks that require long-term memory. In some other applications the memory cells are even gated by the outputs. Besides gating, first-order connections from the output layer to the hidden layers also appear in some of the literature.

This observation offers some insight into why the Wikipedia language modeling task doesn't give promising results even after hours of training. In an informal test with hidden-layer-to-gates connections enabled, the network was able to reproduce text such as "the of the of the of the of the of ..." on its own. I also trained an LSTM with 70 memory cells on some short paragraphs, and the network can exactly reproduce two or three sentences on its own.

I'm going to run further tests comparing hidden-to-gates connected LSTMs with input-to-gates connected ones.

[1] Gers, Felix. Long Short-Term Memory in Recurrent Neural Networks. PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2001.

cazala commented 9 years ago

@Sleepwalking what do you mean exactly when you say that the memory cells should also be gated by themselves? Currently the memory cells project a connection to themselves:

var self = memoryCell.project(memoryCell);

And this connection is gated by the forget gate:

forgetGate.gate(self, Layer.gateType.ONE_TO_ONE);

It's amazing that you got an LSTM to reproduce whole sentences. I'm very interested in this, but I still don't quite get the topology of the network you are using. Could you explain a little more or share some of the code you are using to connect the memory cells?

Sleepwalking commented 9 years ago

Hello @cazala,

To be precise, the gates of the memory cells have first-order connections from the memory cells themselves:

memoryCell.project(forgetGate);
memoryCell.project(inputGate);
memoryCell.project(outputGate);

Sorry, I just discovered that the implementation used in the "informal test" didn't include hidden-to-gates connections (I made lots of local forks, they got mixed up, and I ran the code on the wrong fork...). This seems to contradict my hypothesis, because my 30-72-30 LSTM works unbelievably well on small language modeling tasks where the training set only contains a few lines of words. The network is able to remember up to one hundred steps of characters. Anyway, I'm running the test with hidden-to-gates connections now and I'll post the results as soon as possible.

Kanru Hua

cazala commented 9 years ago

I see, so all the memory cells project connections to all the input/forget/output gates? Because right now we have peephole connections projected from each memory cell to its corresponding gates:

// peepholes
memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);

It would be great if you could share the result of your tests when you finish them (:

Sleepwalking commented 9 years ago

Large networks are notoriously slow to train... I actually translated the generated code into C, compiled it with clang, wrapped it into MEX files, and trained & tested the network from GNU Octave. An LSTM with 30 binary inputs, 72 memory cells, and 30 binary outputs is trained to predict the next character based on the current character (and its memory). Input/output format: | 26: a-z | 1: period or comma | 1: space | 1: quote | 1: parentheses |. The training set is some super stupid nonsense ("pata thinks that kata wants the pala of the kaba, because kata hates the pala of the kaba so much. kaba wants to kill pata so that kaba can kill kata. this is so nice."), specially designed to see if the network can capture context, because pata, kata, kaba, pala... occur multiple times in the text. To work around initialization problems, the training set is repeated many times and joined together to give a 7000-character-long text,

pata thinks that kata wants the pala of the kaba, because kata hates the pala of the kaba so 
much. kaba wants to kill pata so that kaba can kill kata. this is so nice. pata thinks that kata 
wants the pala of the kaba, because kata hates the pala of the kaba so much. kaba wants 
to kill pata so that kaba can kill kata. this is so nice. pata thinks that kata wants the pala of 
the kaba, because kata hates the pala of the kaba so much. kaba wants to kill pata so that 
kaba can kill kata. this is so nice.
...

The LSTM is trained using stochastic gradient descent with an initial learning rate of 0.01. The learning rate decays by a factor of 0.99 each epoch until it falls below 0.005.
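For reference, here is a minimal JavaScript sketch of the 30-dimensional one-hot encoding and the next-character training pairs described above (the helper names are mine, and the actual experiment ran on exported C code driven from Octave, so treat this only as an illustration):

// encode a single character into the 30-unit binary input/output vector
function encodeChar(c) {
    var v = [];
    for (var i = 0; i < 30; i++) v.push(0);
    var code = c.charCodeAt(0);
    if (code >= 97 && code <= 122) v[code - 97] = 1; // 26: a-z
    else if (c === '.' || c === ',') v[26] = 1;      // 1: period or comma
    else if (c === ' ') v[27] = 1;                   // 1: space
    else if (c === '"' || c === "'") v[28] = 1;      // 1: quote
    else if (c === '(' || c === ')') v[29] = 1;      // 1: parentheses
    return v;
}

// build next-character prediction pairs: input is the current character,
// target is the character that follows it
function buildTrainingSet(text) {
    var set = [];
    for (var i = 0; i < text.length - 1; i++) {
        set.push({ input: encodeChar(text[i]), output: encodeChar(text[i + 1]) });
    }
    return set;
}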

Result Without Hidden-to-Gates Connections

Epoch 1, MSE = 0.830720
Epoch 2, MSE = 0.724659
Epoch 3, MSE = 0.692267
Epoch 4, MSE = 0.670466
Epoch 5, MSE = 0.647484
Epoch 6, MSE = 0.619622
Epoch 7, MSE = 0.589051
Epoch 8, MSE = 0.558995
Epoch 9, MSE = 0.530920
Epoch 10, MSE = 0.505305
Epoch 11, MSE = 0.482118
Epoch 12, MSE = 0.460689
Epoch 13, MSE = 0.440006
Epoch 14, MSE = 0.419213
Epoch 15, MSE = 0.397737
Epoch 16, MSE = 0.375261
Epoch 17, MSE = 0.351678
Epoch 18, MSE = 0.327267
Epoch 19, MSE = 0.302879
Epoch 20, MSE = 0.279280
Epoch 21, MSE = 0.256809
Epoch 22, MSE = 0.235538
Epoch 23, MSE = 0.215458
Epoch 24, MSE = 0.196564
Epoch 25, MSE = 0.178876
Epoch 26, MSE = 0.162826
Epoch 27, MSE = 0.149508
Epoch 28, MSE = 0.136092
Epoch 29, MSE = 0.123352
Epoch 30, MSE = 0.111187

Output sample:

t kill pata thant kill kata so this kata the kaba wans thes o much. kill pata so that kaba wans
to kill pata so that kaba can kill kata this so nice. kata. this inis s nice. pata this s o nice. kill
pata that kaba so nthe kaba kaba so chice. kaba so nthice. kata kaba wants thes the pala 
of the kaba so much. kaba can kill kata this so inis s nice. pata this so nice. kata thinks that 
kata so the kaba wants the pala of the kaba wants to kill kata so that kaba can kill kata so 
this so nice. pata thinks that kata wants the pala of the kaba wants the kaba so much. kaba
can kill kata so this so nice. pata thinks that kata wants the pala of the kaba wants to kill pata
so tha kaba can this kil so katat kata wants the pala of the kaba wants to kill kata so that kaba
can kill kata so this so nice. kill pata tha kata wants the kaba wanso this s o nice. pata thinks
that kata wants the pala of the kaba wants to kill kata so that kaba can kill kata so this so nice.
pata thinks that kata wants the pala of the kaba wants the kaba so much.
...

Result With Hidden-to-Gates Connections

Epoch 1, MSE = 0.841429
Epoch 2, MSE = 0.731850
Epoch 3, MSE = 0.691192
Epoch 4, MSE = 0.663000
Epoch 5, MSE = 0.657220
Epoch 6, MSE = 0.652227
Epoch 7, MSE = 0.628353
Epoch 8, MSE = 0.613436
Epoch 9, MSE = 0.583812
Epoch 10, MSE = 0.567494
Epoch 11, MSE = 0.529668
Epoch 12, MSE = 0.505728
Epoch 13, MSE = 0.470627
Epoch 14, MSE = 0.443538
Epoch 15, MSE = 0.419392
Epoch 16, MSE = 0.398090
Epoch 17, MSE = 0.371459
Epoch 18, MSE = 0.347105
Epoch 19, MSE = 0.325191
Epoch 20, MSE = 0.315286
Epoch 21, MSE = 0.299850
Epoch 22, MSE = 0.278598
Epoch 23, MSE = 0.266516
Epoch 24, MSE = 0.243895
Epoch 25, MSE = 0.251854
Epoch 26, MSE = 0.231868
Epoch 27, MSE = 0.225016
Epoch 28, MSE = 0.253627
Epoch 29, MSE = 0.217485
Epoch 30, MSE = 0.203307

Output sample:

this this so thinininise. ts the pata the pala of the kaba ba can kill kata so kila so this so thiso this
so thiso the pata the pala of the kaba ba can kill kata so kila so this so thiso this so thiso the pata
the pala of the kaba ban cats the kaba wants the kaba so the kaba can kill kata so kata this s ice.
pata this this so the paba of the kaba. can kill kata this s ice. pata this the kaba ban kila so ice.
killl pata that kaba so kanis this so thiso thise so that kata so that kaba can kill kata this s ice.
pata this the kaba ban cats the kaba wants the kaba so the kaba canks this the kaba ba can kill
kata so kill so ice. pata that kata wants the kaba wants the kaba of kill kata so this so thiso this
so the paba o cathis the kaba wants the kaba of the kaba wants thes o kaba cants the kaba
wants the kaba so the kaba can kill kata so kata this s ice. pata this this so the paba of the kaba.
can kill kata this s ice. pata this this so the paba of the kaba ba cants the killats s tha o kaba can
kill so ice. pata this this so thininise ts t ...

It's kind of weird that the one with hidden-to-gates connections performs even worse than the original one. This is probably because the training hasn't converged yet. It seems the learning rate is so high that the MSE bounces back after 25 epochs. Note that the latter network is 3 times larger than the one without hidden-to-gates connections; it gives a 23 MB C file, and training is extremely slow even when translated to C!

cazala commented 9 years ago

That's way better than the results I got on the Wikipedia language modeling task. You said you translated the generated code into C... you mean the hardcoded, optimized JavaScript code, right? Could you share the network you used to generate that optimized code? (The network itself.)

Or did you use the LSTM that's already in the Architect for the first result (without hidden-to-gates)?

Sleepwalking commented 9 years ago

Yeah, I used the one in Architect. For the other one with hidden-to-gates, I inserted three lines at https://github.com/cazala/synaptic/blob/master/src/architect.js#L108,

      // hidden-to-gates
      memoryCell.project(inputGate);
      memoryCell.project(outputGate);
      memoryCell.project(forgetGate);

Now I'm running more epochs to see if it works better. Each epoch takes half a minute... it would be better to test this on some simplified tasks.

The architecture in Gers' work is way more complicated than Architect.LSTM: hidden-to-gates, output-to-gates, etc.; he also puts an intermediate layer between the memory cells and the output layer, if I remember correctly. There are tons of variations on LSTM-RNNs, and it would take us a lot of time to try them all out.
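For the intermediate layer, a rough sketch with synaptic's Layer API might look like the following (purely hypothetical, not code from the dissertation or from Architect.LSTM; intermediateSize is a made-up parameter):

// route the memory cells through an extra layer before the output,
// instead of projecting memoryCell -> outputLayer directly
var intermediate = new Layer(intermediateSize);
memoryCell.project(intermediate);
intermediate.project(outputLayer);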

menduz commented 9 years ago

Did you export the synaptic network directly to C?


Sleepwalking commented 9 years ago

@menduz Yes. I'm actually working on a fork that directly outputs C code. Once it's fully functional I'll push it to GitHub.

Sleepwalking commented 9 years ago

MSE for the first 60 epochs. The learning rate starts at 0.01 and decays by a factor of 0.99 each epoch until it falls below 0.001. When the MSE 'explodes' (exceeds 1), the trainer automatically restores the network to the last epoch at which the MSE did not explode.

[MSE plot]

Output sample:

thinks that kaba wants to kill pata so that kaba can kanka tha kanka tha kanka tha kanka tha kanka
 tha kanka tha kanka kil kilata so tha ka bchabs kisa to kanil that ka ba can kill kata. this s is nice. 
pat kilata so tha ksant so kaninks thininil pats ps ps nkininill pat pantst t kat kisats panis tha to 
kabcs kil pata tha of kilata ts nkil pa soant tkil kata wanininis tinise. t psn pinis the pa kat kila. 
tkaninil tha ts ka this ka thika so isn pis nkice. panininininininisesnksinil patt pat kilan tst t t ka 
thankss kinill nka. that kil kata. this kis pant that pa of kis ps nkilinks pspa ths n panint tka. this 
ksa tha ois nin pantso tha tha kants kis kinice. pat kilata ts kil pants kininil pats ta that pa o kis 
ps nkininil pats sn ps nkice. panininill pat kilan tininininise. paninise. the t ka. tha the ka.a kil 
katants tha kants th kaba wants the pa o kila. thaso t kas nkice. pa il pants kil pants kilu pant 
kathis so n pants kininilt pat ksath kso tha ka bchat kaba wanto kill kata so that ka bcan kill kata. 
this s kanis inices patt kant ...

Although the MSE gets lower than that of the network without hidden-to-gates, the output is worse. It seems the network overfits after 47 epochs (it gets unstable and explodes regularly after that). We really need some smaller-scale tasks; it took me half an hour to train this thing.

cazala commented 9 years ago

Just a thought: you say that the network with hidden-to-gates connections produces a better output while the MSE is higher... I read one or two articles stating that for classification tasks like this one, it is sometimes more representative to use cross entropy as the cost function instead of MSE. There's an implementation included in the Trainer (Trainer.cost.CROSS_ENTROPY). You just pass it in the train() options, like this:

trainer.train(trainingSet,{
    rate: .1,
    iterations: 20000,
    error: .005,
    shuffle: true,
    log: 1000,
    cost: Trainer.cost.CROSS_ENTROPY
});

Maybe you could try to use that one and see if the hidden-to-gates network produces a lower error.

Oh, and +1 for the C code generator (:

menduz commented 9 years ago

If you have a draft of the C generator, we have some ideas for SIMD SSE optimizations. I think we can achieve an 8x speedup in some scenarios.


Sleepwalking commented 9 years ago

@cazala Thanks for pointing out the cross-entropy criterion :) But the Trainer seems to just measure the error and display it, rather than actually changing the error function used in backpropagation (and https://github.com/cazala/synaptic/blob/master/src/neuron.js#L119 has already implicitly adopted the MCE criterion). We've also seen that the output sample of the h2g network is bad, so I don't think the MCE of the h2g network would outperform the one without h2g.

@menduz Nice. I'm also aiming for some SIMD, because I saw things like this in the generated code,

 F[132948] = F[132020] * F[132920] * influences[27];
 F[132949] = F[132020] * F[132920] * influences[28];
 F[132950] = F[132020] * F[132920] * influences[29];
 F[132951] = F[65];
 F[132952] = F[132020] * F[132951] * influences[0];
 F[132953] = F[132020] * F[132951] * influences[1];
 F[132954] = F[132020] * F[132951] * influences[2];
 F[132955] = F[132020] * F[132951] * influences[3];
 F[132956] = F[132020] * F[132951] * influences[4];
 F[132957] = F[132020] * F[132951] * influences[5];
 F[132958] = F[132020] * F[132951] * influences[6];
 F[210] += F[49755] * F[239782];
 F[210] += F[49758] * F[239783];
 F[210] += F[49761] * F[239784];
 F[56820] += F[0] * F[210];
 F[210] = F[49674] * F[239786];
 F[210] += F[49677] * F[239787];
 F[210] += F[49680] * F[239788];
 F[210] += F[49683] * F[239789];
 F[210] += F[49686] * F[239790];
 F[210] += F[49689] * F[239791];
 F[210] += F[49692] * F[239792];

which can easily be optimized with packed dot products & element-wise products.

We also need to be cache-friendly. Performance may double for small networks because their binaries fit below 64 KB (the size of the L1 instruction cache). Basically I'm going to do it in the same manner as IB-EE-FFT. I suggest we open a new issue focused on optimization.

Sleepwalking commented 9 years ago

Fresh Test Result on the New Task

I found this task in Herbert Jaeger's A Tutorial on Training Recurrent Neural Networks. The task is about timing and was originally introduced for testing Echo State Networks. I tested two different LSTM configurations (with and without hidden-to-gates connections) on it.

Task Description

The task has 2 inputs and 1 output. Two figures from the tutorial describe the task well: [two screenshots from the tutorial]

LSTM Topology

var myLSTM = new Architect.LSTM(2, 15, 1);
myLSTM.layers.input.set({squash: Neuron.squash.IDENTITY});
myLSTM.layers.output.set({squash: Neuron.squash.IDENTITY});

The network with h2g connections has the following lines in place of peephole connections at https://github.com/cazala/synaptic/blob/master/src/architect.js#L110,

      // hidden-to-gates
      memoryCell.project(inputGate);
      memoryCell.project(outputGate);
      memoryCell.project(forgetGate);

Training Configuration

Both networks are exported to C and compiled with clang-3.5 (double precision, default optimization). Generation of the train & test sets, as well as training & testing, is implemented in Octave. The learning rate is 0.03 throughout training, and an epoch is defined as a full pass through the whole training set. Since there is only one output, I still measured MSE instead of MCE; measurements are taken on both the train set and the test set. Training runs for 500 epochs. The train set has 7000 samples and the test set has 1000 samples.
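For comparison, a rough sketch of what roughly equivalent settings would look like if the training were run directly through synaptic's Trainer instead of the exported C code (an approximation only, since the actual experiment uses per-sample SGD driven from Octave):

var trainer = new Trainer(myLSTM);
trainer.train(trainingSet, {      // trainingSet: [{input: [i1, i2], output: [o]}, ...]
    rate: 0.03,                   // constant learning rate, as above
    iterations: 500,              // one iteration = one pass over the training set
    error: 0.001,
    shuffle: false,               // keep the temporal order for this timing task
    cost: Trainer.cost.MSE
});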

Data set generation

Once the data set is randomly generated, the same data is used to train both networks.

Nt = 8000;                                  % total number of time steps to generate
Ntest = 1000;                               % number of steps held out for the test set
t = 1;
train_input  = zeros(Nt, 2);
train_target = zeros(Nt, 1);
while(t < Nt - 20)
    n = round(rand * 20);                   % random pulse length (0..20 steps)
    train_input(t, 1) = 1;                  % input 1: trigger spike at the pulse start
    train_input(t:t + n, 2) = n / 20;       % input 2: encodes the pulse duration
    train_target(t:t + n) = 0.5;            % target stays at 0.5 during the pulse, 0 elsewhere
    t += n;
    n = round(rand * 20);                   % random gap before the next trigger
    train_input(t + 1:t + n, 2) = train_input(t, 2);  % input 2 holds its value during the gap
    t += n;
end
Nt -= Ntest;                                % split: the last Ntest steps become the test set
test_input = train_input(Nt + 1:end, :);
test_target = train_target(Nt + 1:end);
train_input = train_input(1:Nt, :);
train_target = train_target(1:Nt);

Results

LSTM without h2g

To make sure it converges, I ran an extra 100 epochs.

[MSE plot]

Output sample: [plot]

LSTM with h2g

[MSE plot]

Output sample: [plot]

Comments

Sleepwalking commented 9 years ago

Sometimes the MSE converges within 100 epochs, while other times it doesn't converge even after 500 epochs.

This is because 15 hidden units are too many for our simple timing task. I tried again with a 2-5-1 topology and here are the new results.

A bug in the forward propagation was fixed before this test. The learning rate is set to 0.1 and decays by a factor of 0.99 each epoch. Other settings remain the same.

LSTM without H2G

[MSE plot]

Output sample: [plot]

LSTM with H2G

[MSE plot]

Output sample: [plot]

Comments

Sleepwalking commented 9 years ago

The following experiment shows how an LSTM with H2G can accomplish tasks that an LSTM without H2G cannot.

Timing Task 2

LSTM without H2G

The "and" in the title means the network still has parallel activation sequence. Sorry for this. I'm not native English speaker. 1st

Output sample: [plot]

LSTM with H2G

[MSE plot]

Output sample: [plot]

cazala commented 9 years ago

That's impressive. Do you think we should replace the current LSTM topology with the H2G one? We should also take into account that the latter has more connections. Do you think that if we compared H2G against an LSTM with a similar number of connections (obtained by adding more hidden units), it would still outperform it in the same way?

Note: an easy (undocumented) way to know the number of neurons/connections is to call Neuron.quantity(). It returns something like: {neurons: 680, connections: 11430}.

The count of neurons and connections is global (across all the networks created).
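For example, to compare the sizes of the two topologies one could diff those global counters around each construction (just a sketch; since the counters are global, this only works if the networks are built one at a time):

var before = Neuron.quantity();
var lstm = new Architect.LSTM(2, 5, 1);
var after = Neuron.quantity();
console.log('neurons:', after.neurons - before.neurons,
            'connections:', after.connections - before.connections);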

Sleepwalking commented 9 years ago

@cazala The above results already show that the 2-5-1 H2G network (28KB) outperformed the 2-15-1 network without H2G (70KB). I suggest we add an option to Architect.LSTM and let users choose whether there should be H2G connections :)
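A possible shape for that option (a hypothetical sketch, not the current Architect API): a flag such as new Architect.LSTM(2, 5, 1, { hiddenToGates: true }) could switch the hidden-layer loop in architect.js between the existing one-to-one peepholes and full memory-cell-to-gate projections:

if (options && options.hiddenToGates) {
    // hidden-to-gates: every memory cell projects to every gate
    memoryCell.project(inputGate);
    memoryCell.project(forgetGate);
    memoryCell.project(outputGate);
} else {
    // peepholes (one-to-one), the current behavior
    memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
    memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
    memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);
}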

cazala commented 9 years ago

:+1: maybe h2g should even be the default topology, with one-to-one peepholes as the option

Sleepwalking commented 9 years ago

[Output sample plot] The best-looking result on the test set I've gotten so far on Timing Task 2: a 2-7-1 topology with hidden-to-hidden and hidden-to-gate connections.