cazala / synaptic

architecture-free neural network library for node.js and the browser
http://caza.la/synaptic

LSTM question #175

Open ldenoue opened 7 years ago

ldenoue commented 7 years ago

In http://caza.la/synaptic/#/dsr, the input dimension is 6 and the output is 2. I thought an LSTM had the same input and output dimensions. When the output dimension is smaller, they use the CTC algorithm to "align" the output sequence to the desired shorter label sequence. Could you clarify how your implementation manages to have different output dimensions? Thanks, Laurent
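
For reference, here is a minimal sketch of the setup being asked about, using synaptic's `Architect.LSTM`; the memory-block size of 7 is arbitrary here, not necessarily what the DSR demo uses:

```javascript
const { Architect } = require('synaptic');

// input size 6, one memory-block layer of 7 units (arbitrary), output size 2
const lstm = new Architect.LSTM(6, 7, 2);

// a 6-dimensional input activates to a 2-dimensional output
const output = lstm.activate([0, 1, 0, 0, 0, 0]);
console.log(output.length); // 2
```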

Jabher commented 7 years ago

@cazala this is your thing to talk about :)

cazala commented 7 years ago

A regular LSTM is a network with 6 layers: input layer, input gate, forget gate, memory cell, output gate and output layer (activated in that order). The layers that must have the same length are the input gate, forget gate, memory cell and output gate, basically because for each neuron in the memory cell there's one neuron in the input gate gating all the inbound connections, one neuron in the forget gate gating the cell's self-connection, and one neuron in the output gate gating all the outbound connections. The arrangement of these four neurons is called a memory block. The sizes of the input layer and output layer are independent of the size of the memory blocks, since they have all-to-all connections to the memory cell layer.
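
To make the wiring concrete, here is a rough sketch of a single memory-block layer with independent input and output sizes, assuming synaptic's `Layer.project()`/`gate()` API (this is a simplified illustration, not the library's actual source):

```javascript
const { Layer } = require('synaptic');

const inputLayer  = new Layer(6); // independent of the memory-block size
const inputGate   = new Layer(4);
const forgetGate  = new Layer(4);
const memoryCell  = new Layer(4); // these four layers share the same size
const outputGate  = new Layer(4);
const outputLayer = new Layer(2); // also independent

// connections into and out of the memory cell, plus the cell's self-connection
const inbound  = inputLayer.project(memoryCell);
const self     = memoryCell.project(memoryCell);
const outbound = memoryCell.project(outputLayer);

// the gates also receive the input, so they can learn when to open and close
inputLayer.project(inputGate);
inputLayer.project(forgetGate);
inputLayer.project(outputGate);

// one gating neuron per memory-cell neuron, hence the equal sizes
inputGate.gate(inbound, Layer.gateType.INPUT);    // gates the inbound connections
forgetGate.gate(self, Layer.gateType.ONE_TO_ONE); // gates the self-connection
outputGate.gate(outbound, Layer.gateType.OUTPUT); // gates the outbound connections
```

In practice you would just use `Architect.LSTM`, which does this wiring for you (e.g. `new Architect.LSTM(6, 4, 2)`).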

ldenoue commented 7 years ago

Yes, thanks Juan. For others, I've attached below the answer I sent you by email. One question still: do you know of anybody who has implemented an alignment algorithm like CTC in Synaptic?

"I reread the excellent OCR work published by Breuel et al on using lstm nets for OCR. You're actually right: they use an input size of 32 and output size of k (k being the number of letters in the Unicode they need to recognize). The need for CTC in their case is that they don't have an a priori alignment of the pixel columns (inputs) to the Unicode values: all they know in their training dataset is that one line image maps to a string of Unicode characters. That's why they use the CTC layer to train the net."