Looping in @pluskid for discussion, who is leading the design of the hybrid imperative and symbolic approach for supporting variable length and other patterns.
@jacobdevlin Thanks for sharing the information. We agree that variable-length sequence modeling is an important part of deep learning libraries and we are currently starting to implement that in MXNet.
The idea you mentioned is very nice. To my understanding, it instantiates several explicitly unrolled models of chosen lengths (e.g. 10, 25, 50), and the appropriate one is used to handle each mini-batch. We are trying to come up with an API that makes automatic unrolling as easy as possible. Then support for this kind of behavior should follow easily.
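To make that concrete, here is a minimal sketch in plain Python of the per-mini-batch selection (the bucket sizes and padding id are just illustrative placeholders, not an existing API):

```python
import bisect

# Hypothetical bucket sizes: one explicitly unrolled model per length.
BUCKETS = [10, 25, 50]
PAD = 0  # hypothetical padding token id

def assign_bucket(tokens):
    """Pick the smallest unrolled length that fits the sequence, then pad to it."""
    idx = bisect.bisect_left(BUCKETS, len(tokens))
    if idx == len(BUCKETS):
        raise ValueError("sequence longer than the largest unrolled model")
    bucket_len = BUCKETS[idx]
    return bucket_len, tokens + [PAD] * (bucket_len - len(tokens))

# Example: a 13-token sentence goes to the length-25 model, padded to 25.
bucket_len, padded = assign_bucket(list(range(13)))
assert bucket_len == 25 and len(padded) == 25
```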
But I don't understand your comment about the "end of sequence token" and your concern that it
would be very difficult/impossible to implement sequence-to-sequence+attention with the “end of sequence token” method
Can you elaborate or provide some pointers? Thanks a lot!
@pluskid The primary papers that describe the "sequence-to-sequence+attention" approach are here: Neural Machine Translation by Jointly Learning to Align and Translate
Sequence to Sequence Learning with Neural Networks
You can see that they were originally developed for machine translation, but they have since been used for many other tasks (image/video captioning, conversational agents, semantic parsing, etc.)
So for the machine translation case, you have a source sentence (e.g., "el hombre es alto") and a target sentence (e.g., "the man is tall").
The state of the art model is to do the following:
Run a bidirectional RNN over the source:
SF0 → SF1 → SF2 ... SF19
SB0 ← SB1 ← SB2 ... SB19
where SF refers to the forward source recurrent layer and SB is the backwards source recurrent layer (these will generally be LSTMs or GRUs).
Then you take the final SF state and the final SB state, concatenate them, project them through a fully connected layer, and use that as the initial recurrent state in your target network:
SF19 + SB0 → T'
Next, for each source word, you concatenate these states and feed them through another fully connected layer:
SF0 + SB0 → S0
SF1 + SB1 → S1
...
Finally you run a target recurrent network, which also computes attention over the source representations S0, S1, ... (basically a softmax that is performed at every step).
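To make the wiring concrete, here is a minimal numpy sketch of the above (a plain tanh recurrence stands in for the LSTM/GRU cells, and all shapes and names are illustrative):

```python
import numpy as np

H, L = 4, 20                     # hidden size, source length (words 0..19)
rng = np.random.default_rng(0)
src = rng.normal(size=(L, H))    # embedded source words

def rnn_step(x, h, W):
    # Stand-in for an LSTM/GRU cell: a single tanh recurrence.
    return np.tanh(x @ W[0] + h @ W[1])

Wf, Wb = rng.normal(size=(2, H, H)), rng.normal(size=(2, H, H))

# Forward sweep SF0 -> ... -> SF19 and backward sweep SB19 -> ... -> SB0.
SF, SB = [], [None] * L
h = np.zeros(H)
for t in range(L):
    h = rnn_step(src[t], h, Wf); SF.append(h)
h = np.zeros(H)
for t in reversed(range(L)):
    h = rnn_step(src[t], h, Wb); SB[t] = h

# Initial target state T': concat(final SF, final SB) through a projection.
Wproj = rng.normal(size=(2 * H, H))
T0 = np.tanh(np.concatenate([SF[-1], SB[0]]) @ Wproj)

# Per-word source representations S_t = proj(concat(SF_t, SB_t)).
S = np.stack([np.tanh(np.concatenate([SF[t], SB[t]]) @ Wproj) for t in range(L)])

# One attention step on the target side: a softmax over scores against S.
def attend(h_target):
    scores = S @ h_target
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ S                      # context vector

context = attend(T0)
```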
Now imagine you were trying to do this to support variable length sequences using an "end of sequence token". So you unroll your network to a max of 50 source words and 50 target words. But for the current batch, you want to limit it to 20 source words and 20 target words. You would have a bunch of problems:

1. With early termination, the S layers would be built from the first 20 states from SF and the first 20 from SB, but the first 20 states from SB actually correspond to words 30-49, not 0-19.
2. The connection from the final SF and final SB to the target network is also a problem with early termination. It would be expecting to be connected to SF49 and SB0. But we actually need it to be connected to SF19 and SB0.
3. The attention should only cover S0 through S19. But it will actually try to attend to S0 through S49. Making it only attend to S0 through S19 would require a lot of special logic.

TensorFlow supports such a model with variable length sequences out of the box without any special logic, using the method that I had described.
I'm pretty sure that MxNet can support such a network also, just not with variable length sequences (although I haven't figured out whether the attention layer can be implemented purely in the Python layer or requires its own C++ operator, but that's a separate question).
@jacobdevlin Sorry for the late reply. I was traveling and trying to build a basic prototype of the sequence API during the last week. We will discuss and finalize that this weekend. I guess we can discuss the details during the meeting. But here are some thoughts related to your comments above.
I understand your concern that explicitly unrolled models have this issue of having to deal with sequences that are shorter than the unrolled length, especially in attention models and bidirectional models. But I do not understand how the TensorFlow way could resolve this. Based on my understanding of your description above, they explicitly unroll for several possible lengths (5, 10, 15, 25, etc.). But unless they do this for every possible sequence length, there are still cases where the actual sequence length does not match the unrolled length. Right? Are you talking about the official seq2seq example in TensorFlow? I will try to have a look at that before our discussion.
The prototype we came up with is a hybrid imperative + symbolic solution with minimal modification to the existing MXNet internals. Basically it uses the Python side to explicitly schedule the recurrence, and the computation graph is instantiated only for one time step. The per-step "operator" does not need to be a single built-in operator (e.g. FullyConnected or Convolution). It could be a sub computation graph that combines existing operators. As a result, it becomes very flexible to construct recurrent "operators" purely in Python by composing a subgraph using the symbolic API. So for example, an LSTM cell, or the attention module -- a softmax that summarizes the last 3 inputs, for example -- could easily be defined in Python. I'll send the actual meeting time by tonight. We can discuss more details during the meeting. We really appreciate all your input and feedback!
@pluskid In the TensorFlow model (disjoint graph support) it is still up to the user to bucket their data and pad each segment to the closest bucket. So one minibatch would contain all segments of length 11-25, padded to be exactly 25. In most applications I can think of, this is fine.
With an API that schedules the recurrence per time step, the whole network has to have the same recurrence sequence pattern (e.g. forward sweeping through the input sequence, backward sweeping, etc.).
Most architectures which get state-of-the-art results on various NLP tasks (and not just NLP -- speech recognition, video processing, etc) do not have such a simple pattern. We could design an API which works for certain architectures now (like seq2seq+attention), but who knows what type of architectures will become popular in the near future. The only truly general mechanism is to let the user specify whatever graph they want -- which MxNet already lets you do, but only for a single connected graph.
For example, consider the recursive neural network architecture where the network is a binary tree. The leaf nodes are the data, and the intermediate nodes all share the same parameters (it could be a fully connected layer or an LSTM-like layer).
N
/ \
N N
/ \ / \
a b c d
It would be extremely difficult to design an API that allows for variable length sequences here. But the bucketing method would work naturally.
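As a rough illustration of why bucketing handles this naturally, here is a sketch (plain Python, illustrative names only) where each distinct tree shape is simply its own graph built from a shared node function:

```python
import numpy as np

H = 4
rng = np.random.default_rng(2)
W = rng.normal(size=(2 * H, H))        # parameters shared by every internal node N

def compose(left, right):
    # The shared node function (could be an LSTM-like cell in practice).
    return np.tanh(np.concatenate([left, right]) @ W)

def encode(tree):
    """tree is either a leaf vector or a (left_subtree, right_subtree) tuple."""
    if isinstance(tree, np.ndarray):
        return tree
    left, right = tree
    return compose(encode(left), encode(right))

# The tree ((a, b), (c, d)) from above; a different tree shape would simply be
# a different "bucket", i.e. a different unrolled graph reusing the same W.
a, b, c, d = rng.normal(size=(4, H))
root = encode(((a, b), (c, d)))
```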
There are other advantages to the bucketing method (i.e., supporting multiple subgraphs that share parameters): it easily allows for multi-task/multi-modal training. For example, you could train a translation model and a language model jointly and tie certain parameters together. In this case, you would specify two disjoint graphs and randomly alternate whether the minibatch corresponds to the translation model or the language model. The embedding layer (and possibly other layers) would be tied together.
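A minimal sketch of that setup (plain Python stand-ins, illustrative names only), where the two "graphs" share one embedding table and minibatches alternate randomly between them:

```python
import numpy as np

V, H = 1000, 16
rng = np.random.default_rng(3)
embedding = rng.normal(size=(V, H))    # parameter tied across both graphs

def translation_step(src_ids, tgt_ids):
    # Stand-in for the translation subgraph: looks up the shared embedding.
    return embedding[src_ids], embedding[tgt_ids]

def language_model_step(ids):
    # Stand-in for the language-model subgraph: same embedding table.
    return embedding[ids]

for _ in range(10):
    if rng.random() < 0.5:             # randomly alternate which graph a batch uses
        translation_step(rng.integers(0, V, size=20), rng.integers(0, V, size=20))
    else:
        language_model_step(rng.integers(0, V, size=20))
```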
Hi, what happened to this? Also - I could not access the link you posted above.
Broken link alternative - https://mxnet.incubator.apache.org/faq/bucketing.html
MxNet seems like a super awesome library, but it may be missing a piece of functionality that is critical to a wide variety of tasks, which is support for variable length sequences.
In the LSTM example, the segments are split into fixed length chunks (e.g., 35 words) which may span segment boundaries. This works OK for language modeling, but for many other tasks (neural machine translation, neural parsing, video captioning, recurrent speech recognition, etc.), the training algorithm must respect the exact segment boundaries of the data. In that case, the only way to use MxNet is to pad each segment to be the maximum length of any sequence in the data, which could slow down training by 5x or more (e.g., if the maximum segment length is 50 words and the average segment length is 10 words).
I understand that because of the graph optimization step and minibatching, supporting truly variable-length sequences is difficult. However, there is a very nice solution used by TensorFlow, which is to support disjoint subgraphs with multiple "terminal symbols". Note that this solution does not have any explicit concept of variable length sequences.
For example, the existing MxNet LSTM language model looks like this (e.g., R is the recurrent layer and O is the output layer):

In the proposed method, the user can specify disjoint sub-graphs which form "buckets". For example, in the below case, the user would specify a single disjoint graph, which has three subgraphs corresponding to sequence lengths 50, 25, and 10.
Just like R0, R1, etc. share parameters in the original, R0A, R1A, R0B, ... would share parameters here.

At runtime, the user would specify data as usual, plus exactly one terminal symbol. It's the user's job to split their data into minibatches and determine the appropriate bucket. E.g., one minibatch would be a collection of segments which are all length 11-25, which have been padded to be exactly length 25. Then, to train on a particular batch they would specify
data=[w_0_0 ... w_0_24; w_1_0 ... w_1_24; ...], terminal=LossB
Critically, since each minibatch can only be associated with exactly one terminal symbol, the graph optimizer would know that any disconnected symbols will never be used in the same minibatch. This would allow it to be efficient with memory, and specifying multiple buckets (subgraphs) would add almost no additional memory compared to just a single graph of the maximum length.
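For illustration only, a hypothetical sketch of the per-minibatch dispatch under this scheme (the bucket lengths mirror the example above; the terminal names LossA/LossC and the executor mapping are made up):

```python
# Hypothetical sketch of the per-minibatch dispatch under the proposed scheme.
BUCKETS = {10: "LossA", 25: "LossB", 50: "LossC"}   # unrolled length -> terminal symbol

def train_batch(segments, executor_for_terminal):
    max_len = max(len(s) for s in segments)
    bucket = min(b for b in BUCKETS if b >= max_len)         # e.g. lengths 11-25 -> 25
    data = [s + [0] * (bucket - len(s)) for s in segments]   # pad to exactly the bucket
    terminal = BUCKETS[bucket]                               # exactly one terminal symbol
    executor_for_terminal[terminal](data)                    # run only that subgraph

# Usage with placeholder executors: a 15-token and a 2-token segment -> bucket 25 (LossB).
executors = {name: (lambda batch: None) for name in BUCKETS.values()}
train_batch([[1, 2, 3] * 5, [4, 5]], executors)
```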
An alternative approach which has been discussed/used by other frameworks is to allow the user to specify a special "end of sequence token" in the data binding, which will terminate the graph early. Unfortunately this does not generalize very well. It would be very difficult/impossible to implement sequence-to-sequence+attention with the "end of sequence token" method, since the target hidden state must be connected to the final source state, and the target sequence must attend to only a portion of the source state. In the above "bucket" method, supporting this type of architecture (or any other) with variable-length sequences comes naturally.