Looping in @pluskid for discussion, who is leading the design of the hybrid imperative and symbolic approach for supporting variable length and other patterns.
@jacobdevlin Thanks for sharing the information. We agree that variable-length sequence modeling is an important part of deep learning libraries and we are currently starting to implement that in MXNet.
The idea you mentioned is very nice. To my understanding, it instantiates several explicitly unrolled models of chosen lengths (e.g. 10, 25, 50), and the appropriate one is used to handle each mini-batch. We are trying to come up with an API that makes automatic unrolling as easy as possible. Then support for this kind of behavior should follow easily.
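To make that concrete, here is a minimal sketch in plain Python of the per-mini-batch selection (the bucket sizes and padding id are just illustrative placeholders, not an existing API):

```python
import bisect

# Hypothetical bucket sizes: one explicitly unrolled model per length.
BUCKETS = [10, 25, 50]
PAD = 0  # hypothetical padding token id

def assign_bucket(tokens):
    """Pick the smallest unrolled length that fits the sequence, then pad to it."""
    idx = bisect.bisect_left(BUCKETS, len(tokens))
    if idx == len(BUCKETS):
        raise ValueError("sequence longer than the largest unrolled model")
    bucket_len = BUCKETS[idx]
    return bucket_len, tokens + [PAD] * (bucket_len - len(tokens))

# Example: a 13-token sentence goes to the length-25 model, padded to 25.
bucket_len, padded = assign_bucket(list(range(13)))
assert bucket_len == 25 and len(padded) == 25
```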
But I don't understand your comment about the "end of sequence token" and your concern that it
would be very difficult/impossible to implement sequence-to-sequence+attention with the “end of sequence token” method
Can you elaborate or provide some pointers? Thanks a lot!
@pluskid The primary papers that describe the "sequence-to-sequence+attention" approach are here: Neural Machine Translation by Jointly Learning to Align and Translate
Sequence to Sequence Learning with Neural Networks
You can see that they were originally developed for machine translation, but they have since been used for many other tasks (image/video captioning, conversational agents, semantic parsing, etc.)
So for the machine translation case, you have a source sentence (e.g., "el hombre es alto") and a target sentence (e.g., "the man is tall").
The state of the art model is to do the following:
Run a bidirectional RNN over the source:
SF0 → SF1 → SF2 ... SF19
SB0 ← SB1 ← SB2 ... SB19
where SF refers to the forward source recurrent layer and SB is the backwards source recurrent layer (these will generally be LSTMs or GRUs).
Then you take the final SF state and the final SB state, concatenate them, project them through a fully connected layer, and use that as the initial recurrent state in your target network:
SF19 + SB0 → T'
Next, for each source word, you concatenate these states and feed them through another fully connected layer:
SF0 + SB0 → S0
SF1 + SB1 → S1
...
Finally you run a target recurrent network, which also computes attention over the source representations S0, S1, ... (basically a softmax that is performed at every step).
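To make the wiring concrete, here is a minimal numpy sketch of the above (a plain tanh recurrence stands in for the LSTM/GRU cells, and all shapes and names are illustrative):

```python
import numpy as np

H, L = 4, 20                     # hidden size, source length (words 0..19)
rng = np.random.default_rng(0)
src = rng.normal(size=(L, H))    # embedded source words

def rnn_step(x, h, W):
    # Stand-in for an LSTM/GRU cell: a single tanh recurrence.
    return np.tanh(x @ W[0] + h @ W[1])

Wf, Wb = rng.normal(size=(2, H, H)), rng.normal(size=(2, H, H))

# Forward sweep SF0 -> ... -> SF19 and backward sweep SB19 -> ... -> SB0.
SF, SB = [], [None] * L
h = np.zeros(H)
for t in range(L):
    h = rnn_step(src[t], h, Wf); SF.append(h)
h = np.zeros(H)
for t in reversed(range(L)):
    h = rnn_step(src[t], h, Wb); SB[t] = h

# Initial target state T': concat(final SF, final SB) through a projection.
Wproj = rng.normal(size=(2 * H, H))
T0 = np.tanh(np.concatenate([SF[-1], SB[0]]) @ Wproj)

# Per-word source representations S_t = proj(concat(SF_t, SB_t)).
S = np.stack([np.tanh(np.concatenate([SF[t], SB[t]]) @ Wproj) for t in range(L)])

# One attention step on the target side: a softmax over scores against S.
def attend(h_target):
    scores = S @ h_target
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ S                      # context vector

context = attend(T0)
```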
Now imagine you were trying to do this to support variable length sequences using an "end of sequence token". So you unroll your network to a max of 50 source words and 50 target words. But for the current batch, you want to limit it to 20 source words and 20 target words. You would have a bunch of problems:

1. With early termination, the S layers would be built from the first 20 states from SF and the first 20 from SB, but the first 20 states from SB actually correspond to words 30-49, not 0-19.
2. The connection from the final SF and final SB to the target network is also a problem with early termination. It would be expecting to be connected to SF49 and SB0. But we actually need it to be connected to SF19 and SB0.
3. The attention should only cover S0 through S19. But it will actually try to attend to S0 through S49. Making it only attend to S0 through S19 would require a lot of special logic.

TensorFlow supports such a model with variable length sequences out of the box without any special logic, using the method that I had described.
I'm pretty sure that MxNet can support such a network also, just not with variable length sequences (although I haven't figured out whether the attention layer can be implemented purely in the Python layer or requires its own C++ operator, but that's a separate question).
@jacobdevlin Sorry for the late reply. I was traveling and trying to build a basic prototype of the sequence API during the last week. We will discuss and finalize that this weekend. I guess we can discuss the details during the meeting. But here are some thoughts related to your comments above.
I understand your concern that explicitly unrolled models have this issue of having to deal with sequences that are shorter than the unrolled length, especially in attention models and bidirectional models. But I do not understand how the TensorFlow way could resolve this. Based on my understanding of your description above, they explicitly unroll for several possible lengths (5, 10, 15, 25, etc.). But unless they do this for every possible sequence length, there are still cases where the actual sequence length does not match the unrolled length. Right? Are you talking about the official seq2seq example in TensorFlow? I will try to have a look at that before our discussion.
The prototype we came up with is a hybrid imperative + symbolic solution with minimal modification to the existing MXNet internals. Basically it uses the Python side to explicitly schedule the recurrence, and the computation graph is instantiated only for one time step. The per-step "operator" does not need to be a single built-in operator (e.g. FullyConnected or Convolution). It could be a sub computation graph that combines existing operators. As a result, it becomes very flexible to construct recurrent "operators" purely in Python by composing a subgraph using the symbolic API. So for example, an LSTM cell, or the attention module -- a softmax that summarizes the last 3 inputs, for example -- could easily be defined in Python. I'll send the actual meeting time by tonight. We can discuss more details during the meeting. We really appreciate all your input and feedback!
@pluskid In the TensorFlow model (disjoint graph support) it is still up to the user to bucket their data and pad each segment to the closest bucket. So one minibatch would contain all segments of length 11-25, padded to be exactly 25. In most applications I can think of, this is fine.
With an API that schedules the recurrence per time step, the whole network has to have the same recurrence sequence pattern (e.g. forward sweeping through the input sequence, backward sweeping, etc.).
Most architectures which get state-of-the-art results on various NLP tasks (and not just NLP -- speech recognition, video processing, etc) do not have such a simple pattern. We could design an API which works for certain architectures now (like seq2seq+attention), but who knows what type of architectures will become popular in the near future. The only truly general mechanism is to let the user specify whatever graph they want -- which MxNet already lets you do, but only for a single connected graph.
For example, consider the recursive neural network architecture where the network is a binary tree. The leaf nodes are the data, and the intermediate nodes all share the same parameters (it could be a fully connected layer or an LSTM-like layer).
N
/ \
N N
/ \ / \
a b c d
It would be extremely difficult to design an API that allows for variable length sequences here. But the bucketing method would work naturally.
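As a rough illustration of why bucketing handles this naturally, here is a sketch (plain Python, illustrative names only) where each distinct tree shape is simply its own graph built from a shared node function:

```python
import numpy as np

H = 4
rng = np.random.default_rng(2)
W = rng.normal(size=(2 * H, H))        # parameters shared by every internal node N

def compose(left, right):
    # The shared node function (could be an LSTM-like cell in practice).
    return np.tanh(np.concatenate([left, right]) @ W)

def encode(tree):
    """tree is either a leaf vector or a (left_subtree, right_subtree) tuple."""
    if isinstance(tree, np.ndarray):
        return tree
    left, right = tree
    return compose(encode(left), encode(right))

# The tree ((a, b), (c, d)) from above; a different tree shape would simply be
# a different "bucket", i.e. a different unrolled graph reusing the same W.
a, b, c, d = rng.normal(size=(4, H))
root = encode(((a, b), (c, d)))
```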
There are other advantages to the bucketing method (i.e., supporting multiple subgraphs that share parameters): it easily allows for multi-task/multi-modal training. For example, you could train a translation model and a language model jointly and tie certain parameters together. In this case, you would specify two disjoint graphs and randomly alternate whether the minibatch corresponds to the translation model or the language model. The embedding layer (and possibly other layers) would be tied together.
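A minimal sketch of that setup (plain Python stand-ins, illustrative names only), where the two "graphs" share one embedding table and minibatches alternate randomly between them:

```python
import numpy as np

V, H = 1000, 16
rng = np.random.default_rng(3)
embedding = rng.normal(size=(V, H))    # parameter tied across both graphs

def translation_step(src_ids, tgt_ids):
    # Stand-in for the translation subgraph: looks up the shared embedding.
    return embedding[src_ids], embedding[tgt_ids]

def language_model_step(ids):
    # Stand-in for the language-model subgraph: same embedding table.
    return embedding[ids]

for _ in range(10):
    if rng.random() < 0.5:             # randomly alternate which graph a batch uses
        translation_step(rng.integers(0, V, size=20), rng.integers(0, V, size=20))
    else:
        language_model_step(rng.integers(0, V, size=20))
```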
Hi, what happened to this? Also - I could not access the link you posted above.
Broken link alternative - https://mxnet.incubator.apache.org/faq/bucketing.html
MxNet seems like a super awesome library, but it may be missing a piece of functionality that is critical to a wide variety of tasks, which is support for variable length sequences.
In the LSTM example, the segments are split into fixed length chunks (e.g., 35 words) which may span segment boundaries. This works OK for language modeling, but for many other tasks (neural machine translation, neural parsing, video captioning, recurrent speech recognition, etc.), the training algorithm must respect the exact segment boundaries of the data. In that case, the only way to use MxNet is to pad each segment to be the maximum length of any sequence in the data, which could slow down training by 5x or more (e.g., if the maximum segment length is 50 words and the average segment length is 10 words).
I understand that because of the graph optimization step and minibatching, supporting truly variable-length sequences is difficult. However, there is a very nice solution used by TensorFlow, which is to support disjoint subgraphs with multiple "terminal symbols". Note that this solution does not have any explicit concept of variable length sequences.
For example, the existing MxNet LSTM language model looks like this (e.g., R is the recurrent layer and O is the output layer):

In the proposed method, the user can specify disjoint sub-graphs which form "buckets". For example, in the below case, the user would specify a single disjoint graph, which has three subgraphs corresponding to sequence lengths 50, 25, and 10.
Just like R0, R1, etc. share parameters in the original, R0A, R1A, R0B, ... would share parameters here.

At runtime, the user would specify data as usual, plus exactly one terminal symbol. It's the user's job to split their data into minibatches and determine the appropriate bucket. E.g., one minibatch would be a collection of segments which are all length 11-25, which have been padded to be exactly length 25. Then, to train on a particular batch they would specify
data=[w_0_0 ... w_0_24; w_1_0 ... w_1_24; ...], terminal=LossB
Critically, since each minibatch can only be associated with exactly one terminal symbol, the graph optimizer would know that any disconnected symbols will never be used in the same minibatch. This would allow it to be efficient with memory, and specifying multiple buckets (subgraphs) would add almost no additional memory compared to just a single graph of the maximum length.
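For illustration only, a hypothetical sketch of the per-minibatch dispatch under this scheme (the bucket lengths mirror the example above; the terminal names LossA/LossC and the executor mapping are made up):

```python
# Hypothetical sketch of the per-minibatch dispatch under the proposed scheme.
BUCKETS = {10: "LossA", 25: "LossB", 50: "LossC"}   # unrolled length -> terminal symbol

def train_batch(segments, executor_for_terminal):
    max_len = max(len(s) for s in segments)
    bucket = min(b for b in BUCKETS if b >= max_len)         # e.g. lengths 11-25 -> 25
    data = [s + [0] * (bucket - len(s)) for s in segments]   # pad to exactly the bucket
    terminal = BUCKETS[bucket]                               # exactly one terminal symbol
    executor_for_terminal[terminal](data)                    # run only that subgraph

# Usage with placeholder executors: a 15-token and a 2-token segment -> bucket 25 (LossB).
executors = {name: (lambda batch: None) for name in BUCKETS.values()}
train_batch([[1, 2, 3] * 5, [4, 5]], executors)
```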
An alternative approach which has been discussed/used by other frameworks is to allow the user to specify a special "end of sequence token" in the data binding, which will terminate the graph early. Unfortunately this does not generalize very well. It would be very difficult/impossible to implement sequence-to-sequence+attention with the "end of sequence token" method, since the target hidden state must be connected to the final source state, and the target sequence must attend to only a portion of the source state. In the above "bucket" method, supporting this type of architecture (or any other) with variable-length sequences comes naturally.