apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Higher Level API for RNN #3930

Closed sxjscience closed 7 years ago

sxjscience commented 7 years ago

We've created a higher-level API for recurrent neural networks and have completed gradient tests, forward tests, and a speed comparison against CuDNN. The class definition and key methods look like this:

class RNN(object):
    """High level API for constructing stacked RNN layers.

    To use a recurrent neural network, we can first create an RNN object and use the step function
    during the symbol construction.

    Currently four types of RNN are supported, and all parameters per layer are grouped into 4 matrices.
    The data layout and transition rules are similar to the RNN API in CuDNN (https://developer.nvidia.com/cudnn):
    1) ReLU RNN:
        h_t = ReLU(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})

        Parameters:
            W_{i2h} = W_i
            b_{i2h} = b_{W_i}
            W_{h2h} = R_i
            b_{h2h} = b_{R_i}
    2) Tanh RNN:
        h_t = tanh(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})

        Parameters:
            W_{i2h} = W_i
            b_{i2h} = b_{W_i}
            W_{h2h} = R_i
            b_{h2h} = b_{R_i}
    3) LSTM:
        i_t = \sigma(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})
        f_t = \sigma(W_f x_t + R_f h_{t-1} + b_{W_f} + b_{R_f})
        o_t = \sigma(W_o x_t + R_o h_{t-1} + b_{W_o} + b_{R_o})
        c^\prime_t = tanh(W_c x_t + R_c h_{t-1} + b_{W_c} + b_{R_c})
        c_t = f_t \circ c_{t-1} + i_t \circ c^\prime_t
        h_t = o_t \circ tanh(c_t)

        Parameters: (input_gate, forget_gate, new_mem, output_gate)
            W_{i2h} = [W_i, W_f, W_c, W_o]
            b_{i2h} = [b_{W_i}, b_{W_f}, b_{W_c}, b_{W_o}]
            W_{h2h} = [R_i, R_f, R_c, R_o]
            b_{h2h} = [b_{R_i}, b_{R_f}, b_{R_c}, b_{R_o}]
    4) GRU:
        i_t = \sigma(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})
        r_t = \sigma(W_r x_t + R_r h_{t-1} + b_{W_r} + b_{R_r})
        h^\prime_t = tanh(W_h x_t + r_t \circ (R_h h_{t-1} + b_{R_h}) + b_{W_h})
        h_t = (1 - i_t) \circ h^\prime_t + i_t \circ h_{t-1}

        Parameters: (reset_gate, update_gate, new_mem)
            W_{i2h} = [W_r, W_i, W_h]
            b_{i2h} = [b_{W_r}, b_{W_i}, b_{W_h}]
            W_{h2h} = [R_r, R_i, R_h]
            b_{h2h} = [b_{R_r}, b_{R_i}, b_{R_h}]
    """
    def __init__(self, num_hidden, data_dim, typ='lstm',
                 dropout=0., zoneout=0.,
                 i2h_weight=None, i2h_bias=None,
                 h2h_weight=None, h2h_bias=None,
                 init_h=None, init_c=None,
                 cudnn_opt=False,
                 name='LSTM'):
        """Initialization of the RNN object

        Parameters
        ----------
        num_hidden : list or tuple
            Size of the hidden state for all the layers
        data_dim : int
            Dimension of the input data to the symbol
        typ : str
            Type of the recurrent neural network; can be 'gru', 'lstm', 'rnn_relu', or 'rnn_tanh'
        dropout : list or tuple, optional
            Dropout ratio for all the hidden layers. Use 0 to indicate no dropout.
        zoneout : list or tuple, optional
            Zoneout ratio for all the hidden layers. Use 0 to indicate no zoneout.
        i2h_weight : list or tuple, optional
            Weight of the connections between the input and the hidden state.
        i2h_bias : list or tuple, optional
            Bias of the connections between the input and the hidden state.
        h2h_weight : list or tuple, optional
            Weight of the connections (including gates) between the hidden states of consecutive time steps.
        h2h_bias : list or tuple, optional
            Bias of the connections (including gates) between the hidden states of consecutive time steps.
        init_h : list or tuple, optional
            Initial hidden states of all the layers
        init_c : list or tuple, optional
            Initial cell states of all the layers. Only applicable when `typ` is 'lstm'.
        cudnn_opt : bool, optional
            If True, the CuDNN version of the RNN will be used. Also, the generated symbol can only be
            used on GPU, and `zoneout` cannot be used.
        name : str
            Name of the object
        """
    def step(self, data, prev_h=None, prev_c=None, seq_len=1, ret_typ="all"):
        """Feed the data sequence into the RNN and get the state symbols.

        Parameters
        ----------
        data : list or tuple or Symbol
            The input data. Shape: (seq_len, batch_size, data_dim)
        prev_h : list or tuple or Symbol or None, optional
            The initial hidden states. If None, the symbol constructed during initialization
            will be used.
            The number of initial states must match the number of layers,
            e.g., [h0, h1, h2] for a 3-layer RNN.
        prev_c : list or tuple or Symbol or None, optional
            The initial cell states. Only applicable when `typ` is 'lstm'. If None,
            the symbol constructed during initialization will be used.
            The number of initial states must match the number of layers,
            e.g., [c0, c1, c2] for a 3-layer LSTM.
        seq_len : int, optional
            Length of the data sequence
        ret_typ : str, optional
            Determines which parts of the states to return; can be 'all', 'out', or 'state'.
            IMPORTANT: when `cudnn_opt` is on, only 'out' is supported.
            If 'all', symbols that represent the states of all the time steps as well as
             the states of the last time step will be returned,
                e.g., for a 3-layer GRU and a length-10 data sequence, the return value will be
                     ([h0, h1, h2], [h0_9, h1_9, h2_9])
                      Here every hi has shape (seq_len, batch_size, num_hidden[i]) and
                      every hi_j has shape (batch_size, num_hidden[i]).
                     For a 3-layer LSTM and a length-10 data sequence, the return value contains both states and cells:
                     ([h0, h1, h2], [c0, c1, c2], [h0_9, h1_9, h2_9], [c0_9, c1_9, c2_9])
            If 'out', the state outputs of the layers will be returned,
                e.g., for a 3-layer GRU/LSTM and a length-10 data sequence, the return value will be
                     [h0, h1, h2]
            If 'state', the last states/cells will be returned,
                e.g., for a 3-layer GRU and a length-10 data sequence, the return value will be
                     [h0_9, h1_9, h2_9]
                     For a 3-layer LSTM and a length-10 data sequence, the return value will be
                     ([h0_9, h1_9, h2_9], [c0_9, c1_9, c2_9])

        Returns
        -------
        tuple
            States generated by feeding the data sequence into the network, as
            selected by `ret_typ`.

            If `ret_typ` is 'all', the states of all the time steps as well as the
            states of the last time step are returned. If 'out', only the per-step
            state outputs are returned. If 'state', only the states of the last
            time step are returned.

        """

We have decided to submit a pull request for this feature, @leezu.

Should we create a new directory under "python/mxnet", like "operators", to store these kinds of composed symbols? What do you think? @pluskid @piiswrong @tqchen @xlvector @sbodenstein

pluskid commented 7 years ago

Great! Thanks a lot! We should have had this built into the standard package long ago instead of the bare-bones unroll function in the examples. Since there are not going to be many different Python-composed symbols in the standard library for now, I guess simply putting it in python/mxnet/rnn.py would be fine?

Also, if I understand correctly, the RNN class is a constructor whose step function needs to be called to compose a symbol, right? This is a different convention from the cudnn RNN cell, where the constructor itself is a composition function. Maybe we need to think twice about the naming here to avoid confusion between the two cases.

sxjscience commented 7 years ago

@pluskid Yes, the RNN class here is a symbol constructor. I'm also thinking about the naming issue. Maybe call it "RNNFactory" to distinguish it from the cudnn version?

sxjscience commented 7 years ago

@leezu @jennyzhang0215 Let's submit the PR by this weekend (Dec 4th) and add some examples.

xlvector commented 7 years ago

Great!

I remember the CUDNN RNN cell needs the data transposed before input, so what is the input shape of this operator? I also find that the current non-CUDNN version only reaches 30%~50% GPU utilization, and it's not easy to reach 100%. You mentioned you have done a speed test against the CUDNN version; are there any reports?

sxjscience commented 7 years ago

@xlvector I find that cudnn is 3/6 times faster than the original implementation. The input shape is chosen to be the same as cudnn, i.e., (seq_len, batch_size, data_dim).
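
For reference, a small sketch (not part of the proposal) of converting batch-major data into this time-major layout with the existing NDArray API:

import mxnet as mx

# Hypothetical batch-major input: (batch_size, seq_len, data_dim)
batch_major = mx.nd.zeros((32, 10, 256))
# cuDNN-style time-major layout: (seq_len, batch_size, data_dim)
time_major = mx.nd.transpose(batch_major, axes=(1, 0, 2))
print(time_major.shape)  # (10, 32, 256)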

leezu commented 7 years ago

The iterators defined in https://github.com/dmlc/mxnet/blob/master/example/rnn-time-major/bucket_io.py are helpful for getting the correct (time-major) input shape more easily. One could also add a more general version to https://github.com/dmlc/mxnet/blob/master/python/mxnet/io.py . What do you think?
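
For illustration, a rough sketch of what such a general version could look like, assuming the wrapped iterator yields batch-major (batch_size, seq_len, dim) arrays; TimeMajorIter is a hypothetical name, not an existing class:

import mxnet as mx

class TimeMajorIter(mx.io.DataIter):
    """Hypothetical wrapper that transposes each batch to time-major layout."""
    def __init__(self, base_iter):
        super(TimeMajorIter, self).__init__()
        self.base_iter = base_iter

    def reset(self):
        self.base_iter.reset()

    def next(self):
        batch = self.base_iter.next()
        # (batch_size, seq_len, dim) -> (seq_len, batch_size, dim)
        batch.data = [mx.nd.transpose(d, axes=(1, 0, 2)) for d in batch.data]
        return batch

# A full version would also expose provide_data/provide_label with the
# transposed shapes so it can be used directly with mx.mod.Module.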

zhenlinluo commented 7 years ago

My understanding is that the RNN input and output shapes are already defined in the InferShape API in RNN-inl.h, so I am using them in my RNN implementation.
// data: [sequence len, batch, input dimension]
// Hidden shape is dim [total_layers, batch, state_size]
// Cell shape is dim [total_layers, batch, state_size]
// output: [sequence len, numdirection, num_direction * state_size]
// outStateShape: [layer_num, batch, state size]

piiswrong commented 7 years ago

@mli

sbodenstein commented 7 years ago

@sxjscience: you should be explicit about which dropout you are referring to. Is it the version from "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" (https://arxiv.org/pdf/1512.05287.pdf), or the older and less effective one (arXiv:1409.2329v5)? (Or support both as options?)

sxjscience commented 7 years ago

@sbodenstein Currently we've only implemented the old version. The dropout method from the NIPS paper should be added later. Also, the performance boost in the paper is partly due to the dropout applied to the embedding layer, which we can also support.

sxjscience commented 7 years ago

@zhenlinluo The problem may be that by doing this we cannot support stacked RNNs with different hidden state sizes. The API will be easier to use if we separate the weights, biases, and states.

zhenlinluo commented 7 years ago

Hi all, I have one question: if I run the RNN layer and the return value is a combination of kOut, kStateOut, and kStateCellOut, how can I get the top-right element of kStateOut and kStateCellOut in Python? What structure or API can I use? In rnn_cell_demo, this line is used after calling RNN: hidden = mx.sym.Reshape(data=rnn, shape=(-1, num_hidden))

But I don't quite understand it. For seq2seq, h, y, and c will be output, but how can I get just the top-right of h and c from the stream?
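
(For reference, a sketch of the generic mechanism: a symbol with several outputs can be split by integer indexing, and the symbol returned by the RNN operator with state outputs enabled can be indexed the same way. The example below uses mx.sym.Group purely for illustration.)

import mxnet as mx

a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
grouped = mx.sym.Group([a, b])   # a symbol with two outputs
print(grouped.list_outputs())    # ['a', 'b']
first = grouped[0]               # pick out a single output by index
# The RNN symbol's outputs (output, state and, for LSTM, state cell) can be
# selected in the same way before reshaping or slicing them further.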

magic282 commented 7 years ago

@sxjscience Hi, what do you mean by a "3/6 times faster" speed-up with CUDNN? A 3-6 times speed-up?

sxjscience commented 7 years ago

@magic282 Yes, cudnn is faster.

magic282 commented 7 years ago

@sxjscience I strongly suggest that we have a benchmark for RNN-related models, such as an RNN LM, s2s, and attention, compared against the popular Theano and TF. I have actually implemented an s2s+attention NMT model with mxnet and got state-of-the-art results on the IWSLT data. However, the training speed is not very fast. According to this benchmark https://github.com/guolinke/deep-learning-benchmarks/blob/master/result.md , mxnet is faster than the other tools for FCN models but slower for LSTM. I am really curious about the reason. But this is just a reference, since I don't really believe that Theano is much faster than Torch.

And about the CUDNN speed-up: our group has built a DL framework from scratch, and we are planning to leverage CUDNN to speed up training. We can only get at most a 2x speed-up, since our tool is already very fast for RNN models (because of some RNN-specific optimizations). So I am wondering whether mxnet could have some optimizations for RNN models without CUDNN support?

sxjscience commented 7 years ago

@magic282 Yes, we need to include such a benchmark. Would you mind sharing your implementation? In fact, we don't have an s2s + attention example yet. It will be easier for us to investigate why the speed is not satisfactory if we have the example code.

magic282 commented 7 years ago

@sxjscience Sure, I will refactor the code and share it; I hope we can speed it up.

zhenlinluo commented 7 years ago

@magic282, do you have any early code that calls the RNN layer rather than the lstm_unroll Python code? I have already implemented an MKL-based RNN for CPU and plan to release it later, but I lack an s2s model to test the performance.

magic282 commented 7 years ago

@zhenlinluo Nope.

zhenlinluo commented 7 years ago

@magic282 Since you have an s2s model based on the cudnn RNN layer, could you please help answer my question about what needs to be returned after calling mx.sym.RNN(xxxx, state_output=True) in the encoder and decoder functions? The output TBlob will include kOut, kStateOut, and kCell, but I just need to return the 2D arrays kStateOut and kCell of the top-right cell as input to the decoder; how do I do that?

magic282 commented 7 years ago

@zhenlinluo I don't have an s2s model based on the cudnn RNN symbol, since I implemented LSTM/GRU using the basic ops.

zhenlinluo commented 7 years ago

@mli @piiswrong Do you know how to get a specific output when a symbol has multiple outputs in Python?

magic282 commented 7 years ago

@sxjscience Hi, I have uploaded the s2s + attention code. It might have some bugs, since this time I can only get a BLEU score of 44. Link: https://github.com/magic282/MXNMT

sxjscience commented 7 years ago

@magic282 Is it possible to revise the code to run on the WMT14 or WMT15 dataset? We'd better do a paired comparison with TensorFlow.

magic282 commented 7 years ago

Actually, I am not very familiar with the MT task. Could you please provide the data format? Is it the issue with the number of references?

sxjscience commented 7 years ago

@magic282 We may need to refer to their seq2seq example. https://www.tensorflow.org/versions/r0.12/tutorials/seq2seq/index.html

goodmansasha commented 7 years ago

Regarding this discussion (re: Yarin Gal's RNN implementation):

@sxjscience: you should be explicit about which dropout you are referring to. Is it the version from "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" (https://arxiv.org/pdf/1512.05287.pdf), or the older and less effective one (arXiv:1409.2329v5)? (Or support both as options?)

@sbodenstein Currently we've only implemented the old version. The dropout method from the NIPS paper should be added later. Also, the performance boost in the paper is partly due to the dropout applied to the embedding layer, which we can also support.

According to Yarin, his implementation is already in Keras, TensorFlow, and Torch. I think it basically uses the same dropout mask across time steps in an RNN. The implementations might include examples that handle RNN layers of different sizes, and I also speculate that this opens up the possibility of quantifying uncertainty in RNN predictions as that research continues.

See: https://github.com/yaringal/BayesianRNN/issues/3
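
For concreteness, a small sketch of the shared-mask idea with NDArray (a hand-unrolled loop; cell_step, the shapes, and the keep probability are made-up names for illustration):

import numpy as np
import mxnet as mx

def shared_dropout_mask(batch_size, num_hidden, keep_prob):
    # One Bernoulli mask, sampled once per sequence (inverted-dropout scaling).
    mask = (np.random.rand(batch_size, num_hidden) < keep_prob) / keep_prob
    return mx.nd.array(mask)

# In a hand-unrolled loop the same mask is applied at every time step, unlike
# standard dropout, which would resample the mask at each step:
#     mask = shared_dropout_mask(batch_size, num_hidden, keep_prob=0.8)
#     for t in range(seq_len):
#         h = cell_step(x[t], h)   # hypothetical RNN cell step
#         h = h * mask             # identical mask across time steps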

sxjscience commented 7 years ago

@predict-r Thanks very much! I'm busy with my PQE exam and will work on this after I finish it. I find that the "dropout" used in the paper is actually a type of "DropConnect" (http://www.jmlr.org/proceedings/papers/v28/wan13.pdf) that directly masks the weight matrix. I'm thinking of keeping its original name.
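
As a small illustration of that distinction (a sketch only; the weight shape and keep probability are placeholders), DropConnect masks the weight matrix itself rather than the hidden activations:

import numpy as np
import mxnet as mx

# Hypothetical input-to-hidden weight of one LSTM layer (4 gates stacked).
i2h_weight = mx.nd.ones((4 * 256, 128))

keep_prob = 0.8
weight_mask = (np.random.rand(*i2h_weight.shape) < keep_prob) / keep_prob
masked_weight = i2h_weight * mx.nd.array(weight_mask)  # mask W, not h_t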

magic282 commented 7 years ago

@sxjscience Sorry for the late reply. I think the code will run if we have a parallel corpus. Did it fail when you tried it?

goodmansasha commented 7 years ago

@sxjscience It might be called DropConnect, but apparently it's even older. Here is Yann LeCun's write-up on the history: https://www.facebook.com/yann.lecun/posts/10154058859142143 .

karishmamalkan commented 7 years ago

@sxjscience Hey, what is the progress on the API for RNN? Is it complete or still in progress?

sxjscience commented 7 years ago

@karishmamalkan Still in progress; we have decided to also add a batchnorm example. I will finish it after I finish the PQE. You can view some of the code here: https://github.com/ML-HK/mxnet/blob/master/python/mxnet/recurrent.py

karishmamalkan commented 7 years ago

Thanks @sxjscience. I wanted to know whether this is a working version of the code. When I try to import recurrent.py, I get an error about importing "utils". Is something missing?