clab / dynet

DyNet: The Dynamic Neural Network Toolkit

Highway RNN builder #1198

Open · danielhers opened 6 years ago

danielhers commented 6 years ago

A highway LSTM (Srivastava et al., 2015; Zhang et al., 2016; He et al., 2017) is similar to a deep BiLSTM, but each layer's output is linearly combined with its input before being passed on to the next layer, and the combination is controlled by a gate that depends on the previous time step's output and the previous layer's output. This alleviates vanishing gradients across layers. A highway RNN builder shouldn't be much harder to implement than the existing BiRNN builder.
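
For concreteness, here is a minimal sketch of the per-layer highway combination in DyNet-style Python (Wr, br, and Wh are illustrative parameter expressions, not part of any existing builder):

import dynet as dy

def highway_combine(h, x, Wr, br, Wh):
    # transform gate computed from the layer's output h and its input x
    r = dy.logistic(Wr * dy.concatenate([h, x]) + br)
    # gated mix of the layer's output and a linear projection of its input
    return dy.cmult(r, h) + dy.cmult(1 - r, Wh * x)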

tzshi commented 6 years ago

I was looking for an RHN implementation within dynet and found this issue already open.

The commit by @danielhers runs with the current dynet version with little adaptation, but my problem is how to apply dropout.

The highway connection gate takes x and h as input, and these should share the same dropout masks as the ones used inside the LSTM, but I don't see an obvious way to share the masks. Any suggestions?

danielhers commented 6 years ago

Hi @tzshi, please see my current implementation here. It's part of a larger data structure so it can't really be used as-is in any other project, but the transduce function demonstrates applying the dropout mask properly.

tzshi commented 6 years ago

Thanks for the response, @danielhers. L186-188 give a great demonstration of how to apply the dropout mask uniformly across time steps. I think you might also want to move that inside the loop at L181, as dropout probably needs to be applied at every layer.

Ideally, one might want to share the dropout masks on the input x with those used inside the LSTMs, but empirically it might not make much of a difference.
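
To illustrate the per-layer version, something like the following could sit inside the layer loop (a sketch only; the builder, dimensions, and the use of dy.random_bernoulli with a scale argument are my assumptions, not taken from the linked code):

import dynet as dy

def transduce_with_layer_dropout(rnn, xs, x_dim, p_drop):
    # Sample one mask for this layer and apply it to every time step's input,
    # so the mask is shared across time but fresh per layer (inverted-dropout scaling).
    keep = 1 - p_drop
    mask = dy.random_bernoulli((x_dim,), keep, scale=1.0 / keep)
    dropped = [dy.cmult(mask, x) for x in xs]
    return rnn.initial_state().transduce(dropped)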

danielhers commented 6 years ago

Right, I do need to move this inside the loop. Thanks! Why is it necessary to share the mask with the inputs, though? In recurrent dropout the mask is shared across timesteps, but not across layers.

tzshi commented 6 years ago

Sorry, I should have made it clearer that by "input" I meant the input to each layer. I was thinking of the original Gal dropout paper: Equation 7 has the masks on x and on h shared across time steps and across the gates. Highway networks just add an additional gate.
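
Roughly what I have in mind (a paraphrase, not the paper's exact notation; the parameter names below are illustrative):

import dynet as dy

def lstm_gates_shared_dropout(x_t, h_prev, mx, mh, Wxi, Whi, bi, Wxr, Whr, br):
    # mx / mh are dropout masks sampled once per sequence and layer;
    # every gate at every time step sees the same dropped x_t and h_prev.
    x_d, h_d = dy.cmult(mx, x_t), dy.cmult(mh, h_prev)
    i = dy.logistic(Wxi * x_d + Whi * h_d + bi)  # e.g. the input gate
    r = dy.logistic(Wxr * x_d + Whr * h_d + br)  # the additional highway gate, same masks
    return i, r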

BTW, I was checking the details in He et al., 2017, and I found that the implementation of highway connections needs to be more involved. Check their Equations 12 to 14. The additional gate is computed from h_{t-1} and x_t, and then used to compute h_t. And that h_t, with the highway connection applied, should participate in the calculations of the next time step, not only the next layer.

danielhers commented 6 years ago

Well, I'm actually applying dropout just to x but not to h_{t-1}, I think. But He et al., 2017 did the same, judging by their equation 16. As for the additional gate, I think I see what you mean. Since I'm calculating the whole r sequence and only then the whole h sequence, I'm actually using h_{t-1}^\prime and not h_{t-1} to calculate h_{t}, right? I'm referring to these lines:

# compute the whole layer in bulk: transduce first, then the gates and highway mixes
hs = self.params["rnn%d%s" % (i, n)].initial_state().transduce(xs[::d])
rs = [dy.logistic(Wr * dy.concatenate([h, x]) + br) for h, x in zip(hs[:-1], xs[::d][1:])]
xs = [hs[0]] + [dy.cmult(r, h) + dy.cmult(1 - r, Wh * x) for r, h, x in zip(rs, hs[1:], xs[::d][1:])]

So instead it should be something like:

xs_d = xs[::d]  # inputs in this direction (d = 1 or -1), matching the transduce call
hs_ = self.params["rnn%d%s" % (i, n)].initial_state().transduce(xs_d)
hs = [hs_[0]]
for t in range(1, len(hs_)):
    # gate computed from the previous highway output hs[t - 1], not the raw LSTM output
    r = dy.logistic(Wr * dy.concatenate([hs[t - 1], xs_d[t]]) + br)
    hs.append(dy.cmult(r, hs_[t]) + dy.cmult(1 - r, Wh * xs_d[t]))
xs = hs

tzshi commented 6 years ago

As for the dropout, I think He et al. didn't apply dropout to the input. But if you follow their scheme, you'll see that the same mask on h (Equation 16) is used in computing all the gates, i.e., Equations 3, 4, 5, 12, as well as 6.

For the second part, I meant that we may not be able to use the transduce function directly; we need to modify what happens inside it, since the h_t we get from Equation 14 has to participate in the recurrent calculations of Equations 3-8 (and 12-14, of course).

danielhers commented 6 years ago

I see, so it's not just the r gate but all the gates that need to be updated step by step along with h, rather than in bulk. I guess that does require going into the LSTM internals, which I was trying to avoid...
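
If we do go inside, one possible shape for it is to step the LSTM manually and push the highway output back into the recurrence. This is only a sketch, assuming RNNState.set_h can replace the recurrent h while keeping the memory cell, and with Wr, br, Wh as illustrative parameters:

import dynet as dy

def highway_lstm_layer(rnn, xs, Wr, br, Wh):
    # rnn is assumed to be a one-layer LSTM builder for one direction
    state = rnn.initial_state().add_input(xs[0])
    hs = [state.output()]                # first step: no highway combination
    for x in xs[1:]:
        # gate from the previous highway output and the current input
        r = dy.logistic(Wr * dy.concatenate([hs[-1], x]) + br)
        state = state.add_input(x)       # raw LSTM step
        h = dy.cmult(r, state.output()) + dy.cmult(1 - r, Wh * x)
        state = state.set_h([h])         # let the next step recur over the highway h
        hs.append(h)
    return hs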
