Open liziru opened 3 years ago
Hi @liziru, thanks for posting the great questions.
(1) Difference between the two versions. In our first arxiv version, we didn't include the element-wise hidden-to-hidden connection (v * c_{t-1}) for simplicity; it is easier to derive the backward computation in this case. We added the hidden-to-hidden connection to obtain additional modeling capacity and better empirical results in our second version (the EMNLP submission). See, for example, the comparison of the two variants in Table 5 and Figure 5 of the EMNLP paper.
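For concreteness, here is a minimal sketch (illustrative tensor names, not the repo's actual code) of how the forget-gate computation differs between the two versions:

```python
import torch

def forget_gate_v1(W_f, b_f, x_t):
    # v1 (first arxiv version): the gate depends only on the current input x_t,
    # so the gate pre-activations for all time steps can be precomputed at once.
    return torch.sigmoid(x_t @ W_f.T + b_f)

def forget_gate_v2(W_f, v_f, b_f, x_t, c_prev):
    # v2 (EMNLP version): adds the element-wise hidden-to-hidden term v_f * c_{t-1},
    # giving the gate access to the previous state at the cost of a small
    # per-step element-wise product.
    return torch.sigmoid(x_t @ W_f.T + v_f * c_prev + b_f)
```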
(2) Parallelization. There are two different kinds of parallelization here. First, the matrix multiplications in (1)-(4) can be grouped into a single one and therefore parallelized across both time steps and hidden dimensions. Second, you are correct that the point-wise computation cannot be parallelized across time steps, but it can be parallelized across the hidden dimension since each dimension is independent. In other words, to compute c[t][i] you first need c[t-1][i], but you can compute c[*][i] and c[*][j] in parallel for i != j. Note that the second kind of parallelization is not available in LSTM / GRU, since in these RNNs the hidden-to-hidden connections are fully connected.
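A simplified sketch of the two kinds of parallelism, assuming an SRU-like recurrence with only the transform and forget gate (the reset gate and highway connection are omitted, and this is not the library's optimized CUDA kernel):

```python
import torch

def sru_forward_sketch(x, W, v_f, b_f):
    """Illustrative only. x: (seq_len, batch, input_dim); W packs the weight
    matrices for the transformed input and the forget gate, shape (2*hidden, input_dim)."""
    seq_len, batch, _ = x.shape
    # (1) All matrix multiplications are grouped into one big matmul and run in
    #     parallel over every time step and every hidden dimension, because they
    #     do not depend on c_{t-1}.
    U = x.reshape(seq_len * batch, -1) @ W.T
    U = U.reshape(seq_len, batch, 2, -1)
    x_tilde, f_pre = U[:, :, 0], U[:, :, 1]

    hidden = x_tilde.size(-1)
    c = x.new_zeros(batch, hidden)
    outputs = []
    # (2) The element-wise recurrence below is sequential over t, but each of the
    #     `hidden` dimensions is updated independently of the others, so a CUDA
    #     kernel can assign one thread per (batch, dimension) pair.
    for t in range(seq_len):
        f = torch.sigmoid(f_pre[t] + v_f * c + b_f)   # element-wise only
        c = f * c + (1.0 - f) * x_tilde[t]
        outputs.append(c)
    return torch.stack(outputs), c
```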
Hope this helps.
Thanks for your reply. @taolei87
Is this GitHub implementation the first arxiv version or the second version (the EMNLP submission)?
I want to run some tests on the first version for a few reasons. Where can I find the implementation of the first version?
I would appreciate it very much.
@liziru It is version 2 by default, but you can test v1 by passing v1 = True in the SRU or SRUCell constructor.
See https://github.com/asappresearch/sru/blob/master/sru/modules.py#L444
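A quick usage sketch, assuming the constructor argument shown in modules.py (the sizes here are arbitrary):

```python
import torch
from sru import SRU

# Default: version 2 (EMNLP), with the element-wise hidden-to-hidden connection.
rnn_v2 = SRU(input_size=128, hidden_size=128, num_layers=2)

# First arxiv version: pass v1=True to drop the v * c_{t-1} term in the gates.
rnn_v1 = SRU(input_size=128, hidden_size=128, num_layers=2, v1=True)

x = torch.randn(50, 16, 128)   # (seq_len, batch, input_size)
output, state = rnn_v1(x)      # output: (seq_len, batch, hidden_size)
```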
Great! Thank you very much.
Thanks for your work. @taoleicn @taolei87
In the paper 'Simple Recurrent Units for Highly Parallelizable Recurrence', I found the following computation:
I don't see how this becomes parallelizable when the point-wise multiplication over the dimensions of the state vector is used instead. Since c_{t-1} is still needed, I think this SRU cannot be parallelized.
Besides, I also found the computation in 'Training RNNs as Fast as CNNs', which I think can be parallelized.
Also, what are the differences between these two papers? I raised the same question on Zhihu.
Looking forward to your reply.