Open liziru opened 3 years ago
Hi @liziru, thanks for posting the great questions.
(1) Difference between the two versions. In our first arxiv version, we didn't include the element-wise hidden-to-hidden connection (v * c_{t-1}) for simplicity; it is easier to derive the backward computation in this case. We added the hidden-to-hidden connection to obtain additional modeling capacity and better empirical results in our second version (the EMNLP submission). See, for example, the comparison of the two variants in Table 5 and Figure 5 of the EMNLP paper.
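For concreteness, here is a minimal sketch (illustrative tensor names, not the repo's actual code) of how the forget-gate computation differs between the two versions:

```python
import torch

def forget_gate_v1(W_f, b_f, x_t):
    # v1 (first arxiv version): the gate depends only on the current input x_t,
    # so the gate pre-activations for all time steps can be precomputed at once.
    return torch.sigmoid(x_t @ W_f.T + b_f)

def forget_gate_v2(W_f, v_f, b_f, x_t, c_prev):
    # v2 (EMNLP version): adds the element-wise hidden-to-hidden term v_f * c_{t-1},
    # giving the gate access to the previous state at the cost of a small
    # per-step element-wise product.
    return torch.sigmoid(x_t @ W_f.T + v_f * c_prev + b_f)
```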
(2) Parallelization. There are two different kinds of parallelization here. First, the matrix multiplications in (1)-(4) can be grouped into a single one and therefore parallelized across both time steps and hidden dimensions. Second, you are correct that the point-wise computation cannot be parallelized across time steps, but it can be parallelized across the hidden dimension since each dimension is independent. In other words, to compute c[t][i] you first need c[t-1][i], but you can compute c[*][i] and c[*][j] in parallel for i != j. Note that the second kind of parallelization is not available in LSTM / GRU, since in these RNNs the hidden-to-hidden connections are fully connected.
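A simplified sketch of the two kinds of parallelism, assuming an SRU-like recurrence with only the transform and forget gate (the reset gate and highway connection are omitted, and this is not the library's optimized CUDA kernel):

```python
import torch

def sru_forward_sketch(x, W, v_f, b_f):
    """Illustrative only. x: (seq_len, batch, input_dim); W packs the weight
    matrices for the transformed input and the forget gate, shape (2*hidden, input_dim)."""
    seq_len, batch, _ = x.shape
    # (1) All matrix multiplications are grouped into one big matmul and run in
    #     parallel over every time step and every hidden dimension, because they
    #     do not depend on c_{t-1}.
    U = x.reshape(seq_len * batch, -1) @ W.T
    U = U.reshape(seq_len, batch, 2, -1)
    x_tilde, f_pre = U[:, :, 0], U[:, :, 1]

    hidden = x_tilde.size(-1)
    c = x.new_zeros(batch, hidden)
    outputs = []
    # (2) The element-wise recurrence below is sequential over t, but each of the
    #     `hidden` dimensions is updated independently of the others, so a CUDA
    #     kernel can assign one thread per (batch, dimension) pair.
    for t in range(seq_len):
        f = torch.sigmoid(f_pre[t] + v_f * c + b_f)   # element-wise only
        c = f * c + (1.0 - f) * x_tilde[t]
        outputs.append(c)
    return torch.stack(outputs), c
```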
Hope this helps.
Thanks for your reply. @taolei87
Is this GitHub implementation the first arxiv version or the second version (the EMNLP submission)?
I want to run some tests on the first version for a few reasons. Where can I find the implementation of the first version?
I would appreciate it very much.
@liziru It is version 2 by default, but you can test v1 by passing v1 = True in the SRU or SRUCell constructor.
See https://github.com/asappresearch/sru/blob/master/sru/modules.py#L444
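A quick usage sketch, assuming the constructor argument shown in modules.py (the sizes here are arbitrary):

```python
import torch
from sru import SRU

# Default: version 2 (EMNLP), with the element-wise hidden-to-hidden connection.
rnn_v2 = SRU(input_size=128, hidden_size=128, num_layers=2)

# First arxiv version: pass v1=True to drop the v * c_{t-1} term in the gates.
rnn_v1 = SRU(input_size=128, hidden_size=128, num_layers=2, v1=True)

x = torch.randn(50, 16, 128)   # (seq_len, batch, input_size)
output, state = rnn_v1(x)      # output: (seq_len, batch, hidden_size)
```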
Great! Thank you very much.
Thanks for your work. @taoleicn @taolei87
In the paper 'Simple Recurrent Units for Highly Parallelizable Recurrence', I found the following computation:
I don't see how this becomes parallelizable when the point-wise multiplication over the dimensions of the state vector is used instead. Since c_{t-1} is still needed, I think this SRU cannot be parallelized.
Besides, I also found the computation in 'Training RNNs as Fast as CNNs', which I think can be parallelized.
Also, what are the differences between these two papers? I raised the same question on Zhihu.
Looking forward to your reply.