[Open] Jerry-Master opened this issue 10 months ago
Unlike the LSTM, the GRU by design does not have a separate hidden state and forward output; they are the same tensor. See this diagram.
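For anyone reading along, here is a minimal sketch of one GRU step that makes the point concrete: the tensor returned at each step serves both as the layer output and as the hidden state carried to the next time step. The weight names (`Wz`, `Uz`, etc.) are my own, not from the repo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. The returned tensor is BOTH the layer output at
    this time step and the hidden state passed to the next step --
    unlike the LSTM, there is no separate cell state."""
    z = sigmoid(x @ Wz + h_prev @ Uz)               # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)   # candidate state
    h = (1 - z) * h_prev + z * h_tilde              # new state = output
    return h
```

Contrast this with an LSTM cell, which would return a `(h, c)` pair and expose only `h` to the next layer.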
The (1 - z) term was placed opposite to the paper's notation, but the two forms are equivalent.
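A quick numeric check of that equivalence: since sigmoid(-a) = 1 - sigmoid(a), swapping z and (1 - z) in the interpolation amounts to negating the gate's pre-activation, which the learned weights can absorb. The values below are arbitrary, just for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a = np.array([0.3, -1.2, 2.0])        # gate pre-activation (arbitrary)
h_prev = np.array([0.5, -0.4, 0.9])   # previous hidden state (arbitrary)
h_tilde = np.array([0.1, 0.7, -0.3])  # candidate state (arbitrary)

# One convention: new state = (1 - z) * h_prev + z * h_tilde
z = sigmoid(a)
h_one = (1 - z) * h_prev + z * h_tilde

# The other convention, with the gate pre-activation negated:
# sigmoid(-a) = 1 - sigmoid(a), so the weights just flip sign.
z_flipped = sigmoid(-a)
h_other = z_flipped * h_prev + (1 - z_flipped) * h_tilde

assert np.allclose(h_one, h_other)
```

So training can reach exactly the same functions under either convention; only the sign of the gate's weights differs.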
So I believe my original implementation was correct.
I mean, you say in the article that o_t is the output of the layer and h is the hidden state, so it makes sense to pass the output to the next layer and the hidden state to the next time step. I was wondering whether you have tried both, or have any intuition about which option performs better, because computationally they are very similar.
Looking at the formulas in your article, I see that your GRU implementation does not match the code you provide. I don't want you to merge this fork, since it would break compatibility, but I'm leaving it here in case you want to discuss the performance of this fixed ConvGRU implementation. It seems you are recycling the hidden state as if it were the forward activation. That is a valid approach, but it seems more reasonable to me to separate the hidden state from the forward activation.
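To make the separation I'm proposing concrete, here is a rough sketch of a ConvGRU cell that returns a distinct forward activation alongside the hidden state. The `out_conv` projection head is my own assumption for illustration; it is not taken from your repo or from my fork.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Sketch of a ConvGRU cell separating the hidden state from the
    forward activation. The 1x1 out_conv head is a hypothetical way
    to produce a distinct output for the next layer."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)
        self.out_conv = nn.Conv2d(hid_ch, hid_ch, 1)  # hypothetical output head

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)                      # update / reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        h_new = (1 - z) * h + z * h_tilde              # hidden state -> next time step
        o = self.out_conv(h_new)                       # forward activation -> next layer
        return o, h_new
```

With this shape, the caller passes `o` up the stack and carries `h_new` forward in time, exactly mirroring the o_t / h distinction from the article.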