egormalyutin opened this issue 1 month ago · Open
From my understanding, we convolve over the entire sequence and not just the last 4 inputs. I'm sure there are many different ways to interpret the paper, so I am open to discussing alternatives and trying them out.
Hello! Thank you for your implementation. However, I have a little question regarding the use of convolution.
Does it mean that in your implementation, you are not actually doing convolution over the sequence? After reading the paper, I was left with the impression that you should project the 4 latest inputs and then apply `swish` to obtain `i` and `f`. Sorry if I'm wrong, as I don't really use PyTorch.
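
To make the two readings concrete, here is a minimal framework-free sketch. It is not taken from the repository or the paper. It assumes a single scalar channel, a kernel of size 4 with made-up weights, and left zero-padding. The helper names `causal_conv1d` and `last_k_projection` are hypothetical. Under those assumptions, sliding a size-4 causal kernel over the entire sequence gives, at each step, the same value as projecting only the 4 latest inputs, so the two descriptions may be the same operation. The later step that would produce `i` and `f` from the activated output is not shown.

```python
import math


def swish(x: float) -> float:
    # swish / SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def causal_conv1d(xs, kernel):
    """Slide `kernel` over the whole sequence with left zero-padding.

    Output t depends only on xs[t-K+1 .. t], where K = len(kernel).
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(xs))
    ]


def last_k_projection(xs, kernel, t):
    """Project only the K latest inputs at step t (zeros before the start)."""
    k = len(kernel)
    window = [xs[i] if i >= 0 else 0.0 for i in range(t - k + 1, t + 1)]
    return sum(w * x for w, x in zip(kernel, window))


xs = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]   # toy input sequence
kernel = [0.1, -0.2, 0.3, 0.4]          # hypothetical weights, kernel size 4

# Reading 1: convolve over the entire sequence, then apply swish.
conv_out = [swish(v) for v in causal_conv1d(xs, kernel)]

# Reading 2: at every step, project the 4 latest inputs, then apply swish.
window_out = [swish(last_k_projection(xs, kernel, t)) for t in range(len(xs))]

# The two readings agree at every step.
print(all(math.isclose(a, b) for a, b in zip(conv_out, window_out)))
```

If the actual implementation uses a non-causal or wider convolution, the two readings would differ, so this only shows that they can coincide.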