draguve closed 9 months ago
Thanks very much for this!
I've changed the target branch to a new branch, einops, to keep the main implementation simple and clear while still offering a faster implementation for those who need it.
I rewrote the forward parallel function of MultiScaleRetention so that the matrix multiplications for all of the heads happen at the same time instead of serially. I see a speedup of about 5x while training.
For some of the operations I used the einops package.
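For anyone reading along, here is a rough sketch of the idea, not the code from this PR. It uses plain NumPy so it runs without PyTorch or einops, and it leaves out details such as positional encoding, normalization and gating. The function names, weight shapes and the per-head decay values `gammas` are assumptions made up for the illustration. It compares a per-head loop with a version that carries the head axis through a single `einsum` per step.

```python
import numpy as np


def decay_mask(seq_len, gamma):
    # D[n, m] = gamma ** (n - m) when n >= m, else 0 (causal decay mask).
    n = np.arange(seq_len)[:, None]
    m = np.arange(seq_len)[None, :]
    return np.where(n >= m, gamma ** np.maximum(n - m, 0), 0.0)


def retention_serial(x, w_q, w_k, w_v, gammas):
    # x: (batch, seq, hidden); w_q, w_k, w_v: (heads, hidden, head_dim).
    # Hypothetical layout: one projection matrix per head, processed in a loop.
    outs = []
    for h, gamma in enumerate(gammas):
        q = x @ w_q[h]                      # (batch, seq, head_dim)
        k = x @ w_k[h]
        v = x @ w_v[h]
        d = decay_mask(x.shape[1], gamma)   # (seq, seq)
        scores = (q @ k.transpose(0, 2, 1)) * d
        outs.append(scores @ v)
    # Concatenate the heads along the feature axis.
    return np.concatenate(outs, axis=-1)


def retention_batched(x, w_q, w_k, w_v, gammas):
    # Same computation, but the head axis "h" is part of every einsum,
    # so all heads are multiplied in one call instead of one call per head.
    q = np.einsum("bsd,hde->bhse", x, w_q)
    k = np.einsum("bsd,hde->bhse", x, w_k)
    v = np.einsum("bsd,hde->bhse", x, w_v)
    d = np.stack([decay_mask(x.shape[1], g) for g in gammas])  # (heads, seq, seq)
    scores = np.einsum("bhse,bhte->bhst", q, k) * d
    out = np.einsum("bhst,bhte->bhse", scores, v)
    # With einops this reshape would read: rearrange(out, "b h s e -> b s (h e)")
    b, h, s, e = out.shape
    return out.transpose(0, 2, 1, 3).reshape(b, s, h * e)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 5, 8))
    w_q, w_k, w_v = (rng.standard_normal((3, 8, 4)) for _ in range(3))
    gammas = [0.9, 0.95, 0.99]
    serial = retention_serial(x, w_q, w_k, w_v, gammas)
    batched = retention_batched(x, w_q, w_k, w_v, gammas)
    print(np.allclose(serial, batched))
```

Both functions should give the same result. Any speedup from the batched form depends on the hardware and framework, so this sketch says nothing about the 5x figure above.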