Jamie-Stirling / RetNet

An implementation of "Retentive Network: A Successor to Transformer for Large Language Models"
MIT License
1.16k stars, 100 forks

About the complex #1

Closed: KohakuBlueleaf closed this issue 1 year ago

KohakuBlueleaf commented 1 year ago

Sorry to bother you, and this may be a dumb question: what is the complex type used for here?

I'm not very good at math, so it would be great if you could explain why we need to use complex numbers.

Jamie-Stirling commented 1 year ago

Hi KohakuBlueleaf,

Please refer to equation (3) in the original paper. The authors introduce an imaginary term (iθ), which implies that values derived from this term are also complex. An advantage of complex types is that a single scalar can encode rotation information as well as magnitude; however, it also means each value takes twice the memory of a real value at the same precision.

The paper does not discuss whether an implementation should use complex types, but this implementation does because that closely matches the mathematical formulation in the original paper.
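
As a rough illustration of the "rotation plus magnitude" point, here is a minimal sketch in plain Python (standard library only, not the PyTorch code in this repo). The names `gamma` and `theta` and their values are assumptions chosen for the example, not values taken from the paper or from this implementation.

```python
import cmath
import math

gamma = 0.9          # magnitude (decay) factor, illustrative value
theta = math.pi / 6  # rotation angle in radians, illustrative value

# One complex scalar that carries both the scaling and the rotation.
z = gamma * cmath.exp(1j * theta)

x = complex(1.0, 0.0)  # an arbitrary input value
y = z * x              # scale by gamma and rotate by theta in one multiply

# How much the magnitude and the angle changed.
magnitude_ratio = abs(y) / abs(x)
angle_shift = cmath.phase(y) - cmath.phase(x)

print(magnitude_ratio, angle_shift)
```

The single multiplication changes the magnitude by `gamma` and the angle by `theta`, which is what a real scalar alone cannot express.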

I know there's probably a way to implement this without complex types and would be grateful for any further insight from anyone who understands this.

Fr0do commented 1 year ago

I guess the complex exponential is not evaluated explicitly; rather, Euler's formula is used with its real cosine and sine counterparts, as in the xPos paper, and the imaginary part is only responsible for rotating the inputs. I suppose the authors would use the original xPos implementation: https://github.com/microsoft/torchscale/blob/main/torchscale/component/xpos_relative_position.py
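
To make the Euler's formula point concrete, here is a small sketch in plain Python showing that multiplying by exp(iθ) and applying a real cosine/sine rotation give the same result. The helper names `rotate_complex` and `rotate_real` are hypothetical and written for this example; they are not functions from torchscale or from this repo.

```python
import cmath
import math


def rotate_complex(a, b, theta):
    """Rotate the pair (a, b) by multiplying a + ib with exp(i * theta)."""
    z = complex(a, b) * cmath.exp(1j * theta)
    return z.real, z.imag


def rotate_real(a, b, theta):
    """Same rotation using only real arithmetic, via Euler's formula:
    (a + ib)(cos t + i sin t) = (a cos t - b sin t) + i (a sin t + b cos t).
    """
    c, s = math.cos(theta), math.sin(theta)
    return a * c - b * s, a * s + b * c


theta = 0.3  # illustrative angle
via_complex = rotate_complex(1.5, -2.0, theta)
via_real = rotate_real(1.5, -2.0, theta)

print(via_complex, via_real)
```

Both calls should agree up to floating-point rounding, so a complex dtype is a convenience here rather than a requirement.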

KohakuBlueleaf commented 1 year ago

> I guess the complex exponential is not evaluated explicitly; rather, Euler's formula is used with its real cosine and sine counterparts, as in the xPos paper, and the imaginary part is only responsible for rotating the inputs. I suppose the authors would use the original xPos implementation: https://github.com/microsoft/torchscale/blob/main/torchscale/component/xpos_relative_position.py

I agree with you.

Jamie-Stirling commented 1 year ago

> I guess the complex exponential is not evaluated explicitly; rather, Euler's formula is used with its real cosine and sine counterparts, as in the xPos paper, and the imaginary part is only responsible for rotating the inputs. I suppose the authors would use the original xPos implementation: https://github.com/microsoft/torchscale/blob/main/torchscale/component/xpos_relative_position.py

Would this require double the number of components (one for real, one for imaginary) if using real vectors to represent the data?
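
One way to think about this question, as a sketch only: if adjacent real features are paired up, with each pair treated as the real and imaginary parts of one complex number, then a real vector of length d corresponds to d/2 complex numbers and its length does not change. Doubling would only occur if every existing real feature were promoted to its own complex number. The adjacent-pair layout and the helper `rotate_pairs` below are assumptions made for illustration (the layout is the one commonly used by rotary-style position embeddings), not a confirmed description of the authors' code.

```python
import cmath
import math


def rotate_pairs(x, thetas):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by thetas[k]."""
    if len(x) != 2 * len(thetas):
        raise ValueError("need exactly one angle per pair of features")
    out = []
    for k, theta in enumerate(thetas):
        a, b = x[2 * k], x[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out


x = [1.0, 2.0, 3.0, 4.0]  # d = 4 real features, illustrative values
thetas = [0.1, 0.7]       # d / 2 angles, illustrative values
y = rotate_pairs(x, thetas)

# Reference: the same data viewed as d / 2 complex numbers.
z = [
    complex(x[2 * k], x[2 * k + 1]) * cmath.exp(1j * t)
    for k, t in enumerate(thetas)
]

print(len(x), len(y), len(z))
```

Under this layout the real vector stays at d components while matching the d/2 complex values, so memory use is the same as for the original real data.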