Closed Jamie-Stirling closed 11 months ago
@Jamie-Stirling https://github.com/microsoft/unilm/issues/1213
You could also refer to https://github.com/microsoft/torchscale/commit/bf65397b26469ac9c24d83a9b779b285c1ec640b
@donglixp Thanks so much for your comment, it was critical to solving this issue.
There was also another term that is omitted from equation (7) in the paper but is present in the torchscale implementation. Please see line 85 of retention.py:
r_i = (K.transpose(-1, -2) @ (V * D[-1].view(1, chunk_size, 1))) + (self.gamma ** chunk_size) * r_i_1
In particular:
D[-1].view(1, chunk_size, 1)
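To see why the D[-1] decay factor matters, here is a minimal single-head sketch (NumPy, hypothetical shapes and names, not the torchscale code itself) that checks the chunkwise recurrence with this factor against fully recurrent retention:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, B, gamma = 8, 4, 4, 0.9            # seq len, head dim, chunk size, decay
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Fully recurrent reference: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n
S = np.zeros((d, d))
out_rec = np.zeros((T, d))
for n in range(T):
    S = gamma * S + np.outer(K[n], V[n])
    out_rec[n] = Q[n] @ S

# Chunkwise: intra-chunk attention plus a carried cross-chunk state R
idx = np.arange(B)
D = np.tril(gamma ** (idx[:, None] - idx[None, :]))   # D[n, m] = gamma^(n-m) for n >= m
xi = gamma ** (idx + 1)                               # per-row decay applied to the carried state
R = np.zeros((d, d))
out_chunk = np.zeros((T, d))
for i in range(0, T, B):
    q, k, v = Q[i:i+B], K[i:i+B], V[i:i+B]
    inner = ((q @ k.T) * D) @ v                       # within-chunk retention
    cross = (q * xi[:, None]) @ R                     # contribution from previous chunks
    out_chunk[i:i+B] = inner + cross
    # State update with the D[-1] factor: each key/value pair is decayed by
    # gamma^(B-1-m) before being added, so R stays aligned with S at chunk end
    R = k.T @ (v * D[-1][:, None]) + (gamma ** B) * R

print(np.allclose(out_rec, out_chunk))
```

Dropping the `D[-1]` weighting in the state update makes the two paradigms diverge, which is consistent with the discrepancy described in this issue.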
Equation (7) of the latest arXiv paper (https://arxiv.org/pdf/2307.08621v4.pdf) fixed the issue.
The implementation of the chunkwise retention paradigm on the chunkwise-real branch gives different outputs from the other two paradigms.
It appears there may be a mistake in equation (7) of the paper on which the implementation was based. A pull request fixing this and producing outputs consistent with the other two paradigms would be greatly appreciated.
This can be reproduced by running `python src/tests.py`, with stdout: