Closed: agokrani closed this issue 8 months ago
Hi! Thanks for sharing this! We also found what you mentioned: "The outputs within the context size are even better at following instructions than the actual model." 😄 Considering Phi-2's context window is only 2k, we suggest using 386-1024 as the neighbor window. A group size of 3~8 should be good; maybe a larger group can work even better. We also implemented Phi-2 ourselves with a KV cache, so you may check that out. But we haven't finished testing it!
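For anyone tuning these two knobs, the way they interact can be sketched with the merged-attention position mapping from the SelfExtend paper: pairs closer than the neighbor window use ordinary relative positions, while more distant pairs use floor-divided (grouped) positions, shifted so the two regimes join at the window boundary. This is a minimal illustrative sketch in my own names, not code from the LongLM repo:

```python
def self_extend_rel_pos(i, j, group_size, neighbor_window):
    """Relative position used for a (query i, key j) pair under
    SelfExtend-style merged attention (a sketch, not the repo's code)."""
    rel = i - j
    if rel < neighbor_window:
        # normal attention inside the neighbor window
        return rel
    # grouped attention: floor positions by group size, then shift so the
    # grouped regime lines up with the neighbor regime at the boundary
    shift = neighbor_window - neighbor_window // group_size
    return i // group_size - j // group_size + shift

# nearby pair: unchanged relative position
print(self_extend_rel_pos(10, 8, group_size=4, neighbor_window=8))   # -> 2
# distant pair: grouped position stays within the pretrained range
print(self_extend_rel_pos(100, 0, group_size=4, neighbor_window=8))  # -> 31
```

A larger group size keeps distant positions smaller (so you can extend further) at the cost of coarser position resolution, which is why the 3~8 range is a reasonable starting point for a 2k model.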
Hey @Mooler0410,
Cool. At least this gives me hope that my implementation is headed in the right direction. Looking at the Phi-2 implementation, the new grouped values are not added to the cache; I was actually thinking about adding this by extending the forward of the `PhiModel` class and using a custom cache class. Also, it seems they don't do anything with the cos/sin values they store in the cache. Maybe I am wrong, what do you think?
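The custom-cache idea could look roughly like this. This is a toy stand-in, not the actual `transformers` `Cache` API, and all the names are mine; the point is just that grouped keys are stored per layer alongside the normal keys/values so they don't have to be recomputed on every decoding step:

```python
class GroupedKVCache:
    """Toy per-layer cache that also keeps grouped (floor-divided position)
    keys next to the normal keys/values. Illustrative only; a real version
    would subclass the transformers Cache class and store tensors."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]
        self.grouped_keys = [[] for _ in range(num_layers)]

    def update(self, layer_idx, key, value, grouped_key):
        # append the new step's states and return the full history,
        # mirroring the append-then-return pattern of a decoding cache
        self.keys[layer_idx].append(key)
        self.values[layer_idx].append(value)
        self.grouped_keys[layer_idx].append(grouped_key)
        return (self.keys[layer_idx],
                self.values[layer_idx],
                self.grouped_keys[layer_idx])

cache = GroupedKVCache(num_layers=2)
ks, vs, gks = cache.update(0, "k0", "v0", "gk0")
ks, vs, gks = cache.update(0, "k1", "v1", "gk1")
print(len(ks), len(gks))  # -> 2 2
```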
You are correct! We were too lazy to switch over to the new cache class.
Great 👍 will give this a shot. If I am successful, I will make a pull request.
Hi,
Loved this paper and implementation. I implemented this for Phi-2 with transformers==4.36.2, without caching. The outputs within the context size are even better at following instructions than the actual model. However, when going beyond the context window, I am seeing repetition. This might be due to extending the context window a bit too much. Do you have any suggestions for experimenting with different group and neighbor sizes, or any other insights?
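One thing worth checking when repetition appears: there is a hard ceiling on how far this scheme can extend. Once grouped positions themselves exceed the pretrained range, behavior degrades. A rough capacity rule (my reading of the paper's setup, with example numbers, not measured results) is:

```python
def max_extended_length(pretrained_ctx, neighbor_window, group_size):
    """Rough ceiling on the sequence length SelfExtend-style grouping can
    handle: grouped positions reuse the (pretrained_ctx - neighbor_window)
    slots, each covering group_size tokens."""
    return (pretrained_ctx - neighbor_window) * group_size + neighbor_window

# Phi-2's 2k window with an example neighbor window of 512 and group size 8
print(max_extended_length(2048, 512, 8))  # -> 12800
```

If your inputs exceed this kind of ceiling for your chosen settings, repetition past that point would be expected rather than a bug.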
Here is my implementation: https://github.com/agokrani/LongLM/tree/phi2
I haven't implemented KV caching for now due to the change in the KV cache format in transformers. I will try to implement it soon. Would love to hear your thoughts.