Beomi / InfiniTransformer

Unofficial PyTorch/🤗Transformers(Gemma/Llama3) implementation of Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
https://arxiv.org/abs/2404.07143
MIT License

Model loses information very quickly #25

Open Lazy3valuation opened 4 months ago

Lazy3valuation commented 4 months ago

Hi! I trained the model with LoRA and 8-bit precision down to a training loss of 1.5/2.5. Generation is segment-wise, but the model does not seem to generate correct text. It cannot pass a needle-in-a-haystack (NIAH) test even on small inputs (fewer tokens than the segment size, which is 400 for me). It starts to spit out nonsense very quickly. For example, I've tried a NIAH test with this pattern: "There is an important info hidden inside a lot of irrelevant text. Find it and memorize it. I will quiz you about the important information there." Then the loop "\nThe grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is blue. The sun is yellow. Here we go. There and back again." repeats many times (I repeated it enough times to reach 400 tokens, 3600 tokens, and 10k tokens). At a random position inside the loop there's a "\nThe pass key is 72498. Remember it. 72498 is the pass key.". At the end of the prompt there's "What is the pass key? The pass key is ", and the base model completes it correctly with 72498 up to 3600 tokens (beyond that my GPU goes OOM).
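The passkey test described above can be reproduced programmatically. Here's a minimal sketch of a prompt builder; the filler/needle/question strings come from the comment, while `tokens_per_filler` is a rough assumption about tokenized length that you'd tune for your tokenizer:

```python
import random


def build_passkey_prompt(pass_key: str, target_tokens: int,
                         tokens_per_filler: int = 22) -> str:
    """Build a needle-in-a-haystack passkey prompt like the one above.

    `tokens_per_filler` is a rough guess at the tokenized length of one
    filler sentence block; adjust it for your actual tokenizer.
    """
    preamble = (
        "There is an important info hidden inside a lot of irrelevant text. "
        "Find it and memorize it. I will quiz you about the important "
        "information there."
    )
    filler = (
        "\nThe grass is green. The sky is blue. The sun is yellow. "
        "Here we go. There and back again."
    )
    needle = f"\nThe pass key is {pass_key}. Remember it. {pass_key} is the pass key."
    question = "\nWhat is the pass key? The pass key is "

    # Repeat the filler until we roughly reach the target length,
    # then hide the needle at a random position inside the loop.
    n_fillers = max(1, target_tokens // tokens_per_filler)
    parts = [filler] * n_fillers
    parts.insert(random.randrange(n_fillers), needle)
    return preamble + "".join(parts) + question
```

Checking whether the model's completion contains `pass_key` then gives a pass/fail signal at each context length (400, 3600, 10k tokens).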

With Infini-attention, the model never completes it correctly. Moreover, the repeated pattern gets "broken"; here's a completion example: " The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is bluer. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is bluer. The sun is yellow. Here we go. There and back again.\nThe grass is green. The sky is blue. The sun is yellow. Here we will. They will be a bit of the distance, at least we"

It behaves as if the model can't retain information at all, or only for a very short time. Has anyone tested how well these models perform? I sadly noticed that the repo hasn't been updated in a month :-(

Thirvin commented 4 months ago

I trained Infini-Llama on arXiv papers. The result is similar to yours: the model can't make use of the attention states compressed into memory, and its outputs bear little relation to the content I provided.

LWL-cpu commented 3 months ago

I also encountered a similar issue. For example, on a question-answering task, I feed the passage into the model and save the memory, then give the model the question query along with the passage memory. However, the model just repeats my query and is unable to answer the question.
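For what it's worth, the workflow I'm describing looks roughly like the sketch below. The `step` callable stands in for one forward pass of an Infini-attention model that takes the previous compressive memory and returns an updated one; the names and interface are illustrative only, not this repo's actual API:

```python
from typing import Callable, List, Optional, Tuple

# Placeholder types: a "memory" is whatever the model carries between
# segments; `step(segment, memory) -> (output, new_memory)` is a stand-in
# for one segment-wise forward pass, NOT the repo's real interface.
Memory = Optional[object]
Step = Callable[[List[int], Memory], Tuple[object, Memory]]


def split_into_segments(ids: List[int], seg_len: int) -> List[List[int]]:
    """Chunk a token sequence into fixed-size segments (last may be short)."""
    return [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]


def run_with_memory(passage_ids: List[int], query_ids: List[int],
                    seg_len: int, step: Step) -> object:
    """Feed the passage segment by segment, carrying memory forward,
    then answer the query against the accumulated memory."""
    memory: Memory = None
    for seg in split_into_segments(passage_ids, seg_len):
        _, memory = step(seg, memory)  # compress this segment into memory
    answer, _ = step(query_ids, memory)  # query conditioned on the memory
    return answer
```

The failure mode reported in this thread is that the answer produced in the last call appears not to depend on the memory accumulated from the passage.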