ggerganov / llama.cpp

LLM inference in C/C++

Variable density context windows? #660

Closed: GeorgeUCB closed this issue 6 months ago

GeorgeUCB commented 1 year ago

I am currently off my meds, but I would like to propose an idea to enhance the context handling capabilities of the LLaMA/Alpaca/gpt4all models, particularly for the smaller models with limited context window sizes. I'm not sure if this is entirely doable within the current architecture, or if changes would be needed to the underlying LLMs, but I wanted to share my thoughts and get your feedback.

The Problem:

As you all know, smaller models have limited context window sizes, which makes it hard to maintain long conversations, especially when the LLM is used as a chatbot rather than for unrelated one-off queries. This limitation affects the model's overall performance and its ability to provide accurate and coherent responses.

The Proposal:

I propose implementing a Variable Density Context Window (VDCW) technique that selectively retains the most relevant tokens while still staying within the model's limited context window size. This approach aims to provide the model with a more extensive and meaningful context to work with, even with smaller window sizes.

To make this more concrete, I suggest dividing the context window into three sections: [Character Card], [Old Conversation], and [Recent Conversation]. This will enable us to focus on the most important aspects of the conversation while still retaining some critical context from the old conversation.

[Character Card]: This section contains essential information about the user and the AI model's identity, role, and any other relevant background information.

[Old Conversation]: This section will store a compressed version of the earlier parts of the chat history, focusing on extracting verbs and nouns from the text (from some quick experimenting, this reduces token use by ~60%), along with any critical context carried by certain tokens (e.g., 'User:', 'Miku:'). This way, we can retain important context without using too many tokens (a rough sketch of this follows after these sections).

[Recent Conversation]: This section will hold the most recent conversation, which is crucial for understanding the current context and providing accurate responses.
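To make the compression idea concrete, here is a rough C++ sketch of how the three sections could be assembled into a single prompt. It is purely illustrative: the function names are my own, and a stop-word filter stands in for real verb/noun extraction, which would need a proper part-of-speech tagger. None of this exists in llama.cpp today.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Crude stand-in for verb/noun extraction: drop common function words.
static std::string compress_old_turn(const std::string & text) {
    static const std::unordered_set<std::string> stop_words = {
        "the", "a", "an", "is", "are", "was", "were", "to", "of", "and",
        "or", "that", "this", "it", "in", "on", "for", "with", "at", "by"
    };
    std::istringstream in(text);
    std::string word, out;
    while (in >> word) {
        if (stop_words.count(word) == 0) {
            if (!out.empty()) out += ' ';
            out += word;
        }
    }
    return out;
}

// Assemble the prompt: character card verbatim, old turns compressed,
// recent turns verbatim.
std::string build_context(const std::string & character_card,
                          const std::vector<std::string> & old_turns,
                          const std::vector<std::string> & recent_turns) {
    std::string prompt = character_card + "\n\n";
    for (const auto & turn : old_turns) {
        prompt += compress_old_turn(turn) + "\n";
    }
    prompt += "\n";
    for (const auto & turn : recent_turns) {
        prompt += turn + "\n";
    }
    return prompt;
}
```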

In addition to these three sections, VDCW could also employ the following techniques in the future (god knows I probably can't code them):

Importance sampling: Sample tokens based on their relevance or importance to the current task, such as token frequency or relevance to the query (a rough sketch follows this list).

Compression: Compress context by merging or summarizing multiple tokens or phrases to convey the necessary information with fewer tokens.

Hierarchical representation: Store information hierarchically to access a wider range of information without increasing the context window size.
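As a purely illustrative sketch of the importance-sampling idea: score each word in the old history by how often it also appears in the current query and keep only the top-scoring ones. The bag-of-words scoring is just a placeholder I made up; real relevance scoring would need embeddings or attention statistics, and nothing here exists in llama.cpp.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Purely illustrative "importance sampling": keep only the history words
// that also occur in the current query, ranked by how often they occur there.
std::string sample_important_words(const std::string & history,
                                   const std::string & query,
                                   size_t max_words) {
    // count word frequencies in the query
    std::unordered_map<std::string, int> query_freq;
    std::istringstream qin(query);
    for (std::string w; qin >> w; ) {
        query_freq[w]++;
    }

    // score each history word by its query frequency
    std::vector<std::pair<int, std::string>> scored;
    std::istringstream hin(history);
    for (std::string w; hin >> w; ) {
        auto it = query_freq.find(w);
        scored.push_back({ it == query_freq.end() ? 0 : it->second, w });
    }

    // keep the max_words highest-scoring words
    std::stable_sort(scored.begin(), scored.end(),
                     [](const auto & a, const auto & b) { return a.first > b.first; });
    if (scored.size() > max_words) {
        scored.resize(max_words);
    }

    std::string out;
    for (const auto & sw : scored) {
        if (!out.empty()) out += ' ';
        out += sw.second;
    }
    return out;
}
```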

I'm looking forward to hearing your thoughts and insights on this proposal. If you think this idea has potential, we can start discussing the implementation details and potential roadblocks. If you have any concerns or alternative suggestions, please feel free to share them. Let's work together to enhance LLaMA's context handling capabilities!

*A lot of this was summarized from my conversations with the help of ChatGPT.

chrfalch commented 1 year ago

This is already implemented with the n_keep parameter and the current infinite-context handling: the "character card" is kept (using n_keep), and the rest of the history is split in two, with the newest part kept when moving forward.
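Roughly, the swap works like this (a simplified sketch of the idea, not the actual llama.cpp code; the real implementation also has to re-evaluate the remaining tokens after the swap, which is omitted here):

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Simplified context swap: once the token buffer reaches n_ctx, keep the
// first n_keep tokens (e.g. the character card), drop the oldest half of
// the remaining history, and continue generating with what is left.
void context_swap(std::vector<llama_token> & tokens, int n_ctx, int n_keep) {
    if ((int) tokens.size() < n_ctx) {
        return; // still room in the context window
    }
    const int n_left    = (int) tokens.size() - n_keep; // history after the kept prefix
    const int n_discard = n_left / 2;                   // oldest half of that history

    tokens.erase(tokens.begin() + n_keep,
                 tokens.begin() + n_keep + n_discard);
}
```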

The problem with adding things like summarization is that the context is a bit too small to have room for these techniques.

GeorgeUCB commented 1 year ago

Then it would appear the behavior I'm seeing with every model I use is a bug. I posted about this in the discussion https://github.com/ggerganov/llama.cpp/discussions/645 . Basically, all my chats past a certain number of tokens appear to lose all context, including the character card. After that point there is also a 50/50 chance that they just start repeating the same thing over and over again. At first I thought it was a memory issue within WSL, but I've since assigned it 80 GB of system memory to no benefit.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.