byroneverson / llm.cpp

Fork of llama.cpp, extended for GPT-NeoX, RWKV-v4, and Falcon models
MIT License

Context not carried over in chat #4

Closed · mashdragon closed this issue 1 year ago

mashdragon commented 1 year ago

I am not sure if this is how it is supposed to work, but no context is carried over to subsequent messages, for example when running ./scripts/chat-pythia-12B.sh.

You can carry the context over yourself by manually inserting the <|endoftext|><|assistant|> tokens. I don't think they are treated as true tokens when you do this, since the assistant's responses can sometimes reproduce the literal <|endoftext|> and <|assistant|> strings. But I think the goal is to have a coherent dialogue automatically?

For example:

> Remember this phrase: Cindy is a farmer.

This phrase means that the person named Cindy is a farmer who is skilled at farming and can grow various types of plants and fruits.

> What is Cindy's profession?

I'm sorry, I don't know who Cindy is. Could you please provide more context or details about the person or the work you're referring to? It would be helpful for me to understand what you mean by "Cindy's profession".

> Remember this phrase: Cindy is a farmer.<|endoftext|><|assistant|>Cindy is a farmer.<|endoftext|><|prompter|>What is Cindy's profession?

He is a farmer.
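
In code, the workaround above amounts to stitching the previous turns into each new prompt with the Open Assistant marker strings. A minimal sketch, assuming a hypothetical `Turn`/`build_prompt` helper (these names are illustrative and not part of llm.cpp):

```cpp
#include <string>
#include <vector>

struct Turn {
    std::string prompter;   // what the user typed
    std::string assistant;  // what the model answered
};

// Prepend earlier turns, delimited with Open Assistant marker strings,
// then leave the assistant marker open so the model continues from there.
std::string build_prompt(const std::vector<Turn>& history,
                         const std::string& new_query) {
    std::string p;
    for (const auto& t : history) {
        p += "<|prompter|>"  + t.prompter  + "<|endoftext|>";
        p += "<|assistant|>" + t.assistant + "<|endoftext|>";
    }
    p += "<|prompter|>" + new_query + "<|endoftext|><|assistant|>";
    return p;
}
```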

byroneverson commented 1 year ago

I am currently working on context-based short-term memory to be used with llama/gptneox-based models. It's pretty much done, so I should have a test version pushed in less than 12 hours. Testing it with the Open Assistant models is a little tricky, as they went out of their way to train the models to believe they specifically have no short-term memory, only the information available to them during training. So the model will tell me it does not have a memory, and then proceed to use the context memory properly if I ask it to take a guess about something I said previously.

The way I have it implemented, it keeps adding queries and responses to the context up to the context length (usually 2048 or 4096, depending on the model). Once the context is full, it purges the oldest query or response to make room for new tokens as needed. In practice it takes a fairly long stretch of normal conversation before the context actually fills up and needs to be purged.
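
Roughly, the purge logic works like this. This is a minimal sketch of the idea, not the actual llm.cpp code; the type and member names are assumptions:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

using token = int;  // llama.cpp-style integer token ids

struct ContextMemory {
    std::size_t max_tokens;                // e.g. 2048 or 4096, model-dependent
    std::deque<std::vector<token>> turns;  // each entry: one query or response
    std::size_t total = 0;                 // tokens currently held

    explicit ContextMemory(std::size_t n) : max_tokens(n) {}

    // Append a tokenized query or response, purging the oldest
    // turns until the new one fits in the context window.
    void add_turn(std::vector<token> toks) {
        while (!turns.empty() && total + toks.size() > max_tokens) {
            total -= turns.front().size();
            turns.pop_front();             // purge oldest query/response
        }
        total += toks.size();
        turns.push_back(std::move(toks));
    }

    // Flatten the remembered turns into one token sequence
    // to feed to the next eval call.
    std::vector<token> build_context() const {
        std::vector<token> out;
        out.reserve(total);
        for (const auto& t : turns)
            out.insert(out.end(), t.begin(), t.end());
        return out;
    }
};
```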

byroneverson commented 1 year ago

main-oasst has been updated with basic context memory; running the chat scripts for the Open Assistant models should now work just fine in terms of short-term memory.