iactix opened this issue 1 year ago
I have now tried this with echo=True and it seems to me that echo doesn't even work with streaming. That would make this more serious than I thought.
@iactix can you share some prompts / models so I can repro this? Yes there's currently a bug, but that impacts non-utf-8-printable tokens in streaming mode. As for echo though, that's just an OpenAI compatibility idiosyncrasy; their API afaik does not support echo for streamed outputs.
from llama_cpp import Llama

model = "F:/models/airoboros-7b-gpt4-1.2.ggmlv3.q3_K_M.bin"
prompt = "USER:\nWrite 5 emojis, each followed by a short description\nASSISTANT:\n"

llm = Llama(model_path=model, n_ctx=512, last_n_tokens_size=256, n_threads=4, n_gpu_layers=0)

result = llm.create_completion(prompt, repeat_penalty=1.1, max_tokens=256, stop=["USER:", "ASSISTANT:"], echo=False, temperature=0, mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1)
print("stream = False")
print(result['choices'][0]['text'])

stream = llm.create_completion(prompt, stream=True, repeat_penalty=1.1, max_tokens=256, stop=["USER:", "ASSISTANT:"], echo=False, temperature=0, mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1)
result = ""
for output in stream:
    result += output['choices'][0]['text']
print("stream = True")
print(result)
Without streaming:
With streaming:
Edit: Oh, I meant to set mirostat_eta to 0.1, but hey, the example with 1.1 works.
I think this is the same issue as the one below. There is a problem with decoding multi-byte characters in streaming mode. https://github.com/abetlen/llama-cpp-python/issues/286
Having a look-see, it seems to me that the problem is calling .decode("utf-8", errors="ignore") on a single token's bytes: when stream=True, completion chunks are yielded per token, and since Unicode characters are often composed of multiple tokens, the UTF-8 decode fails. A naïve solution would be to include the raw tokens alongside the decoded text, and to allow the caller to handle what to do with the raw tokens, if anything.
https://github.com/abetlen/llama-cpp-python/blob/3e7eae479631890196823324e0573416408f52a0/llama_cpp/llama.py#L1047 https://github.com/abetlen/llama-cpp-python/blob/3e7eae479631890196823324e0573416408f52a0/llama_cpp/llama.py#L1063
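To illustrate the failure mode (a minimal, model-free example, not from the original comment): the bytes of a single emoji are split across two chunks, the way a tokenizer may split a multi-byte character across tokens, and errors="ignore" silently drops both halves.

# Hypothetical illustration of the per-token decode problem: the four bytes of
# one emoji are split across two chunks, as a tokenizer may split them across tokens.
emoji_bytes = "🤗".encode("utf-8")            # b'\xf0\x9f\xa4\x97'
chunks = [emoji_bytes[:2], emoji_bytes[2:]]   # neither half is valid UTF-8 on its own
decoded = [c.decode("utf-8", errors="ignore") for c in chunks]
print(decoded)            # ['', ''] - both halves are silently dropped
print("".join(decoded))   # ''       - the emoji never reaches the stream consumer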
A naïve solution would be to include the raw tokens alongside the decoded text, and to allow the caller to handle what to do with the raw tokens, if anything.
A less naive solution would be to have a tiny internal buffer that accumulates output until a full utf-8 sequence can be decoded, and only then transfers it to the actual output and triggers a streaming update. UTF-8 encoding should make it possible to differentiate between a character that is still incomplete and one that is somehow invalid. That way it could be ensured that a broken character cannot affect generation after that.
a tiny internal buffer that accumulates output until a full utf-8 sequence can be decoded, and only then transfers it to the actual output and triggers a streaming update.
While this would work, I'm concerned as to what happens when max_tokens is hit mid-Unicode-character 🤔. Probably not too big of an issue, as we can ask for more tokens later.
Looking at the different error handling modes of bytes.decode() to maybe understand the problem more: it does not look like changing the error handling for bytes.decode() will provide an easy fix. Though we do see that the mode 'strict' will give us a very clear signal as to when the decode has explicitly failed, via a raised UnicodeError exception.
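A quick demonstration of those modes on a truncated multi-byte sequence (my own illustration, not from the original comment). Note that the UnicodeDecodeError raised by 'strict' is a subclass of both UnicodeError and ValueError, which is what the sketch below relies on.

# Truncated UTF-8 sequence: only the first two of the emoji's four bytes.
partial = "🤗".encode("utf-8")[:2]   # b'\xf0\x9f'

print(repr(partial.decode("utf-8", errors="ignore")))    # '' - bytes silently dropped
print(repr(partial.decode("utf-8", errors="replace")))   # replacement character(s) instead of the emoji
try:
    partial.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:                         # subclass of UnicodeError and ValueError
    print(exc)   # "unexpected end of data" - a clear signal the character is incomplete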
A very simplified way to implement the buffer might be like:
# this is an oversimplified and contrived example to illustrate how I
# think this should work
def create_completion(self, prompt: list[int], stream: bool = True, ...):
    ...
    buffer: list[int] = []
    # main token generation loop
    for token in self.generate(...):
        # handle end-of-sequence token
        if token == self.token_eos():
            break
        # expand buffer
        buffer.append(token)
        try:
            # strict error handling to generate `UnicodeError` and/or
            # `ValueError`.
            yield self.detokenize(buffer).decode("utf-8", errors="strict")
        except (UnicodeError, ValueError):
            # if exception: then we have an incomplete character and need
            # more tokens to complete it, so continue the loop generating
            # tokens.
            continue
        else:
            # if no exception: clear buffer.
            buffer = []
    # handle end of generation with non-empty buffer:
    if len(buffer) > 0:
        # I don't know what type of error handling should be used here;
        # 'strict' will likely raise exceptions, but 'ignore' will give no
        # output and/or bad output.
        yield self.detokenize(buffer).decode("utf-8", errors="strict")
        buffer = []
    ...
This implementation has a few problems:
- It changes how create_completion performs: chunks are only yielded once a complete UTF-8 sequence has accumulated, no longer strictly one token at a time.
- A multi-byte character can still be cut off by the max_token limit or the end of the context window, leaving an incomplete sequence in the buffer.

The last issue can be alleviated or even outright fixed by implementing a LogitsProcessor that looks back 4 or so tokens for 1-byte tokens that indicate the start of a Unicode character, and limits the logits to tokens that correctly complete the indicated encoding length.
As an example:
- The model generates a token whose byte matches the pattern 110x xxxx (first byte for a 2-byte character).
- Per the utf-8 encoding, the very next token must be 1 byte long and follow the pattern 10xx xxxx (the continuation encoding).
- The processor therefore restricts the logits for the next token to tokens matching 10xx xxxx.
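Just to make the idea concrete, here is a very rough, untested sketch of such a processor (my own illustration, not code from the library). It assumes the llama-cpp-python logits-processor shape of a callable taking (input_ids, scores) and returning modified scores, only looks one token back instead of four, and only handles the simple case of single-byte tokens:

import numpy as np
from llama_cpp import Llama

def build_utf8_continuation_processor(llm: Llama):
    # Precompute, per token id, the raw bytes it detokenizes to, and whether it
    # is a single UTF-8 continuation byte (bit pattern 10xx xxxx).
    token_bytes = [llm.detokenize([t]) for t in range(llm.n_vocab())]
    is_continuation = np.array(
        [len(b) == 1 and (b[0] & 0b1100_0000) == 0b1000_0000 for b in token_bytes]
    )

    def needs_continuation(b: bytes) -> bool:
        # True if the last byte of the previous token is a UTF-8 lead byte
        # (110x xxxx, 1110 xxxx or 1111 0xxx), meaning more bytes must follow.
        return bool(b) and (b[-1] & 0b1100_0000) == 0b1100_0000

    def processor(input_ids, scores):
        if len(input_ids) and needs_continuation(token_bytes[int(input_ids[-1])]):
            # Mask out every token that is not a single continuation byte.
            scores = np.where(is_continuation, scores, -np.inf)
        return scores

    return processor

Presumably something like this would be wrapped in a LogitsProcessorList and passed to create_completion via its logits_processor argument, but I haven't verified that against the library.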
Anyhow, whatever method is decided upon, I just hope it's effective.
So, is it just not important to have working streaming? Or should we write more solutions? Should I stop implementing future support for streaming in my app because it will never be fixed anyway? It would save me some trouble trying to wrap this into a process to work around the GPU memory leak that will probably never be fixed either.
While I'm not confident enough in my abilities to make a pull request fixing this in the library implementation, I am putting the final touches on a simple re-implementation of create_completion()
that doesn't have the Unicode streaming problem. Idk if it will be done today or tomorrow as I'm on my anemic laptop for the next week.
Here is my implementation of a create_completion()
method that is capable of outputting multi-token utf-8 encoded text in a streaming fashion. I tried to document as much of what is going on internally as I could, maybe to an excessive degree. (I guess I'm compensating for how poorly commented llama-cpp-python is at the moment)
The goal with the structure of the code is to break up each step in the process of generating the tokens, checking for stoppage, and decoding the tokens into its own small generator sub-function, to make each step easier to understand. This leads to some mild buffer shenanigans, but it's not too difficult to understand in each sub-function.
I have not run this exact code, but I did run these prompts and got the noted responses, and made sure to include responses that had multi-token characters in them.
from llama_cpp import Llama
from test_create_completion import create_completion  # or whatever you call the file

model = Llama( ... )
stop_strings = ["Q:", "A:", "###"]

# Q: "Describe a turtle."
prompt1 = "Q: 거북이를 묘사하십시오.\nA:".encode('utf-8')
for chunk in create_completion(model, model.tokenize(prompt1), stop_strings):
    print(str(chunk), end="", flush=True)
# 쉽친다지다그.
# translation from google: "It's easy to do that."

prompt2 = "Q: What is your favorite emoji?\nA:".encode('utf-8')
for chunk in create_completion(model, model.tokenize(prompt2), stop_strings):
    print(str(chunk), end="", flush=True)
# 🤗
If you have any questions I'll be glad to answer them, but I'm not going to act as support about it.
@iactix Did my code help at all?
I wasn't able to try it yet. Note that I am not involved in working on llama-cpp-python, at least not yet. Have you tried making a pull request for your fix?
Have you tried making a pull request for your fix?
Unfortunately my code isn't compatible with the base library. llama-cpp-python has a rather complex and monolithic implementation of create_completion
that I don't fully understand as it is difficult to grok why certain things are done. I have a mild idea on how I can go about fixing the problem in a pull request, but I'm unsure of it being fit to the library requirements. Mainly the goal of being compatible with the OpenAI API since I have no clue about its intricacies.
My code is more of a functional demonstration of how a fix could be implemented. It is what I am currently using, and hopefully it is usable enough as a temporary alternative. It should be relatively trivial to wrap my create_completion
implementation to be a fully compatible replacement (I think, my code is more fit for my own purposes than for drop-in replacement).
I also did a rather straightforward fix myself by simply removing all these .decode calls, making it yield bytes instead of str all the time. This allows me to handle the UTF-8 decoding properly on the caller side.
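For illustration only (my sketch, not the commenter's actual code): with a modified create_completion that yields raw bytes, the caller can use Python's incremental UTF-8 decoder, which holds back incomplete sequences until the missing bytes arrive. The exact chunk structure below is an assumption.

import codecs

# Hypothetical caller-side handling, assuming `stream` yields chunks whose
# 'text' field is bytes rather than str.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
text = ""
for chunk in stream:
    piece = chunk['choices'][0]['text']   # bytes instead of str
    text += decoder.decode(piece)         # incomplete sequences stay buffered
text += decoder.decode(b"", final=True)   # flush whatever is left at the end
print(text)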
I was looking through the commits and it looks like it might be fixed. https://github.com/abetlen/llama-cpp-python/blob/18337267c175f88c46fbad079a7c18da57e1b520/llama_cpp/llama.py#L1069-L1075
I have not tested it yet though.
Implemented streaming in my solution, got no emojis, then updated to 0.1.84, repeated the exact situation, now I'm getting emojis! 🥳
Probably needs some more proper testing but it seems fine so far!
Has this issue been resolved?
I am getting a strange blank character at the start of my responses. From the other issues I've seen, they center around emojis and multi-byte foreign-language characters, but the comments say it's fixed while the issues remain open...
I used the latest release version, but I still get display errors in Chinese when I use a Llama 2 code model:
llm = Llama.from_pretrained(
    repo_id=args.model_name_or_path,
    chat_format="llama-2",
    filename="phind-codellama-34b-v2.Q6_K.gguf",
    n_ctx=12000,
    tokenizer=tokenizer,
    n_gpu_layers=-1,
    verbose=False
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": """You are an AI that follows instructions extremely well.
Help as much as you can. Remember, response in Chinese.
"""},
        {
            "role": "user",
            "content": user_prompt
        }
    ],
    stream=True,
    mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1
)

start_time = time.time()
bot_message = ''
print('Human:', history[-1][0])
print('Assistant: ', end='', flush=True)
full_content = ""
for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'role' in delta:
        print(delta['role'], end=': ')
    elif 'content' in delta:
        # Encode and then decode the content to avoid garbled characters
        content_encoded = delta['content'].encode('utf-8')
        content_decoded = content_encoded.decode('utf-8')
I have noticed a very weird change when I wanted to make use of streaming. Before, I was not using it, and basically all conversation models tended to start their message with an emoji. Why the models are so fixated on starting the message that way is unclear to me, but the emojis clearly made sense and represented fitting emotions and such.
When I now tried to integrate streaming, I noticed the first few output chunks are empty, and now I don't see the model use emojis at all anymore. Certainly not at the start; there are only superfluous spaces, and fewer spaces than there were empty generation chunks before the text.
This leads me to the suspicion that the chunking for streaming is breaking up Unicode characters that are generated from multiple tokens and cannot be converted from a byte buffer to a string individually. That causes a broken result, since concatenating the outputs does not give you back the correct Unicode symbol.
Might there be something to this? Or am I doing something wrong?
Edit: I am not using "echo"; I imagine that would somewhat fix it. It probably shouldn't have this effect one way or the other, though.