abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Potential problem with streaming and unicode #372

Open iactix opened 1 year ago

iactix commented 1 year ago

I noticed a very weird change when I started making use of streaming. Before, without streaming, basically all conversation models tended to start their message with an emoji. Why the models are so fixated on starting their messages that way is unclear to me, but the emojis clearly made sense and represented fitting emotions and such.

Now that I have tried to integrate streaming, I notice that the first few output chunks are empty and the model doesn't seem to use emojis at all anymore. Certainly not at the start; there are only superfluous spaces, and fewer spaces than there were empty generation chunks before the text.

This leads me to suspect that the chunking for streaming is breaking up unicode characters that are generated from multiple tokens and cannot be converted from a byte buffer to a string individually. That produces a broken result, since concatenating the decoded outputs does not give you back the correct unicode symbol.

Might there be something to this? Or am I doing something wrong?

Edit: I am not using "echo"; I imagine that would somewhat work around it. Either way, it probably shouldn't have this effect.

iactix commented 1 year ago

I have now tried this with echo=True and it seems to me that echo doesn't even work with streaming. That would make this more serious than I thought.

abetlen commented 1 year ago

@iactix can you share some prompts / models so I can repro this? Yes, there's currently a bug, but it impacts non-utf-8-printable tokens in streaming mode. As for echo, that's just an OpenAI compatibility idiosyncrasy: as far as I know, their API does not support echo for streamed outputs.

iactix commented 1 year ago

from llama_cpp import Llama

model = "F:/models/airoboros-7b-gpt4-1.2.ggmlv3.q3_K_M.bin"
prompt = "USER:\nWrite 5 emojis, each followed by a short description\nASSISTANT:\n"

llm = Llama(model_path=model, n_ctx=512, last_n_tokens_size=256, n_threads=4, n_gpu_layers=0)
result = llm.create_completion(prompt, repeat_penalty=1.1, max_tokens=256, stop=["USER:", "ASSISTANT:"], echo=False, temperature=0, mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1)
print("stream = False")
print(result['choices'][0]['text'])

stream = llm.create_completion(prompt, stream=True, repeat_penalty=1.1, max_tokens=256, stop=["USER:", "ASSISTANT:"], echo=False, temperature=0, mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1)
result = ""
for output in stream:
    result += output['choices'][0]['text']
print("stream = True")
print(result)

Without streaming:

  1. 😂 - Laughing face
  2. 🤑 - Money bag
  3. 🍕 - Pizza slice
  4. 🌲 - Tree
  5. 🏃‍♂️ - Running man

With streaming:

    • Laughing face
    • Money bag
    • Pizza slice
    • Tree
  1. ‍♂️ - Running man

Edit: Oh, I meant to set mirostat_eta to 0.1, but hey, the example with 1.1 works.

nai-kon commented 1 year ago

I think this is the same issue as the one below: there is a problem with decoding multi-byte characters in streaming mode. https://github.com/abetlen/llama-cpp-python/issues/286

tarpeyd12 commented 1 year ago

Having a look-see, it seems to me that the problem is calling .decode("utf-8", errors="ignore") on a single token's bytes: when stream=True the completion yields one chunk per token, and since Unicode characters are often composed of multiple tokens, the utf-8 decode fails. A naïve solution would be to include the raw tokens alongside the decoded text, and to allow the caller to handle what to do with the raw tokens, if anything.

https://github.com/abetlen/llama-cpp-python/blob/3e7eae479631890196823324e0573416408f52a0/llama_cpp/llama.py#L1047 https://github.com/abetlen/llama-cpp-python/blob/3e7eae479631890196823324e0573416408f52a0/llama_cpp/llama.py#L1063
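
To make the failure mode concrete, here is a minimal standalone demonstration (plain Python, not llama-cpp-python code) of why decoding token fragments individually with errors="ignore" silently drops multi-byte characters, while decoding the concatenated bytes works:

```python
emoji_bytes = "😂".encode("utf-8")                  # b'\xf0\x9f\x98\x82'
token_pieces = [emoji_bytes[:2], emoji_bytes[2:]]   # pretend the emoji spans two tokens

# Each piece on its own is an invalid UTF-8 fragment, so errors="ignore"
# turns both halves into empty strings.
print([piece.decode("utf-8", errors="ignore") for piece in token_pieces])  # ['', '']

# Decoding the concatenated bytes recovers the character.
print(b"".join(token_pieces).decode("utf-8"))       # 😂
```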

iactix commented 1 year ago

A naïve solution would be to include the raw tokens alongside the decoded text, and to allow the caller to handle what to do with the raw tokens, if anything.

A less naive solution would be to have a tiny internal buffer that accumulates output until a full utf-8 sequence can be decoded, and only then transfer it to the actual output and trigger a streaming update. UTF-8 encoding makes it possible to differentiate between a character that is still incomplete and one that is outright invalid, so it could be ensured that a broken character cannot affect anything generated after it.
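
A rough sketch of that buffering idea, operating on raw detokenized bytes. Note that checking UnicodeDecodeError.reason for the "unexpected end of data" message relies on CPython's error text, so treat this as an illustration rather than a hardened implementation:

```python
def split_complete_utf8(buf: bytes) -> tuple[str, bytes]:
    """Return (text that decodes cleanly, trailing bytes of a still-incomplete character)."""
    try:
        return buf.decode("utf-8"), b""
    except UnicodeDecodeError as exc:
        if exc.reason == "unexpected end of data":
            # The buffer ends mid-character: emit the valid head, keep buffering the tail.
            return buf[:exc.start].decode("utf-8"), buf[exc.start:]
        # Genuinely invalid bytes: replace them rather than buffering forever.
        return buf.decode("utf-8", errors="replace"), b""

# split_complete_utf8(b"\xf0\x9f")             -> ("", b"\xf0\x9f")   still incomplete
# split_complete_utf8(b"\xf0\x9f\x98\x82 hi")  -> ("😂 hi", b"")
```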

tarpeyd12 commented 1 year ago

a tiny internal buffer that accumulates output until a full utf-8 sequence can be decoded, and only then it will transfer that to the actual output and trigger a streaming update.

While this would work, I'm concerned as to what happens when max_tokens is hit mid-unicode character 🤔. Probably not too big of an issue as we can ask for more tokens later.

Looking at the different error handling modes of bytes.decode() to maybe understand the problem more:

Links to the Python docs:

- `bytes.decode()`: https://docs.python.org/3/library/stdtypes.html#bytes.decode
- `error-handlers`: https://docs.python.org/3/library/codecs.html#error-handlers

| Mode | Description | Concern |
| :--- | :--- | :--- |
| `'strict'` | Raise `UnicodeError` (or a subclass); this is the default. | Can be used to identify when to start and stop accumulating to an internal buffer via a try-except block. |
| `'ignore'` | Ignore the malformed data and continue without further notice. | The current way llama-cpp-python decodes tokens when `stream=True`. Outputs an empty string when given an incomplete unicode character, as we have seen. |
| `'replace'` | Replace with a replacement marker. On encoding, use `?` (ASCII character). On decoding, use `�` (U+FFFD, the official REPLACEMENT CHARACTER). | Potentially useful in the same way `'strict'` is, but would have false positives when checking for `?` and `�`, as they can occur elsewhere. May require the full unicode character; I have not checked. |
| `'backslashreplace'` | Replace with backslashed escape sequences. On encoding, use the hexadecimal form of the Unicode code point with formats `\xhh` `\uxxxx` `\Uxxxxxxxx`. On decoding, use the hexadecimal form of the byte value with format `\xhh`. | Requires the full unicode character in the first place, so it does not seem useful. |
| `'surrogateescape'` | On decoding, replace each byte with an individual surrogate code ranging from `U+DC80` to `U+DCFF`. This code is turned back into the same byte when the `'surrogateescape'` error handler is used when encoding the data. | Requires the full unicode character in the first place, so it does not seem useful. |

It does not look like changing the error handling for bytes.decode() will provide an easy fix. Though we do see that the mode 'strict' will give us a very clear signal as to when the decode has explicitly failed via a raised UnicodeError exception.

A very simplified way to implement the buffer might look like this:

# this is an oversimplified and contrived example to illustrate how I
# think this should work
def create_completion(self, prompt: list[int], stream: bool = True, ...):
    ...

    buffer: list[int] = []

    # main token generation loop
    for token in self.generate(...):

        # handle End token
        if token == self.token_eos():
            break

        # expand buffer
        buffer.append(token)

        try:
            # strict error handling to generate `UnicodeError` and/or
            # `ValueError`.
            yield self.detokenize(buffer).decode("utf-8", errors="strict")

        except (UnicodeError, ValueError):
            # if exception: then we have an incomplete character and need
            # more tokens to complete it, so continue the loop generating
            # tokens.
            continue

        else:
            # if no exception: clear buffer.
            buffer = []

    # handle end of generation with non-empty buffer:
    if len(buffer) > 0:
        # I don't know what type of error handling should be used here,
        # 'strict' will likely raise exceptions but 'ignore' will give no
        # and/or bad output.
        yield self.detokenize(buffer).decode("utf-8", errors="strict")
        buffer = []
    ...

This implementation has a few problems:

The last issue can be alleviated or even outright fixed by implementing a LogitsProcessor that looks back four or so tokens for tokens that are 1 byte long and indicate the start of a unicode character, and limits the logits to tokens that correctly complete the indicated encoding length (a sketch of the byte classification involved follows the example below).

As an example:

  1. If the last output token decodes to 1 byte in length and has the pattern 110x xxxx (the first byte of a 2-byte character),
  2. then, per utf-8 encoding, the very next token must be 1 byte long and follow the pattern 10xx xxxx (the continuation encoding).
  3. And since the llm decided it wanted a 2-byte character, we then disallow the token after that 2-byte character from being 1 byte long and following the pattern 10xx xxxx.
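
A small helper, sketched here as ordinary Python rather than a full LogitsProcessor, shows the byte classification such a processor would rely on (the function names are made up for illustration):

```python
def utf8_expected_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence whose first byte is `lead` (0 = continuation byte)."""
    if lead < 0x80:
        return 1    # 0xxx xxxx: single-byte ASCII
    if lead < 0xC0:
        return 0    # 10xx xxxx: continuation byte, not a valid start
    if lead < 0xE0:
        return 2    # 110x xxxx: start of a 2-byte character
    if lead < 0xF0:
        return 3    # 1110 xxxx: start of a 3-byte character
    return 4        # 1111 0xxx: start of a 4-byte character
                    # (0xF5-0xFF are not valid UTF-8 lead bytes; a real processor should reject them)

def is_continuation(byte: int) -> bool:
    return 0x80 <= byte < 0xC0  # 10xx xxxx

# e.g. utf8_expected_length("é".encode("utf-8")[0]) == 2
```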

Anywho, whatever method is decided upon, I just hope it's effective.

iactix commented 1 year ago

So, is it just not important to have working streaming? Or should we write more solutions? Should I stop implementing future support for streaming in my app because it will never be fixed anyway? It would save me some trouble trying to wrap this into a process to work around the GPU memory leak that will probably never be fixed either.

tarpeyd12 commented 1 year ago

So, is it just not important to have working streaming? Or should we write more solutions? Should I stop implementing future support for streaming in my app because it will never be fixed anyway? It would save me some trouble trying to wrap this into a process to work around the GPU memory leak that will probably never be fixed either.

While I'm not confident enough in my abilities to make a pull request fixing this in the library implementation, I am putting the final touches on a simple re-implementation of create_completion() that doesn't have the Unicode streaming problem. Idk if it will be done today or tomorrow as I'm on my anemic laptop for the next week.

tarpeyd12 commented 1 year ago

Here is my implementation of a create_completion() method that is capable of outputting multi-token utf-8 encoded text in a streaming fashion. I tried to document as much of what is going on internally as I could, maybe to an excessive degree. (I guess I'm compensating for how poorly commented llama-cpp-python is at the moment)

The goal of the code's structure is to break each step of the process (generating the tokens, checking for stoppage, and decoding the tokens) into its own small generator sub-function, so that each step is easier to understand. This leads to some mild buffer shenanigans, but it's not too difficult to follow within each sub-function.

Usage examples

I have not run this exact code, but I did run these prompts and got the noted responses, and I made sure to include responses that had multi-token characters in them.

from llama_cpp import Llama
from test_create_completion import create_completion  # or whatever you call the file

model = Llama( ... )

stop_strings = ["Q:", "A:", "###"]

# Q: "Describe a turtle."
prompt1 = "Q: 거북이를 묘사하십시오.\nA:".encode('utf-8')
for chunk in create_completion(model, model.tokenize(prompt1), stop_strings):
    print(str(chunk), end="", flush=True)
# 쉽친다지다그.
# translation from google: "It's easy to do that."

prompt2 = "Q: What is your favorite emoji?\nA:".encode('utf-8')
for chunk in create_completion(model, model.tokenize(prompt2), stop_strings):
    print(str(chunk), end="", flush=True)
# 🤗

Code

If you have any questions I'll be glad to answer them, but I'm not going to act as support about it.

Click to Expand ```python """An experimental module that aims to provide a simple and readable re-implementation of llama-cpp-python's `Llama.create_completion()` that is capable of streaming complex multi-token utf-8 encoded characters. Code is provided "As is", you are responsible for ensuring proper functionality and safety before use. """ __author__ = "github.com/tarpeyd12" __copyright__ = "Copyright 2023" __date__ = "2023/07/19" __license__ = "MIT" # covering my butt against foolhardy copy-pasters import sys from dataclasses import dataclass from typing import Any, Generator, Iterable, Literal, Mapping, Sequence from llama_cpp import Llama, LogitsProcessor, LogitsProcessorList, StoppingCriteria, StoppingCriteriaList __all__ = ["create_completion", "ChunkContent", "save_cache", "load_cache"] @dataclass(slots=True) class ChunkContent: """Dataclass that contains the utf-8 encoded text of a given sequence of \ tokens, as well as those tokens in sequence. :: ChunkContent("", [], "") # Invalid ChunkContent(" A B C" ..., [], "") # Invalid ChunkContent(" A B C" ..., [319, 350, 315, ...], "") # Valid ChunkContent("", [319, 350, 315, ...], "") # Valid """ text: str tokens: list[int] stop_reason: Literal["", "stop", "length"] def __str__(self) -> str: return self.text def save_cache(llama_model: Llama, tokens: Sequence[int]) -> bool: """Saves the `tokens` sequence to `llama_model`'s cache. Copied from `llama_cpp.Llama._create_completion()`. https://github.com/abetlen/llama-cpp-python/blob/36872620d03dee77117c34699aa007b81bb4e319/llama_cpp/llama.py#L1136 Args: llama_model (Llama): The llama-cpp-python Llama model to save to. tokens (Sequence[int]): The sequence of tokens to save. Returns: bool: `True` is returned when the cache is saved to. `False` if the \ save failed or the `llama_model` given does not have a cache. """ if llama_model.cache: if llama_model.verbose: print("Llama._create_completion: cache save", file=sys.stderr) llama_model.cache[tokens] = llama_model.save_state() if llama_model.verbose: print("Llama._create_completion: cache saved", file=sys.stderr) return True return False def load_cache(llama_model: Llama, prompt_tokens: Sequence[int]) -> bool: """Checks the cache of the given `llama_model` for the token sequence \ `prompt_tokens`, and loads loads the saved context from the cache if \ a hit is made. Copied from `llama_cpp.Llama._create_completion()` https://github.com/abetlen/llama-cpp-python/blob/36872620d03dee77117c34699aa007b81bb4e319/llama_cpp/llama.py#L872 Args: llama_model (Llama): The Llama model to check the cache of. prompt_tokens (Sequence[int]): The token sequence to search for in the \ cache. Returns: bool: `True` if the cache was hit and context loaded into \ `llama_model`. `False` if the cache was missed and not context \ loaded, or `llama_model` given does not have a cache. 
""" if llama_model.cache: try: cache_item = llama_model.cache[prompt_tokens] cache_prefix_len = Llama.longest_token_prefix(cache_item.input_ids.tolist(), prompt_tokens) eval_prefix_len = Llama.longest_token_prefix(llama_model._input_ids.tolist(), prompt_tokens) if cache_prefix_len > eval_prefix_len: llama_model.load_state(cache_item) if llama_model.verbose: print("Llama._create_completion: cache hit", file=sys.stderr) return True except KeyError: if llama_model.verbose: print("Llama._create_completion: cache miss", file=sys.stderr) return False return False class _Stopper: """A `StoppingCriteria` compatible `Callable` class that returns `True` \ when the given token sequence contains the given stop strings. Not intended to be used externally. """ def __init__(self, stop: Sequence[str] | str, tokenizer: Llama) -> None: """Initialize the stopper object with the stop sequences and tokenizer. Args: stop (Sequence[str] | str): The string or set of strings to stop \ generation on. tokenizer (Llama): The Llama model to use as a tokenizer. \ `detokenize()` is the only method used. """ stop = (stop if isinstance(stop, list) else [stop] if isinstance(stop, str) else []) self.__stop_sequences = sorted([bytes(seq, encoding="utf-8") for seq in stop], key=lambda b: len(b)) self.__tokenizer = tokenizer self.__max_sequence_len: int = max((len(s) for s in self.__stop_sequences)) self.__found_sequence: bytes | None = None def __call__(self, input_ids: list[int], scores: list[float] = []) -> bool: """Checks if the given token sequence `input_id`'s contains the stop \ sequences. Args: input_ids (list[int]): The sequence of token to check. scores (list[float], optional): Ignored. Only present for \ compatibility with `StoppingCriteria`. Defaults to []. Returns: bool: `True` if the stop sequence is present in `input_ids`. \ `False` if not. """ # make sure we only see what we need tokens = input_ids[-(self.__max_sequence_len + 1):] # reset the found sequence since we are searching again self.__found_sequence = None byte_string = b"" for seq in self.__stop_sequences: # expand the byte string to be just a bit larger than the sequence # we are comparing against while len(byte_string) < len(seq) + 1 and len(tokens) > 0: byte_string = self.__tokenizer.detokenize([tokens.pop()]) + byte_string if seq in byte_string: # we got one! self.__found_sequence = seq # save which one we got # end the search with a positive result return True # None found return False @property def max_sequence_len(self) -> int: """The number of bytes of the longest stop sequence.""" return self.__max_sequence_len @property def found_sequence(self) -> bytes | None: """The bytes of the sequence found, `None` if no sequence found.""" return self.__found_sequence def create_completion( llama_model: Llama, prompt_tokens: Sequence[int], stop: Sequence[str], max_tokens: int = 128, sampler: Mapping[str, Any] = {"top_k": 40, "top_p": 0.95, "temp": 0.8, "repeat_penalty": 1.1}, logit_processors: Sequence[LogitsProcessor] | LogitsProcessorList | None = None, stopping_criteria: Sequence[StoppingCriteria] | StoppingCriteriaList | None = None ) -> Generator[ChunkContent, None, None]: """Generate a stream of chunks that complete the given prompt tokens. Example:: stream = create_completion( model, model.tokenize(b"Q: What is your favorite Emoji?\\nA:"), ["Q:", "A:", "###"] ) for chunk in stream: print(str(chunk), end="", flush=True) Args: llama_model (Llama): the Llama model to use. prompt_tokens (Sequence[int]): The prompt as a sequence of tokens. 
stop (Sequence[str]): The stop strings. Generation will stop if it \ sees any of these strings. max_tokens (int, optional): Upper bounds of the number of tokens to \ generate. The model will not be asked to generate more than \ max_tokens, but the chunks may contain more than max_tokens. \ Defaults to 128. sampler (Mapping[str, Any], optional): Sampling parameters. Only \ accepts parameters that `llama_cpp.Llama.generate()` uses. \ Defaults to {"top_k": 40, "top_p": 0.95, "temp": 0.8, \ "repeat_penalty": 1.1}. logit_processors (Sequence[LogitsProcessor] | LogitsProcessorList | \ None, optional): Defaults to None. stopping_criteria (Sequence[StoppingCriteria] | StoppingCriteriaList \ | None, optional): Defaults to None. Raises: ValueError: Raised when there is not enough space in the model \ context window to generate completion tokens. Yields: Generator[ChunkContent, None, None]: Generator object that emits \ ChunkContent objects that contain the text completion. """ verbose = llama_model.verbose # cache info len_prompt_tokens = len(prompt_tokens) context_size = llama_model.n_ctx() token_eos = llama_model.token_eos() # fudge max_tokens to stay under the n_ctx() context size limit max_tokens = min(max_tokens, context_size - len_prompt_tokens) if max_tokens <= 0: raise ValueError(f"Not enough space left in context window. " f"n_ctx = {context_size} n_tokens = {len_prompt_tokens}") # establish the logit processors if logit_processors is None: _logit_processors = LogitsProcessorList() elif not isinstance(logit_processors, LogitsProcessorList): _logit_processors = LogitsProcessorList(logit_processors) # establish the stopping criteria if stopping_criteria is None: _stopping_criteria = StoppingCriteriaList() elif not isinstance(stopping_criteria, StoppingCriteriaList): _stopping_criteria = StoppingCriteriaList(stopping_criteria) # stop_reason is modified by _generate_tokens(), _process_stops() # and is read from by _inject_stop_reason() # Accumulates the various stop reasons that may occur during generation. # More than one may happen, ie. the stop sequence may *exactly* hit the # length limit. Only the first reason encountered is reported. stop_reason: list[Literal["", "stop", "length"]] = [] # Sub-generator definitions: def _generate_tokens() -> Generator[int, None, None]: """Wraps `Llama.generate()`. Outputs the sequence of tokens with the \ given prompt, sampler params, logit processors, and stopping \ criteria. Stops generating when either the length specified is \ reached ,when the context size is reached, or when encountering \ the stop token. Not intended to be used externally. Side Effects: stop_reason Yields: Generator[int, None, None]: A generator object that yields the \ tokens. Should be passed to `_process_stops()`. 
""" num_generated_tokens = 0 # main token generation loop for token in llama_model.generate( prompt_tokens, **sampler, logits_processor=_logit_processors, stopping_criteria=_stopping_criteria ): # handle end of stream token if token == token_eos: if verbose: print("my_create_completion: STOPPED eos token") stop_reason.append("stop") return num_generated_tokens += 1 # pass the token on yield token # handle too many tokens if num_generated_tokens >= max_tokens or len_prompt_tokens + num_generated_tokens >= context_size: if verbose: print("my_create_completion: LENGTH EXCEEDED") stop_reason.append("length") return return def _process_stops(tokens: Iterable[int]) -> Generator[int, None, None]: """Takes a raw token stream (excluding the EOS token) and truncates \ the stream when a stop sequence is found. Mutates sequence of \ tokens to ensure no tokens in the output stream will generate any \ of the stop sequences. Not intended to be used externally. Side Effects: stop_reason Args: tokens (Iterable[int]): The input tokens stream. Should exclude \ EOS token. Intended to be the output from \ `_generate_tokens()`. Yields: Generator[int, None, None]: A generator object that yields the \ token stream sans token sequences that correspond to a stop \ sequence. Should be passed to `_process_as_chunks()`. """ # _Stopper does the heavy lifting stopper = _Stopper(stop, llama_model) buffer: list[int] = [] for token in tokens: buffer.append(token) # buffer underflow if len(buffer) < stopper.max_sequence_len: continue if stopper(buffer): # we have stopped. if verbose: print("my_create_completion: STOPPED stop sequence") # output the stop reason stop_reason.append("stop") # now we remove the found stop sequence from the buffer # via re-forming the token buffer without the stop sequence # bytes. # NOTE: this can noticeably change the size of the buffer if # the stop sequence occurs part-way through a token. _bytes_temp = llama_model.detokenize(buffer) _stop_location = _bytes_temp.find(stopper.found_sequence) _bytes_temp = _bytes_temp[:_stop_location] buffer = llama_model.tokenize(_bytes_temp, add_bos=False) break # buffer overflow while len(buffer) > stopper.max_sequence_len: yield buffer.pop(0) # flush buffer after stoppage while buffer: yield buffer.pop(0) return def _process_as_chunks(tokens: Iterable[int]) -> Generator[ChunkContent, None, None]: """Takes a stream of tokens and turns them into a stream of \ `ChunkContent`'s that contain the decoded text the token stream \ denotes. The output chunks will always have at least 1 token \ present. The output chunks will not have text present if the \ tokens in the chunk cannot be utf-8 decoded. Not intended to be used externally. Args: tokens (Iterable[int]): The input tokens stream. Intended to be \ the output from `_process_stops()`. Yields: Generator[ChunkContent, None, None]: Generator that yields the \ chunk stream. Should be passed to `_inject_stop_reason()`. 
""" # the buffer of tokens to attempt decoding decode_buffer: list[int] = [] for token in tokens: decode_buffer.append(token) try: # attempt decode # use `errors='strict'` to force exceptions on non utf-8 # decodable byte strings text_part = llama_model.detokenize(decode_buffer).decode("utf-8", errors="strict") except (UnicodeError, ValueError): # buffer underflow continue else: # yield decoded text as chunk along with the tokens in the # buffer chunk = ChunkContent(text_part, decode_buffer, "") yield chunk # reset buffer decode_buffer = [] # make sure we pass on all tokens as chunks even if we can't decode # them if len(decode_buffer) > 0: # use `errors='ignore'` because we already know we cant decode # this chunk, so we are relying on the caller to use the chunks # tokens instead of its text. chunk = ChunkContent(llama_model.detokenize(decode_buffer).decode("utf-8", errors="ignore"), decode_buffer, "") yield chunk def _inject_stop_reason(chunks: Iterable[ChunkContent]) -> Generator[ChunkContent, None, None]: """Takes a stream of `ChunkContent`'s and injects the first known \ `stop_reason` into the last chunk in the stream. Not intended to be used externally. Args: chunks (Iterable[ChunkContent]): The stream of chunks. Intended to \ be the output from `_process_as_chunks()`. Yields: Generator[ChunkContent, None, None]: generator object that yields \ the stream of chunks, where the last chunk will have its \ `stop_reason` set to `stop_reason[0]`. """ buffer: list[ChunkContent] = [] for chunk in chunks: buffer.append(chunk) buffer_size = 1 if stop_reason else 0 # buffer underflow only if there is a stop reason present if len(buffer) < buffer_size: continue # buffer overflow while len(buffer) > buffer_size: yield buffer.pop(0) # flush the buffer and assign the stop reason to the chunk while buffer: chunk = buffer.pop(0) chunk.stop_reason = stop_reason[0] if stop_reason else "" yield chunk # load cache load_cache(llama_model, prompt_tokens) # for accumulating the tokens to save into the cache at the end generated_tokens: list[int] = [] # run the generators in a cascade for chunk in _inject_stop_reason(_process_as_chunks(_process_stops(_generate_tokens()))): # save tokens for caching later generated_tokens.extend(chunk.tokens) if verbose: print(f"my_create_completion: yielding " f"{chunk.tokens} " f"'{chunk.text}' " f"{llama_model.detokenize(chunk.tokens)}, " f"{chunk.stop_reason}" ) # dole out the chunks! yield chunk # save cache. Make sure to truncate to context size because the # _process_stops() buffer re-forming can make a *longer* sequence save_cache(llama_model, (list(prompt_tokens) + generated_tokens)[:context_size]) return ```
tarpeyd12 commented 1 year ago

@iactix Did my code help at all?

iactix commented 1 year ago

I wasn't able to try it yet. Note that I am not involved in working on llama-cpp-python, at least not yet. Have you tried making a pull request for your fix?

tarpeyd12 commented 1 year ago

Have you tried making a pull request for your fix?

Unfortunately my code isn't compatible with the base library. llama-cpp-python has a rather complex and monolithic implementation of create_completion that I don't fully understand; it is difficult to grok why certain things are done. I have a rough idea of how I could fix the problem in a pull request, but I'm unsure whether it would fit the library's requirements, mainly the goal of staying compatible with the OpenAI API, since I have no clue about its intricacies.

My code is more of a functional demonstration of how a fix could be implemented. It is what I am currently using, and it is hopefully usable enough as a temporary alternative. It should be relatively trivial to wrap my create_completion implementation into a fully compatible replacement (I think; my code is more fit for my own purposes than for drop-in replacement).

Equim-chan commented 1 year ago

I also did a rather straightforward fix myself by simply removing all these .decode calls, making it yield bytes instead of str all the time. This lets me handle the UTF-8 decoding properly on the caller side.
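
For anyone taking the same bytes-yielding approach, the standard library's incremental decoder handles the caller-side buffering; a minimal sketch, where the chunk source is made up for illustration:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# Pretend these byte chunks come from a stream that yields raw token bytes;
# the 😂 emoji (b'\xf0\x9f\x98\x82') arrives split across chunks.
for chunk in (b"\xf0\x9f", b"\x98", b"\x82", b" ok"):
    text = decoder.decode(chunk)          # "" while a character is still incomplete
    if text:
        print(text, end="", flush=True)

print(decoder.decode(b"", final=True))    # flush whatever is left at end of stream
```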

tarpeyd12 commented 1 year ago

I was looking through the commits and it looks like it might be fixed. https://github.com/abetlen/llama-cpp-python/blob/18337267c175f88c46fbad079a7c18da57e1b520/llama_cpp/llama.py#L1069-L1075

I have not tested it yet though.

iactix commented 1 year ago

Implemented streaming in my solution, got no emojis, then updated to 0.1.84, repeated the exact situation, now I'm getting emojis! 🥳

Probably needs some more proper testing but it seems fine so far!

dieharders commented 9 months ago

Has this issue been resolved?

I am getting a strange blank character at the start of my responses. The other issues I've seen center around emojis and multi-part foreign-language characters, and the comments say it's fixed, but the issues remain open...

hzgdeerHo commented 6 months ago

I used the latest release version, but I still get display errors in Chinese when I run the following code with chat_format="llama-2":

llm = Llama.from_pretrained(
    repo_id=args.model_name_or_path,
    chat_format="llama-2",
    filename="phind-codellama-34b-v2.Q6_K.gguf",
    n_ctx=12000,
    tokenizer=tokenizer,
    n_gpu_layers=-1,
    verbose=False
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an AI that follows instructions extremely well. Help as much as you can. Remember, response in Chinese."},
        {"role": "user", "content": user_prompt}
    ],
    stream=True,
    mirostat_mode=2, mirostat_tau=4.0, mirostat_eta=1.1
)

start_time = time.time()
bot_message = ''
print('Human:', history[-1][0])
print('Assistant: ', end='', flush=True)
full_content = ""
for chunk in output:
    delta = chunk['choices'][0]['delta']

    if 'role' in delta:
        print(delta['role'], end=': ')
    elif 'content' in delta:
        # encode and then decode the content to avoid garbled characters
        content_encoded = delta['content'].encode('utf-8')
        content_decoded = content_encoded.decode('utf-8')
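
As a side note on the encode/decode step in that loop: once a chunk is already a Python str, encoding it to UTF-8 and decoding it back is a lossless round trip, so it cannot restore characters that were dropped or split earlier in the stream. A quick check:

```python
s = "好"                                         # any already-decoded str
assert s.encode("utf-8").decode("utf-8") == s    # the round trip is a no-op
```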

Can anyone help? The model is TheBloke/Phind-CodeLlama-34B-v2-GGUF.