abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[BUG?] `finish_reason` is None when using `create_chat_completion(stream=True)` #735

Closed: NightMachinery closed this issue 1 year ago

NightMachinery commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Current Behavior

from icecream import ic

from llama_cpp import Llama
from llama_cpp import ChatCompletionMessage

llm = Llama(
    model_path="/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=-1,
    # n_gpu_layers=0,
)

def print_chat_streaming(output, debug_p=True):
    """
    Process and print out chat completions from a model when the stream is set to True.

    Args:
        output (iterable): The output from the model with stream=True.
        debug_p (bool): If True, print the last received chunk for debugging.
    """
    for r in output:
        delta = r["choices"][0]['delta']
        if 'role' in delta:
            print(f"\n{delta['role']}: ", end='')
        if 'content' in delta:
            print(f"{delta['content']}", end='')
    print("\n")

    if debug_p:
        ic(r)

output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            # role="user",
            role="system",
            content=r"""You're a helpful programming assistant who answers the questions the user asks of you concisely and accurately. As you're a senior engineer working at Google with a PhD in distributed systems, you're extremely smart. You take a deep breath before answering the question and solve the question step by step.""",
        ),
        ChatCompletionMessage(
            role="user",
            content=r"""List groups my linux user is in""",
        ),
    ],
    max_tokens=256,
    stop=[],
    temperature=0,
    stream=True,
)

print_chat_streaming(output)
Llama.generate: prefix-match hit

assistant: To list all the groups that your Linux user belongs to, run the following command:

...

This will display a space-separated list of all the groups that you belong to. 

ic| r: {'choices': [{'delta': {'content': ' '}, 'finish_reason': None, 'index': 0}],
        'created': 1695045448,
        'id': 'chatcmpl-4f8489ed-f56a-4fa9-b42f-cdc753de93b8',
        'model': '/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf',
        'object': 'chat.completion.chunk'}

llama_print_timings:        load time =   458.52 ms
llama_print_timings:      sample time =    42.46 ms /    63 runs   (    0.67 ms per token,  1483.61 tokens per second)
llama_print_timings: prompt eval time =   411.41 ms /    12 tokens (   34.28 ms per token,    29.17 tokens per second)
llama_print_timings:        eval time =  2738.43 ms /    62 runs   (   44.17 ms per token,    22.64 tokens per second)
llama_print_timings:       total time =  3349.54 ms

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          64
On-line CPU(s) list:             0-63
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        7
BogoMIPS:                        4190.15
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       2 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        64 MiB (64 instances)
L3 cache:                        55 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

$ uname -a

Linux gpu7 5.15.0-75-generic #82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
abetlen commented 1 year ago

@NightMachinery are you saying finish_reason is None on the last message? That would be a bug; otherwise it's in line with the OpenAI API.
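For reference, in the OpenAI streaming format the intermediate chunks carry finish_reason: None and only the terminal chunk sets it. A well-formed final chunk would look roughly like this (the id/model values here are placeholders):

{'choices': [{'delta': {}, 'finish_reason': 'stop', 'index': 0}],
 'created': 1695045448,
 'id': 'chatcmpl-...',
 'model': '...',
 'object': 'chat.completion.chunk'}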

NightMachinery commented 1 year ago

@abetlen Yes, finish_reason is None on the last message.

earonesty commented 1 year ago

Workaround: check whether the content is empty and stuff in a finish_reason.
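Something like this caller-side sketch: it buffers one chunk so the final one can be patched before it is handed on, and fills in "stop" if finish_reason is still None (the choice of "stop" as the default is an assumption; it could equally be "length" when max_tokens is hit):

def patch_finish_reason(output):
    # Buffer one chunk so the final chunk can be patched before it is yielded.
    prev = None
    for r in output:
        if prev is not None:
            yield prev
        prev = r
    if prev is not None:
        if prev["choices"][0].get("finish_reason") is None:
            prev["choices"][0]["finish_reason"] = "stop"  # assumed default
        yield prev

# usage: print_chat_streaming(patch_finish_reason(output))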

gmcgoldr commented 1 year ago

I encountered the same issue. I did some digging around the code base and I might have an explanation:

In summary: when stream == True, a return statement is reached that prevents the function from ever reaching the final yield statement, which is the one that carries the finish_reason. There are two other yield statements reachable when stream == True that could carry the finish_reason, but they are skipped when remaining_tokens is empty.

I think there are two possible solutions, but without a deeper understanding of the code base I can't know which (if either) is correct:

  1. Remove the return statement (it's not clear to me that it needs to be there): https://github.com/abetlen/llama-cpp-python/blob/43dfe1e2abef2ef0d873732ed65986eb9c3e379f/llama_cpp/llama.py#L1282
  2. Pull out (and consolidate) the yields from the remaining_tokens loop so that the final yield is triggered even when remaining_tokens is empty (sketched below).
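For illustration, a generic sketch of what option 2 looks like. This is not the actual _create_completion code; remaining_tokens and finish_reason are stand-ins for whatever the real function tracks. The point is that the terminal yield moves out of the token loop, so it fires even when the loop body never runs:

def _stream_chunks(remaining_tokens, finish_reason):
    # Content chunks: finish_reason stays None while tokens are flowing.
    for text in remaining_tokens:
        yield {"choices": [{"delta": {"content": text}, "finish_reason": None, "index": 0}]}
    # Terminal chunk: yielded unconditionally, even if remaining_tokens was empty,
    # so the consumer always sees a finish_reason.
    yield {"choices": [{"delta": {}, "finish_reason": finish_reason, "index": 0}]}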

gmcgoldr commented 1 year ago

I read over _create_completion, and the return statement is in fact correct: the code that follows it has already been executed inside the if stream branch from which return is called.

I opened a PR that implements the second solution; it works for me. @NightMachinery, you can use it if you still need this fixed.