abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[BUG?] `finish_reason` is None when using `create_chat_completion(stream=True)` #735

Closed: NightMachinery closed this issue 1 year ago

NightMachinery commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Current Behavior

from icecream import ic

from llama_cpp import Llama
from llama_cpp import ChatCompletionMessage

llm = Llama(
    model_path="/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=-1,
    # n_gpu_layers=0,
)

def print_chat_streaming(output, debug_p=True):
    """
    Process and print out chat completions from a model when the stream is set to True.

    Args:
        output (iterable): The output from the model with stream=True.
        debug_p (bool): If True, print the last received chunk for debugging.
    """
    for r in output:
        delta = r["choices"][0]['delta']
        if 'role' in delta:
            print(f"\n{delta['role']}: ", end='')
        if 'content' in delta:
            print(f"{delta['content']}", end='')
    print("\n")

    if debug_p:
        ic(r)

output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            # role="user",
            role="system",
            content=r"""You're a helpful programming assistant who answers the questions the user asks of you concisely and accurately. As you're a senior engineer working at Google with a PhD in distributed systems, you're extremely smart. You take a deep breath before answering the question and solve the question step by step.""",
        ),
        ChatCompletionMessage(
            role="user",
            content=r"""List groups my linux user is in""",
        ),
    ],
    max_tokens=256,
    stop=[],
    temperature=0,
    stream=True,
)

print_chat_streaming(output)
Llama.generate: prefix-match hit

assistant: To list all the groups that your Linux user belongs to, run the following command:

...

This will display a space-separated list of all the groups that you belong to. 

ic| r: {'choices': [{'delta': {'content': ' '}, 'finish_reason': None, 'index': 0}],
        'created': 1695045448,
        'id': 'chatcmpl-4f8489ed-f56a-4fa9-b42f-cdc753de93b8',
        'model': '/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf',
        'object': 'chat.completion.chunk'}

llama_print_timings:        load time =   458.52 ms
llama_print_timings:      sample time =    42.46 ms /    63 runs   (    0.67 ms per token,  1483.61 tokens per second)
llama_print_timings: prompt eval time =   411.41 ms /    12 tokens (   34.28 ms per token,    29.17 tokens per second)
llama_print_timings:        eval time =  2738.43 ms /    62 runs   (   44.17 ms per token,    22.64 tokens per second)
llama_print_timings:       total time =  3349.54 ms

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          64
On-line CPU(s) list:             0-63
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        7
BogoMIPS:                        4190.15
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       2 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        64 MiB (64 instances)
L3 cache:                        55 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

$ uname -a

Linux gpu7 5.15.0-75-generic #82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.12

$ make --version
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
abetlen commented 1 year ago

@NightMachinery are you saying finish_reason is None on the last message? That would be a bug; otherwise it's in line with the OpenAI API.
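For reference, in the OpenAI streaming format the intermediate chunks carry finish_reason: None and only the terminal chunk sets it. A well-formed final chunk would look roughly like this (the id/model values here are placeholders):

{'choices': [{'delta': {}, 'finish_reason': 'stop', 'index': 0}],
 'created': 1695045448,
 'id': 'chatcmpl-...',
 'model': '...',
 'object': 'chat.completion.chunk'}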

NightMachinery commented 1 year ago

@abetlen Yes, finish_reason is None on the last message.

earonesty commented 1 year ago

Workaround: check whether the content is empty and stuff in a finish_reason.
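Something like this caller-side sketch: it buffers one chunk so the final one can be patched before it is handed on, and fills in "stop" if finish_reason is still None (the choice of "stop" as the default is an assumption; it could equally be "length" when max_tokens is hit):

def patch_finish_reason(output):
    # Buffer one chunk so the final chunk can be patched before it is yielded.
    prev = None
    for r in output:
        if prev is not None:
            yield prev
        prev = r
    if prev is not None:
        if prev["choices"][0].get("finish_reason") is None:
            prev["choices"][0]["finish_reason"] = "stop"  # assumed default
        yield prev

# usage: print_chat_streaming(patch_finish_reason(output))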

gmcgoldr commented 1 year ago

I encountered the same issue. I did some digging around the code base and I might have an explanation:

In summary: when stream == True, a return statement is reached that prevents the function from ever reaching the final yield statement, which is the one that carries the finish_reason. There are two other yield statements reachable when stream == True that could carry the finish_reason, but they are skipped when remaining_tokens is empty.

I think there are two possible solutions, but without a deeper understanding of the code base I can't know which (if either) is correct:

  1. Remove the return statement (it's not clear to me that it needs to be there): https://github.com/abetlen/llama-cpp-python/blob/43dfe1e2abef2ef0d873732ed65986eb9c3e379f/llama_cpp/llama.py#L1282
  2. Pull out (and consolidate) the yields from the remaining_tokens loop so that the final yield is triggered even when remaining_tokens is empty (sketched below).
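For illustration, a generic sketch of what option 2 looks like. This is not the actual _create_completion code; remaining_tokens and finish_reason are stand-ins for whatever the real function tracks. The point is that the terminal yield moves out of the token loop, so it fires even when the loop body never runs:

def _stream_chunks(remaining_tokens, finish_reason):
    # Content chunks: finish_reason stays None while tokens are flowing.
    for text in remaining_tokens:
        yield {"choices": [{"delta": {"content": text}, "finish_reason": None, "index": 0}]}
    # Terminal chunk: yielded unconditionally, even if remaining_tokens was empty,
    # so the consumer always sees a finish_reason.
    yield {"choices": [{"delta": {}, "finish_reason": finish_reason, "index": 0}]}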

gmcgoldr commented 1 year ago

I read over _create_completion, and the return statement is in fact correct: the code that follows it has already been executed inside the if stream branch from which return is called.

I opened a PR that implements the second solution; it works for me. @NightMachinery, you can use it if you still need this fixed.