Chainlit / chainlit

Build Conversational AI in minutes ⚡️
https://docs.chainlit.io
Apache License 2.0

Working Llama 2 example? (LangChain ConversationChain) #345

Closed · AlessandroSpallina closed this issue 10 months ago

AlessandroSpallina commented 10 months ago

Hi all, I'm unable to find any snippet showing LlamaCpp and ConversationChain integrated with Chainlit, and I'm a bit lost at this point.

Here is the code to reproduce my issue:

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
import chainlit as cl

@cl.on_chat_start
def main():

    template = """### System Prompt
The following is a friendly conversation between a human and an AI optimized to generate source-code. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know:

### Current conversation:
{history}

### User Message
{input}

### Assistant"""

    prompt = PromptTemplate(template=template, input_variables=["history", "input"])

    n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

    # Make sure the model path is correct for your system!
    llm = LlamaCpp(
        model_path="/home/skela/llama.cpp/models/phind-codellama-34b-v2.Q4_K_M.gguf",
        n_batch=n_batch,
        n_ctx=4096,
        temperature=1,
        max_tokens=10000,
        n_threads=64,
        verbose=True, # Verbose is required to pass to the callback manager
        streaming=True
    )

    conversation = ConversationChain(
        prompt=prompt,
        llm=llm,
        memory=ConversationBufferWindowMemory(k=10)
    )

    cl.user_session.set("conv_chain", conversation)

@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")

    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Assistant"]
    )

    res = await conversation.acall(message, callbacks=[cb])

    # Do any post processing here

    await cl.Message(content=res['response']).send()

Here's my "chainlit run app.py -w" output; as you can see, it clearly states that the callback coroutine is never awaited (?)

llama_new_context_with_model: kv self size  =  768.00 MB
llama_new_context_with_model: compute buffer total size =  561.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-09-01 03:23:56 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:23:56 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:00 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:07 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:07 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:07 - 3 changes detected
2023-09-01 03:24:07 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:07 - 3 changes detected
/home/skela/anaconda3/envs/codebot/lib/python3.10/site-packages/langchain/llms/llamacpp.py:352: RuntimeWarning: coroutine 'AsyncCallbackManagerForLLMRun.on_llm_new_token' was never awaited
  run_manager.on_llm_new_token(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

llama_print_timings:        load time =  4927.86 ms
llama_print_timings:      sample time =     1.22 ms /     2 runs   (    0.61 ms per token,  1642.04 tokens per second)
llama_print_timings: prompt eval time =  4927.69 ms /    86 tokens (   57.30 ms per token,    17.45 tokens per second)
llama_print_timings:        eval time =   213.56 ms /     1 runs   (  213.56 ms per token,     4.68 tokens per second)
llama_print_timings:       total time =  5149.67 ms
2023-09-01 03:24:12 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:12 - HTTP Request: POST http://localhost:41143/ "HTTP/1.1 200 OK"
2023-09-01 03:24:12 - 6 changes detected

Used LLM

Phind-CodeLlama-34B-v2-GGUF

Image 1

[screenshot]

Additional info

(codebot) skela@bengala:~/DEVELOP/codellama-chainlit$ pip freeze
aenum==3.1.15
aiofiles==23.2.1
aiohttp==3.8.5
aiosignal==1.3.1
anyio==3.7.1
async-timeout==4.0.3
asyncer==0.0.2
attrs==23.1.0
auth0-python==4.4.1
backoff==2.2.1
bidict==0.22.1
certifi==2023.7.22
cffi==1.15.1
chainlit==0.6.3
charset-normalizer==3.2.0
cheshire-cat-api==1.0.1
click==8.1.7
cryptography==41.0.3
dataclasses-json==0.5.14
Deprecated==1.2.14
diskcache==5.6.3
exceptiongroup==1.1.3
fastapi==0.97.0
fastapi-socketio==0.0.10
filetype==1.2.0
frozenlist==1.4.0
googleapis-common-protos==1.60.0
greenlet==2.0.2
grpcio==1.57.0
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
idna==3.4
importlib-metadata==6.8.0
Jinja2==3.1.2
langchain==0.0.278
langsmith==0.0.31
Lazify==0.4.0
llama-cpp-python==0.1.83
MarkupSafe==2.1.3
marshmallow==3.20.1
multidict==6.0.4
mypy-extensions==1.0.0
nest-asyncio==1.5.7
nodeenv==1.8.0
numexpr==2.8.5
numpy==1.25.2
opentelemetry-api==1.19.0
opentelemetry-exporter-otlp==1.19.0
opentelemetry-exporter-otlp-proto-common==1.19.0
opentelemetry-exporter-otlp-proto-grpc==1.19.0
opentelemetry-exporter-otlp-proto-http==1.19.0
opentelemetry-instrumentation==0.40b0
opentelemetry-proto==1.19.0
opentelemetry-sdk==1.19.0
opentelemetry-semantic-conventions==0.40b0
packaging==23.1
prisma==0.9.1
protobuf==4.24.2
pycparser==2.21
pydantic==1.10.12
PyJWT==2.8.0
pyOpenSSL==23.2.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-engineio==4.6.1
python-graphql-client==0.4.3
python-socketio==5.8.0
PyYAML==6.0.1
requests==2.31.0
sniffio==1.3.0
SQLAlchemy==2.0.20
starlette==0.27.0
syncer==2.0.3
tenacity==8.2.3
tomli==2.0.1
tomlkit==0.12.1
typing-inspect==0.9.0
typing_extensions==4.7.1
uptrace==1.19.0
urllib3==2.0.4
uvicorn==0.22.0
watchfiles==0.19.0
websocket-client==1.6.1
websockets==11.0.3
wrapt==1.15.0
yarl==1.9.2
zipp==3.16.2

(codebot) skela@bengala:~/DEVELOP/codellama-chainlit$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

(codebot) skela@bengala:~/DEVELOP/codellama-chainlit$ uname -a
Linux bengala 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

(codebot) skela@bengala:~/DEVELOP/codellama-chainlit$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  128
  On-line CPU(s) list:   0-127
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           2
    Stepping:            6
    CPU max MHz:         3500.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4400.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep
                         _good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                         er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsg
                         sbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsav
                         es cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
                          vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   3 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    80 MiB (64 instances)
  L3:                    96 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,12
                         2,124,126
  NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,12
                         3,125,127
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Appendix:

This code uses only LangChain (no Chainlit) and shows that I always get streaming and proper output:

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

# template = """### System Prompt
# Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:

# ### User Message
# {question}

# ### Assistant"""

# prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_batch = 4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/home/skela/llama.cpp/models/phind-codellama-34b-v2.Q4_K_M.gguf",
    n_batch=n_batch,
    callback_manager=callback_manager,
    n_ctx=4096,
    temperature=1,
    max_tokens=10000,
    n_threads=64,
    verbose=True, # Verbose is required to pass to the callback manager
)

conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferWindowMemory(k=10)
)

conversation("write me a buggy source code, please")
willydouhard commented 10 months ago

Thank you for your well-written and reproducible issue.

Except for the llm instantiation (more on that later), the code looks OK and should work. However, while running it I noticed a bug in the langchain LlamaCpp code:

[Screenshot of the relevant langchain LlamaCpp code]
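
In short, the async code path creates the coroutine for on_llm_new_token but never awaits it, hence the RuntimeWarning in your log. Here is a minimal, self-contained sketch of that pattern, not the actual langchain source:

import asyncio

class FakeAsyncCallbackManager:
    """Stand-in for langchain's AsyncCallbackManagerForLLMRun (illustrative only)."""

    async def on_llm_new_token(self, token: str) -> None:
        # In the real handler this would forward the token to the Chainlit UI.
        print(token, end="", flush=True)

async def buggy_stream(run_manager: FakeAsyncCallbackManager) -> None:
    # The async callback is invoked like a plain function; the coroutine it
    # returns is never awaited, so no token reaches the handler and Python
    # emits "RuntimeWarning: coroutine ... was never awaited".
    for token in ["Hello", ", ", "world", "!\n"]:
        run_manager.on_llm_new_token(token)

async def fixed_stream(run_manager: FakeAsyncCallbackManager) -> None:
    # Awaiting the coroutine actually delivers every token.
    for token in ["Hello", ", ", "world", "!\n"]:
        await run_manager.on_llm_new_token(token)

asyncio.run(buggy_stream(FakeAsyncCallbackManager()))  # prints nothing, only warns
asyncio.run(fixed_stream(FakeAsyncCallbackManager()))  # streams the tokens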

So what I did was switch back to the sync implementation and wrap it in cl.make_async. I also moved the llm instantiation out of cl.on_chat_start (it would otherwise happen once for each user, which does not seem necessary, especially for local LLMs).

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
import chainlit as cl

@cl.cache
def instantiate_llm():
    n_batch = (
        4096  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    )
    # Make sure the model path is correct for your system!
    llm = LlamaCpp(
        model_path="/Users/willydouhard/Downloads/yarn-llama-2-7b-128k.Q3_K_M.gguf",
        n_batch=n_batch,
        n_ctx=4096,
        temperature=1,
        max_tokens=10000,
        n_threads=64,
        verbose=True,  # Verbose is required to pass to the callback manager
        streaming=True,
    )
    return llm

llm = instantiate_llm()

@cl.on_chat_start
def main():
    template = """### System Prompt
The following is a friendly conversation between a human and an AI optimized to generate source-code. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know:

### Current conversation:
{history}

### User Message
{input}

### Assistant"""

    prompt = PromptTemplate(template=template, input_variables=["history", "input"])

    conversation = ConversationChain(
        prompt=prompt, llm=llm, memory=ConversationBufferWindowMemory(k=10)
    )

    cl.user_session.set("conv_chain", conversation)

@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")

    cb = cl.LangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Assistant"]
    )

    res = await cl.make_async(conversation)(message, callbacks=[cb])

    # Do any post processing here

    await cl.Message(content=res["response"]).send()

Then I was able to see the tokens being streamed to the Chainlit UI. I used the 7B variant of the model you are using, but it should work the same.

willydouhard commented 10 months ago

Final answer streaming only works if the last step of the chain always starts with the same prefix (like Final Answer). However, if you know your chain only has one step, you can force final answer streaming by manually setting answer_reached to True after instantiating the callback handler and before calling the chain.

cb.answer_reached = True
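
For example, adapting the on_message handler above (a sketch assuming the same chain setup as in the previous snippet):

import chainlit as cl

@cl.on_message
async def main(message: str):
    conversation = cl.user_session.get("conv_chain")

    cb = cl.LangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Assistant"]
    )
    # The chain has a single step, so force final-answer streaming right away
    # instead of waiting for the answer prefix to appear in the output.
    cb.answer_reached = True

    res = await cl.make_async(conversation)(message, callbacks=[cb])

    await cl.Message(content=res["response"]).send()
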
AlessandroSpallina commented 10 months ago

Many thanks for your fast response, the intermediate streaming worked like a charm! To get the final answer streaming as well, I updated the callback according to your suggestion:

    cb = cl.LangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["Response"]
    )

I just replaced answer_prefix_tokens=["Assistant"] with answer_prefix_tokens=["Response"]. This works because the last word of the prompt I'm using is "Assistant", and the LLM always starts its completion with "Response" before actually answering the question.

willydouhard commented 10 months ago

That was not the case for the 7B model, but the one you use seems smarter!