ggerganov / llama.cpp

LLM inference in C/C++

toy LlamaCppEmbeddings causes ggml allocation failure #3689

Closed: devzzzero closed this issue 4 months ago

devzzzero commented 10 months ago

Expected Behavior

I expected the toy embedding example to "work"

Current Behavior

ggml_allocr_alloc: not enough space in the buffer (needed 442368, largest block available 290848)
GGML_ASSERT: C:\Users\jason\AppData\Local\Temp\pip-install-4x0xr_93\llama-cpp-python_fec9a526add744f5b2436cab2e2c4c28\vendor\llama.cpp\ggml-alloc.c:173: !"not enough space in the buffer"

Environment and Context

I tried adapting the first toy vector-DB embedding example from https://learn.activeloop.ai/courses/take/langchain/multimedia/46317643-langchain-101-from-zero-to-hero. Here is the code:

import os
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# instantiate the LLM and embeddings models
llm = LlamaCpp(model_path="llama-2-13b-chat.Q5_K_M.gguf",
               temperature=0,
               max_tokens=1000,
               top_p=1,
               verbose=True)
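# note: the line below loads the same 13B GGUF a second time, so the embedder
# and the LLM above each hold their own copy of the model in memory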
embeddings = LlamaCppEmbeddings(model_path="llama-2-13b-chat.Q5_K_M.gguf")

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<SOME_ID>"
my_activeloop_dataset_name = "langchain_llama_00"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
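# (the ggml_allocr_alloc failure above appears to be raised here: add_documents
# embeds each chunk via embeddings.embed_documents under the hood)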
db.add_documents(docs)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever())

from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType

tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

response = agent.run("When was Napoleone born?")
print(response)
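
One variation I have not tried yet: constructing the embedder with an explicit context and batch size, since the assertion comes out of ggml's buffer allocator. Both keyword arguments below exist on LangChain's LlamaCppEmbeddings, but the values are guesses on my part, not a confirmed fix:

from langchain.embeddings import LlamaCppEmbeddings

# sketch of a possible mitigation (unverified); n_ctx and n_batch are real
# LlamaCppEmbeddings keyword arguments, but these particular values are guesses
embeddings = LlamaCppEmbeddings(
    model_path="llama-2-13b-chat.Q5_K_M.gguf",
    n_ctx=2048,   # raise the context window above the 512 default
    n_batch=512,  # number of tokens evaluated per batch
)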

Failure Information (for bugs)

This MAY be a bug, or it may be a newbie mistake (apologies if it turns out to be the latter).

Steps to Reproduce

I took https://learn.activeloop.ai/courses/take/langchain/multimedia/46317643-langchain-101-from-zero-to-hero and modified the "Napoleon" example for simple vector embeddings, plugging in llama-cpp-python; a stripped-down version is sketched below.
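
A minimal script that should reproduce the same failure without DeepLake or the agent (assuming the same GGUF file is present; I have not re-run this exact version):

from langchain.embeddings import LlamaCppEmbeddings

# minimal sketch: embed_documents is what add_documents ends up calling, so the
# assertion should fire here without any vector store or agent involved
embeddings = LlamaCppEmbeddings(model_path="llama-2-13b-chat.Q5_K_M.gguf")
vectors = embeddings.embed_documents([
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
])
print(len(vectors), len(vectors[0]))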

Failure Logs

Here is a sample run of "main.exe":

...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size = 81.13 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
Input prefix: ' '
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.