Fuehnix opened this issue 5 months ago
I'm getting the same issue, so would be good to know if you found a solution.
What worked for me was just to downgrade to llama-cpp-python==0.2.47
Yeah, right now we don't support getting token level embeddings. So generative models like llama-2 that lack pooling layers won't work.
Are you looking for token level embeddings or sequence level embeddings? If the latter, I would use an embedding model like BAAI/bge-*. This is a more typical approach.
It might actually be a decent idea to just return token level embeddings when sequence level aren't available.
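For anyone who wants to try the sequence level route with llama-cpp-python itself, here is a minimal sketch; the GGUF file name is only an assumption for illustration, and any bge-style embedding model with a pooling layer should behave similarly:

from llama_cpp import Llama

# Assumed local GGUF of a dedicated embedding model (file name is illustrative).
embedder = Llama(model_path="bge-small-en-v1.5-f16.gguf", embedding=True)

# create_embedding returns an OpenAI-style response dict; for an embedding model
# with a pooling layer, each item's "embedding" is a single sequence level vector.
result = embedder.create_embedding("Hello world!")
vector = result["data"][0]["embedding"]
print(len(vector))  # dimensionality of the pooled embedding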
Should be related to #1269, 0.2.55 also still works for me.
I ended up using 0.2.55, and it seems others reached the same conclusion. I later switched away from llama.cpp for the embedding part, but before that I had it working with 0.2.55.
I guess I was looking for sequence level embeddings? I was naively using llama2 for embeddings just to see if things worked, but I wasn't aware of any low level problems with doing that. I've since switched to mpnet from HuggingFaceEmbeddings in Langchain for much better quality results (while using llama-cpp-python for inference).
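For reference, a rough sketch of that mpnet setup; the model id is assumed to be the standard sentence-transformers checkpoint (the comment above only says "mpnet"), and the import path depends on your LangChain version:

from langchain_community.embeddings import HuggingFaceEmbeddings  # langchain.embeddings on older versions

# Assumed model id, not necessarily the exact one used in the comment above.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

query_vector = embeddings.embed_query("Hello world!")            # one pooled vector per text
doc_vectors = embeddings.embed_documents(["first doc", "second doc"])
print(len(query_vector), len(doc_vectors))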
@Fuehnix sorry about the trouble, working on a fix to just enable the older behaviour by default in #1272.
Also running into this issue. Have tried all the way up to v0.2.61, seems like only v0.2.55 is working.
same error with version 0.2.60
Still an issue on 0.2.63 and 0.2.64.
Still an issue on 0.2.75
@r3v1 Is it still raising an error, or is it just that it's returning token level embeddings as a list of lists? Generative models like these don't do pooling intrinsically in llama.cpp, and in fact it's not really recommended to use them for embedding purposes. But if you do need pooled embeddings, you'll have to do it manually from the token level embeddings.
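A rough sketch of that manual pooling, assuming embed() returns one vector per token for a generative model (exact return shapes vary between versions, so treat this as a workaround sketch rather than a supported API):

import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.Q5_K_M.gguf", embedding=True)

# For a generative model without a pooling layer, embed() yields token level
# embeddings: one vector per token (a list of lists).
token_embeddings = llm.embed("Hello world!")

# Manual mean pooling: average over the token axis to get one sequence vector.
sequence_embedding = np.mean(np.asarray(token_embeddings), axis=0)
print(sequence_embedding.shape)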
What if I would like to store embeddings in a vector store through Langchain? It should return a single, one-dimensional vector.
In recent llama-cpp-python versions, setting pooling_type=LLAMA_POOLING_TYPE_MEAN throws:
...
Guessed chat format: llama-3
GGML_ASSERT: /home/david/git/llama-cpp-python/vendor/llama.cpp/llama.cpp:11171: lctx.inp_mean
ptrace: Operation not permitted.
No stack.
The program is not being run.
[1] 29907 IOT instruction (core dumped)
The MWE:
import llama_cpp
from llama_cpp import LLAMA_POOLING_TYPE_MEAN

llm = llama_cpp.Llama(
    model_path="meta-llama-3-8b-instruct.Q4_K_M.gguf",
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_MEAN,  # Crashes
)
llm.create_embedding(["Hello world"])
Otherwise, without specifying pooling_type, it returns token level embeddings. However, version 0.2.55 works as wanted, returning just a sentence level embedding.
Yeah, the langchain interop code is unfortunately broken right now for getting embeddings from generative models. For it to work in this case, we'd need to implement manual pooling somewhere. But if you're doing anything like retrieval or classification, you can get much better results with smaller embedding models like bge-*/jina/nomic that work as expected here. Check out the MTEB leaderboard on Huggingface.
I think 0.2.55 should work fine in this case, though I suspect it may fail or crash if you try to do it with more than one sequence per call to create_embedding.
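Until the interop is fixed, one possible stopgap is a small custom Embeddings wrapper that does the pooling itself before handing vectors to a vector store. This is only a sketch (not the built-in LlamaCppEmbeddings), and it assumes embed() returns token level vectors for generative models:

from typing import List

import numpy as np
from langchain_core.embeddings import Embeddings
from llama_cpp import Llama

class PooledLlamaCppEmbeddings(Embeddings):
    """Hypothetical wrapper: mean-pools llama-cpp-python token level output so
    LangChain vector stores receive a single vector per text."""

    def __init__(self, model_path: str):
        self._llm = Llama(model_path=model_path, embedding=True)

    def _pool(self, text: str) -> List[float]:
        token_vectors = self._llm.embed(text)  # list of per-token vectors
        return np.mean(np.asarray(token_vectors), axis=0).tolist()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self._pool(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._pool(text)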
Sure, I will take a look. I was trying to do all the steps of the RAG pipeline with a single model, for some experimentation.
If you set pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE, it should work fine. I haven't tested builds later than 0.2.68, but that one seems to work fine. I believe LLAMA_POOLING_TYPE_MEAN crashes on older models that lack certain data, so you likely cannot use that at all.
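To make that concrete, a minimal sketch of the suggested workaround (constants as exposed by llama_cpp; the token level output can then be pooled manually as in the earlier sketch):

import llama_cpp

llm = llama_cpp.Llama(
    model_path="meta-llama-3-8b-instruct.Q4_K_M.gguf",
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE,  # avoids the MEAN-pooling assert
)

# With pooling disabled this returns token level vectors (one per token),
# which you can average yourself if a single sentence embedding is needed.
token_vectors = llm.embed("Hello world")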
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected llama-cpp-python to do.
I tried running a simple hello world embedding query to make sure llama-cpp-python was working after doing a clean install of CentOS 9 with CUDA, Python 3.11.8, VSCode and the works.
Code:
Expected output (this is the output given for this code when I downgrade the llama-cpp-python package to 0.2.55):
Current Behavior
Please provide a detailed written description of what llama-cpp-python did, instead.
When using 0.2.57 of llama-cpp-python (the version autoinstalled by pip):
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. Install Python 3.11.8, with CUDA 12.4 and drivers. sudo install all the required backend modules for Python, such as:
2. Set up a venv in vscode and install required packages (in my scenario, I believe I may have initially allowed llama-cpp-python to install without specifying CMAKE args, not sure if this is the root cause, but I believe this should have only resulted in me running on CPU based default and later reinstalling, right?)
3. Run the provided HelloWorld embedding query with llama_cpp. Code:
from llama_cpp import Llama
llm = Llama(model_path="/home/jfuehne/Desktop/AI/Code/models/llama-2-13b-chat.Q5_K_M.gguf", embedding=True)
print(llm.create_embedding("Hello world!"))