@thiswillbeyourgithub thanks for reporting, it's related to #1263. The issue is that a null pointer is returned from the new get_embeddings_seq if the pooling type is not set; I've set it to mean now. @iamlemec does that sound correct?
I think the issue here is that gemma-2b is not an embedding model. This should work if you use something like bge-base-en-v1.5 (GGUF here). I think setting the default pooling to unspecified, as in the recent commit, is the right route. Ultimately, for the error message, you may want to say that the model doesn't support sequence embeddings, which will be the case when hparams.pooling_type is LLAMA_POOLING_TYPE_UNSPECIFIED.
Part of the problem is that the pooling layer is actually considered part of the model, not something that can be applied to arbitrary models ex post, though this could obviously change. So right now, if you want to get embeddings from generative LLMs, you need to set LLAMA_POOLING_TYPE_NONE, use llama_get_embeddings_ith, and manually pool the token-level embeddings however you'd like. We actually made an example that does this with GritLM, a dual-use model that does both generation and embeddings (see examples/gritlm in llama.cpp).
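For illustration, here is a minimal sketch of what that manual pooling step can look like once you have collected the per-token vectors; `token_embeddings` is a hypothetical `(n_tokens, n_embd)` array, not the output of any specific API call:

```python
# Minimal sketch of manual mean pooling, assuming the token-level
# embeddings were already gathered (e.g. via llama_get_embeddings_ith
# with LLAMA_POOLING_TYPE_NONE). `token_embeddings` is a hypothetical
# (n_tokens, n_embd) float array.
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average the per-token vectors into one sequence embedding."""
    pooled = token_embeddings.mean(axis=0)
    # Sequence embeddings are usually compared with cosine similarity,
    # so L2-normalize the pooled vector.
    return pooled / np.linalg.norm(pooled)
```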
@iamlemec thanks, that makes sense. Would it then make sense to use pooling type unspecified by default, then check whether the result of get_embeddings_seq is null, and if it is, fall back to get_embeddings_ith?
Having unspecified as the default makes sense. The issue with falling back to get_embeddings_ith is that it'll give you the ith token, not the sequence. So in that case, I think you either need to just say "this model doesn't do embeddings" or implement pooling on the Python side (basically first-token or mean pooling). Another option would be to give the user a way to just get the token-level embeddings and let them figure it out, which would be useful for ColBERT-style approaches.
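As a hedged illustration of that ColBERT-style use case, late interaction scores a query against a document directly from the token-level matrices; the inputs here are hypothetical `(n_tokens, n_embd)` arrays of row-normalized token embeddings:

```python
# Sketch of ColBERT-style MaxSim scoring over raw token-level
# embeddings. Both inputs are assumed to be L2-normalized per row,
# so the dot products below are cosine similarities.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # (n_q, d) @ (d, n_d) -> (n_q, n_d) pairwise similarity matrix
    sims = query_tokens @ doc_tokens.T
    # MaxSim: for each query token, take its best-matching document
    # token, then sum over the query tokens.
    return float(sims.max(axis=1).sum())
```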
Thank you all. Can anyone tell me how I should proceed to get the embeddings of each token of a sentence? I could then at least do the pooling myself.
@thiswillbeyourgithub new update! With the latest code on main you can pass pooling_type=LLAMA_POOLING_TYPE_NONE to the constructor, and it will then give you token-level embeddings.
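A quick sketch of that usage, assuming the latest main and a placeholder model path:

```python
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model.gguf",  # placeholder path
    embedding=True,
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_NONE,  # disable pooling
)

# With pooling disabled, embed() should return one vector per token
# instead of a single pooled sequence vector.
token_embeddings = llm.embed("Hi")
```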
Thank you very much!
Can this be used with LlamaCppEmbeddings?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
As seen on 0.2.55:
Current Behavior
As seen on 0.2.56:
Environment and Context
The command I use to switch versions:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.2.55 --no-cache-dir
Linux REDACTED-MS-7758 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Failure Information (for bugs)
The failure happens no matter what arguments I pass when loading the model. It also happens for the non-quantized model. It does not happen when loading the model for text generation, but it does happen for embeddings.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
```python
from pathlib import Path
import llama_cpp

llm = llama_cpp.Llama(
    model_path=str(Path("gemma-2b-q4_K_M.gguf").absolute()),
    embedding=True,
)
llm.create_embedding("Hi")
```
I confirm this issue does not happen when running ./embedding from llama.cpp at the latest commit b3d978600f07f22e94f2e797f18a8b5f6df23c89. I just need to use gemma for langchain embeddings, so using 0.2.55 is fine for me; just a heads up for the devs :)