abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Unexpected output of embed #1469

Open shizidushu opened 5 months ago

shizidushu commented 5 months ago

Here is the code I use to get an embedding.

from llama_cpp import Llama

llm = Llama(
      model_path=r'D:\4-Working-Project\llama.cpp-lab\models\Alibaba-NLP--gte-Qwen1.5-7B-instruct.Q4_K_M.gguf',
      embedding=True
)

text = """
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.
"""

res = llm.embed(text)

Here is some info about the output (print results shown in comments):

print(len(res))
# 428

print(type(res), type(res[0]))
# <class 'list'> <class 'list'>

print(len(res[0]))
# 4096

For the model https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct, I think I should get a single embedding of length 4096, but instead I got a list of 428 embeddings.

yentur commented 5 months ago

I think 428 is the number of tokens and len(res[0]) is the embedding dimension. The length of res depends on how many tokens you enter, while the length of each element of res depends on the embedding dimension.
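
One quick way to check this is to count the tokens directly (a sketch, assuming the same llm instance from the snippet above):

# Llama.tokenize takes bytes; the count should line up with len(res)
# (possibly off by one depending on BOS handling).
n_tokens = len(llm.tokenize(text.encode("utf-8")))
print(n_tokens)  # expected to be ~428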

shizidushu commented 5 months ago

@yentur Following your comment, that means one embedding per token. But I would expect the 428 token embeddings to be pooled into a single embedding. (Refer to https://www.mongodb.com/developer/products/atlas/choose-embedding-model-rag/#choosing-the-right-embedding-model-for-your-rag-application)
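
For illustration, one common way to pool token-level embeddings into a single vector is mean pooling (a sketch using numpy, which is an assumption here; whether mean pooling is appropriate depends on how the model was trained):

import numpy as np

# res from above: 428 token-level vectors, each of length 4096.
emb = np.asarray(res)       # shape (428, 4096)
pooled = emb.mean(axis=0)   # shape (4096,): one vector for the whole text
print(pooled.shape)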

iamlemec commented 4 months ago

It looks as though this particular model uses last-token pooling, which isn't currently supported by llama.cpp. It would be super easy to add, since we already have first-token pooling; it just hasn't come up with other models yet. The other option is to get the token-level embeddings and pick off the last token's embedding manually.
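
For reference, picking off the last token's embedding from the res list above might look like this (a minimal sketch; the L2 normalization step is an assumption, common for retrieval models but not specified here):

import numpy as np

# res holds one embedding per token; last-token pooling takes the final one.
last = np.asarray(res[-1])            # shape (4096,)
last = last / np.linalg.norm(last)    # optional L2 normalization (assumption)
print(last.shape)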