Maybe related to https://github.com/ggerganov/llama.cpp/pull/5796
I think so. Hopefully it will be fixed by that.
Can you check the cosine distance of vector produced by embedding.cpp vs server.cpp?
Also maybe try without GPU offloading?
> Can you check the cosine distance of vector produced by embedding.cpp vs server.cpp?
> Also maybe try without GPU offloading?
I tried without GPU offloading and got the same output.
As for the cosine distance, I calculated the cosine distance between the word `prince` and a list of words `["king", "queen", "apple", "orange"]` and sorted the results:
from embedding.cpp output: [('king', 0.4116078336488638), ('queen', 0.4211467172721288), ('apple', 0.6682980126084468), ('orange', 0.6874219028515791)]
from server.cpp: [('orange', 0.009215513122614483), ('king', 0.009233457008902879), ('queen', 0.01777521161063844), ('apple', 0.020477966154721194)]
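For context, a minimal sketch of how such a ranking can be computed (the helper name and the `emb` lookup are illustrative, not from the thread; the vectors would come from `./embedding` or the server):

```python
import math

def cosine_distance(a, b):
    # 1 - cos(a, b); smaller means more similar
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(x * x for x in b)))

# emb = {"prince": [...], "king": [...], ...}   # one vector per word
# ranked = sorted([(w, cosine_distance(emb["prince"], emb[w]))
#                  for w in ["king", "queen", "apple", "orange"]],
#                 key=lambda t: t[1])
```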
There's currently a refactoring of the server code in progress; maybe this will be fixed by #5882.
It looks like this is actually a tokenization issue. I'm seeing the output of `/tokenize` as being pretty garbled. First, it doesn't appear to be adding a BOS token. We're currently not specifying the `add_bos_token` flag in the GGUF files for embeddings, so we might want to do that.

Second, it looks like something is up with `special_tokens_cache`. It seems to be adding in any token that is the concatenation of two other valid tokens, but that ends up being tons of regular words in addition to actual special tokens. The cache isn't used for regular embeddings, but the server seems to want to use it.

Edit: If you force it to add a BOS token and turn off special token processing, the tokenization comes out correct. And in that case the embedding numbers are correct too, though they're not normalized, so they won't look the same as the output from `embedding`.
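One way to see the garbled tokenization directly is to ask the server what it produced; a minimal sketch, assuming the server from the repro below is running with `--embedding` on port 8019 and exposes the usual `POST /tokenize` route:

```python
# Sketch: inspect the server's tokenization (assumes /tokenize takes
# {"content": ...} and returns {"tokens": [...]}).
import requests

resp = requests.post("http://localhost:8019/tokenize", json={"content": "prince"})
print(resp.json()["tokens"])
# For a BERT-style model like all-MiniLM-L6-v2 the list should start with the
# [CLS]/BOS token (id 101); its absence is the missing-BOS symptom described above.
```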
Yes, the `special` flag is always on in `server`:

And this seems to tokenize incorrectly. Not sure if this is somehow a problem with the vocab or if we simply need to turn off the `special` flag when using embedding models.
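For comparison, a sketch of the reference tokenization via Hugging Face transformers (assuming the original all-MiniLM-L6-v2 checkpoint is available; for a BERT-style model the ids should be bracketed by [CLS] and [SEP]):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
ids = tok("prince")["input_ids"]
print(ids)                              # expected shape: [101, ..., 102]
print(tok.convert_ids_to_tokens(ids))   # i.e. ['[CLS]', ..., '[SEP]']
```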
We should fix this, and the normalization, after we merge #5882.
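For reference, a minimal sketch of the L2 normalization mentioned above (plain Python, no llama.cpp specifics assumed):

```python
import math

def l2_normalize(v):
    # Scale the vector to unit length; cosine similarity is unchanged by this,
    # only the magnitude of the individual values is.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```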
Trying to figure out what's up with `special_tokens_cache` and looking through https://github.com/ggerganov/llama.cpp/pull/3538 for guidance. Most models I'm looking at seem to correctly label special tokens with `token_type`. Do we have any examples of models that fail to do this properly? Seems like the kind of thing that should be taken care of during GGUF conversion.
As I posted above, the embedding I got from `embedding.cpp` is the same as what I got from the original model, so I guess it's not a GGUF conversion issue. My observation is: with the SAME input and SAME GGUF model, embedding.cpp and server.cpp yield different output.
This issue was closed because it has been inactive for 14 days since being marked as stale.
System: Mac M2 Max, macOS Sonoma 14.2.1
llama.cpp version: latest main branch as of Feb 29, 2024

Steps to reproduce:
python convert-hf-to-gguf.py --outfile minilm.gguf --outtype f16 all-MiniLM-L6-v2
Output:
./server -ngl 99 -m minilm.gguf --port 8019 --host 0.0.0.0 --embedding
Output:
Expected Behavior: the embeddings from these two approaches should be the same.
Actual Behavior: as you can see, the output embedding looks completely different from the one in step 3; not only the values but also the scales differ.
By the way, the embedding output I get from step 3 is almost the same as the one I get from the sentence_transformers Python library, for example:
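A minimal sketch of that comparison (the checkpoint name is taken from the conversion step in the repro; the original output snippet is not reconstructed here):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode("prince")   # numpy vector to compare against step 3's output
print(emb[:5])
```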
This indicates that the model conversion works correctly. I think there's something wrong with the BERT embedding path in server mode.