ggerganov / llama.cpp

LLM inference in C/C++

Bug: quality decreases in embeddings models #9695

Closed Maxon081102 closed 2 weeks ago

Maxon081102 commented 2 months ago

What happened?

I tried to use the Jina model with mean pooling, an SBERT model as in the tutorial, and my custom BERT with a GPT-2 tokenizer and mean pooling. Each time the quality dropped: the average cosine similarity between the llama.cpp embeddings and the original embeddings was between 0.7 and 0.9. But when I do everything exactly according to the tutorial (I used the Arctic model for getting results), the cosine similarity is 0.999.

It seems that the quality should not drop like this. Can you please tell me what is wrong? Maybe there are problems with using unusual BERT models?

I converted the models on Linux and run them on Windows with Ollama.

A simple file.py for getting results:

import requests
import json
import torch
import logging
import numpy as np
from typing import List, Dict
from transformers import AutoTokenizer, AutoModel
from numpy.linalg import norm
from tqdm import tqdm
# cosine similarity between two 1-D vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

from datasets import load_dataset

logger = logging.getLogger(__name__)

device = "cpu"

class MyModel:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True, add_pooling_layer=False).to(device)
        self.model_name = model_name
        self.tokenizer.add_eos_token = False

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # average the token embeddings over non-padding positions
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def encode_text(self, texts: List[str], batch_size: int = 12, max_length: int = 128) -> np.ndarray:
        logging.info(f"Encoding {len(texts)} texts...")

        embeddings = []
        for i in tqdm(range(0, len(texts), batch_size), desc="Encoding batches", unit="batch"):
            batch_texts = texts[i:i+batch_size]
            encoded_input = self.tokenizer(batch_texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt", add_special_tokens=False).to(device)
            with torch.no_grad():
                model_output = self.model(**encoded_input)
            batch_embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])
            embeddings.append(batch_embeddings.cpu())

        embeddings = torch.cat(embeddings, dim=0)

        if embeddings is None:
            logging.error("Embeddings are None.")
        else:
            logging.info(f"Encoded {len(embeddings)} embeddings.")

        return embeddings.numpy()

ds = load_dataset("alpindale/light-novels")
data = []
for text in tqdm(ds['train'][:1000]['text']):
    if text == "" or text == " ":
        continue
    data.append(text)
url = "http://localhost:11434/api/embeddings"
model = MyModel(model_name='jinaai/jina-embeddings-v2-small-en')

def check(text):
    prompt = text[:768]
    # embedding from the GGUF model served by Ollama (llama.cpp backend)
    data = {
        "model": "jina_small",
        "prompt": prompt,
    }
    json_data = json.dumps(data)

    headers = {"Content-Type": "application/json"}
    response = requests.post(url, data=json_data, headers=headers)

    emb = json.loads(response.text)['embedding']
    emb = np.array(emb)

    # reference embedding from the original transformers model
    res = model.encode_text([prompt])
    return emb, res[0]

res = []
for text in tqdm(data):
    emb, pred = check(text)
    res.append(cos_sim(emb, pred))
print(np.mean(res), len(data))
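
A possible cross-check would be to query a llama-server instance directly, bypassing Ollama, to see whether the discrepancy comes from llama.cpp itself or from the Ollama layer. This is only a sketch: the server flags, port, model name, and the OpenAI-compatible /v1/embeddings endpoint are assumptions that should be verified against your build.

# Sketch only: assumes a server started roughly as
#   ./llama-server -m jinaai_small_test.gguf --embedding --pooling mean --port 8080
# and reuses `model` and `cos_sim` from the script above.
def check_llama_server(text, base_url="http://localhost:8080"):
    prompt = text[:768]
    payload = {"input": prompt, "model": "jinaai_small_test"}
    response = requests.post(f"{base_url}/v1/embeddings", json=payload)
    emb = np.array(response.json()["data"][0]["embedding"])
    ref = model.encode_text([prompt])[0]
    return cos_sim(emb, ref)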

Name and Version

llama.cpp: build: 3771 (acb2c32c) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

ollama version is 0.1.41

What operating system are you seeing the problem on?

No response

Relevant log output

python llama.cpp/convert_hf_to_gguf.py models--jinaai--jina-embeddings-v2-small-en/snapshots/796cff318cdd4e5fbe8b7303a1ef8cbec36996ef --outfile jinaai_small_test.gguf --outtype f16
INFO:hf-to-gguf:Loading model: 796cff318cdd4e5fbe8b7303a1ef8cbec36996ef
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd_norm.bias,           torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:token_embd_norm.weight,         torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:token_types.weight,             torch.float16 --> F32, shape = {512, 2}
INFO:hf-to-gguf:token_embd.weight,              torch.float16 --> F16, shape = {512, 30528}
INFO:hf-to-gguf:blk.0.attn_output_norm.bias,    torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_output_norm.weight,  torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_output.bias,         torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_output.weight,       torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.0.attn_k.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_k.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.0.attn_q.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_q.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.0.attn_v.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.attn_v.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,          torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.0.ffn_up.weight,            torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.0.layer_output_norm.bias,   torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.layer_output_norm.weight, torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.ffn_down.bias,            torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.0.ffn_down.weight,          torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.1.attn_output_norm.bias,    torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_output_norm.weight,  torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_output.bias,         torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_output.weight,       torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.1.attn_k.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_k.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.1.attn_q.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_q.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.1.attn_v.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.attn_v.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,          torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.1.ffn_up.weight,            torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.1.layer_output_norm.bias,   torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.layer_output_norm.weight, torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.ffn_down.bias,            torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.1.ffn_down.weight,          torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.2.attn_output_norm.bias,    torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_output_norm.weight,  torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_output.bias,         torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_output.weight,       torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.2.attn_k.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_k.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.2.attn_q.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_q.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.2.attn_v.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.attn_v.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,          torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.2.ffn_up.weight,            torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.2.layer_output_norm.bias,   torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.layer_output_norm.weight, torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.ffn_down.bias,            torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.2.ffn_down.weight,          torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.3.attn_output_norm.bias,    torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_output_norm.weight,  torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_output.bias,         torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_output.weight,       torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.3.attn_k.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_k.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.3.attn_q.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_q.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.3.attn_v.bias,              torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.attn_v.weight,            torch.float16 --> F16, shape = {512, 512}
INFO:hf-to-gguf:blk.3.ffn_gate.weight,          torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.3.ffn_up.weight,            torch.float16 --> F16, shape = {512, 2048}
INFO:hf-to-gguf:blk.3.layer_output_norm.bias,   torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.layer_output_norm.weight, torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.ffn_down.bias,            torch.float16 --> F32, shape = {512}
INFO:hf-to-gguf:blk.3.ffn_down.weight,          torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 512
INFO:hf-to-gguf:gguf: feed forward length = 2048
INFO:hf-to-gguf:gguf: head count = 8
INFO:hf-to-gguf:gguf: layer norm epsilon = 1e-12
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type unk to 100
INFO:gguf.vocab:Setting special token type sep to 102
INFO:gguf.vocab:Setting special token type pad to 0
INFO:gguf.vocab:Setting special token type cls to 101
INFO:gguf.vocab:Setting special token type mask to 103
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:jinaai_small_test.gguf: n_tensors = 68, total_size = 64.9M
INFO:hf-to-gguf:Model successfully exported to jinaai_small_test.gguf
jpohhhh commented 2 months ago

What does quality dropping mean?

If I understand right, you're comparing a custom embedding model to Jina, and the custom embedding model is worse than Jina. If that is true, it sounds like a model issue, not a llama.cpp issue.

Maxon081102 commented 2 months ago

What does quality dropping mean?

If I understand right, you're comparing a custom embedding model to Jina, and the custom embedding model is worse than Jina. If that is true, it sounds like a model issue, not a llama.cpp issue.

This means that when I convert a model to GGUF, the quality drops on metrics such as MRR and others.
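
For context, mean reciprocal rank (MRR) averages, over all queries, the reciprocal rank of the first relevant document, so it degrades quickly when the converted embeddings reorder the top results. A minimal illustrative computation (not the reporter's evaluation code):

# Illustrative MRR: 1 / rank of the first relevant hit, averaged over queries.
def mean_reciprocal_rank(ranked_relevance):
    # ranked_relevance: per-query lists of 0/1 relevance flags in ranked order
    rr = []
    for flags in ranked_relevance:
        rank = next((i + 1 for i, rel in enumerate(flags) if rel), None)
        rr.append(1.0 / rank if rank is not None else 0.0)
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75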

No, I use the same model via transformers and via llama.cpp, and for some reason the llama.cpp embeddings are very different, even though I converted everything according to the tutorial.
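
Since the same weights only diverge after conversion, one diagnostic worth trying is to inspect the metadata written into the GGUF, because missing or mismatched tokenizer or pooling settings are a common cause of embedding discrepancies. A sketch using the gguf Python package from llama.cpp's gguf-py (the key-name filter is just a guess at what to look for):

# Sketch: list tokenizer- and pooling-related metadata keys in the converted file.
from gguf import GGUFReader

reader = GGUFReader("jinaai_small_test.gguf")
for name in reader.fields:  # fields is keyed by the GGUF metadata key name
    if "tokenizer" in name or "pooling" in name:
        print(name)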

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.