coexplain / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Scores in the retrieved nodes are in reversed order in the Weaviate integration #1

Open williamhub opened 1 month ago

williamhub commented 1 month ago

Bug Description

Hello, I was using the retriever from a vector store index that was initialized from a Weaviate collection. I noticed that the retrieved nodes have scores in reversed order: the first (most relevant) node has a score of zero, and the score increases as we move toward the least relevant nodes.

We found in the code that LlamaIndex computes 1 - score, where score is the value that Weaviate returns. However, Weaviate now returns a similarity score rather than a distance. I think that a distance can be returned instead of a similarity only in vector search, not in hybrid search (see here). You can use the code below (from a Jupyter Notebook) to compare the scores that LlamaIndex gives with the scores that Weaviate returns.
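The effect of that subtraction can be seen in a minimal sketch. It assumes the integration applies `1 - score` to every value Weaviate returns; the list is copied from the Weaviate output in the reproduction further down, and the variable names are illustrative rather than taken from the LlamaIndex source.

```python
# Hybrid-search scores as returned by Weaviate (higher means more relevant).
weaviate_scores = [
    1.0,
    0.08082851022481918,
    0.07240726053714752,
    0.07034952938556671,
    0.0660715326666832,
    0.06566160172224045,
    0.06270790100097656,
    0.05426621064543724,
    0.05242578685283661,
    0.05184878036379814,
]

# Assumed conversion: each value is treated as a distance and inverted.
converted = [1 - s for s in weaviate_scores]

print(converted[0])                     # 0.0 for the most relevant hit
print(converted == sorted(converted))   # True: scores now rise as relevance falls
```

The exact values differ slightly from the LlamaIndex output in the reproduction, but the pattern is the same: the best hit lands at 0.0 and the remaining scores climb toward 1.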

Version

llama-index==0.10.53
llama-index-vector-stores-weaviate==1.0.0
weaviate-client==4.6.5

Steps to Reproduce

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.core.schema import TextNode

from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser

import weaviate
import os

from transformers import AutoTokenizer, AutoModel
import tiktoken
import requests
from IPython.display import Markdown, display

# In[ ]:
# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPEN_AI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode

# In[ ]:
tokenizer_obj = tokenizer
# The chunk_size must be compatible with the maximum sequence length of the embedding model in use.
chunk_size = 450
chunk_overlap = 50
# Initialize a node parser that we will use in the documents parsing.
# First initialize the TokenCountingHandler with our tokenizer and the CallbackManager with our token counter.
# And then the node parser.
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size,
                                                  chunk_overlap=chunk_overlap,
                                                  callback_manager=callback_manager)

# In[66]:
client = weaviate.connect_to_local()

# In[127]:
# The collection has already been created, so we just connect to it.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)

# In[128]:
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                        embed_model=embed_model,
                                                        transformations=[node_parser],
                                                        show_progress=True)

# In[100]:
def get_wikipedia_article_text(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "extracts", "explaintext": True, "titles": title}
    response = requests.get(url, params=params).json()
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract", "Article not found.")

python_doc_text = get_wikipedia_article_text("Python (programming language)")
lion_doc_text = get_wikipedia_article_text("Lion")
lion_paragraph = lion_doc_text[:1000]

# In[25]:
python_doc = Document(doc_id='1',
                      text=python_doc_text,
                      metadata={
                           "title_of_parental_document": "Python_(programming_language)",
                           "source": "https://en.wikipedia.org/wiki/Python_(programming_language)"
                       })

# In[101]:
lion_doc = Document(doc_id='2',
                    text=lion_paragraph,
                    metadata={
                       "title_of_parental_document": "Lion",
                       "source": "https://en.wikipedia.org/wiki/Lion"
                   })

# In[104]:
vector_store_index.insert(document=python_doc)
vector_store_index.insert(document=lion_doc)

# In[129]:
retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.5)
nodes = retriever.retrieve("What is lion?")

# In[131]:
# The retriever always returns a list of nodes in descending order of score (most relevant chunks first).
# So why does the most relevant chunk have a score of zero here?
for node in nodes:
    print(node.text)
    print()
    print(node.score)
    print("__________________________________________________________________________________________________________")
    print("__________________________________________________________________________________________________________")

print([node.score for node in nodes])
# The scores are: [0.0, 0.9217832833528519, 0.9288488179445267, 0.9365298748016357, 0.937725093215704,
#                  0.9396311119198799, 0.9409564286470413, 0.9446112886071205, 0.9455222226679325, 0.9476451091468334]

# In[108]:

# Code to query Weaviate without LlamaIndex.
query = "what is lion?"
query_vector = embed_model.get_query_embedding(query=query)

# In[121]:
# query_vector

# In[123]:
from weaviate.classes.query import MetadataQuery

# Get a handle to the same collection that the vector store uses ("Test").
collection = client.collections.get("Test")
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    return_metadata=MetadataQuery(distance=True,
                             certainty=True,
                             score=True,
                             explain_score=True),
    alpha=0.50,
    limit=10,
)
x = []
for o in response.objects:
    print(o.properties)
    print()
    print(o.metadata)
    print("_______________")
    x.append(o.metadata.score)

print(x)

# Scores from weaviate:
# [1.0, 0.08082851022481918, 0.07240726053714752, 0.07034952938556671, 0.0660715326666832,
#  0.06566160172224045, 0.06270790100097656, 0.05426621064543724, 0.05242578685283661, 0.05184878036379814]

Relevant Logs/Tracebacks

No response

coexplain[bot] commented 1 month ago

Estimated Time

12 hours

Related Files

llama-index-integrations/vector_stores/llama-index-vector-stores-weaviate/llama_index/vector_stores/weaviate/utils.py
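
One possible direction for a fix, sketched under assumptions: the helper name `to_similarity` and its two parameters are hypothetical and are not taken from utils.py. The idea is only to pass a hybrid-search score through unchanged and to invert a value when it is known to be a distance.

```python
from typing import Optional


def to_similarity(score: Optional[float], distance: Optional[float]) -> float:
    """Hypothetical helper: return a value where higher means more relevant."""
    if score is not None:
        # Weaviate already reports a similarity-style score; keep it as-is.
        return score
    if distance is not None:
        # Only a true distance is inverted.
        return 1.0 - distance
    # Neither value was returned; fall back to a neutral score.
    return 0.0


print(to_similarity(score=1.0, distance=None))    # 1.0
print(to_similarity(score=None, distance=0.25))   # 0.75
```

With a conversion along these lines, the most relevant hybrid hit would keep its score of 1.0 instead of being mapped to 0.0.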