chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Searching documents consistently returns low-scoring results #2715

Open r17652001 opened 3 weeks ago

r17652001 commented 3 weeks ago

What happened?

I've run into a few issues recently while testing document search:

  1. When I don't add any filter conditions, the returned result is the content of Document B [Result 1], which has a lower score and isn't the one I consider correct. Occasionally, about 1-2 times out of 10 queries, the correct document is found. After adding filter conditions, the correct document is found 100% of the time in my tests [Result 2]. Is there any way to ensure the correct document is found without adding filter conditions?

  2. When I added an extra parameter ("hnsw:search_ef": 1000) while creating the vector store, the documents found were as in [Result 3]. Even after removing this parameter, the results stayed the same (tested about 20-30 times). After I restored a backup of chroma.sqlite3, the results reverted to the original [Result 1]. Does adding this parameter change the contents of the vector store? Subsequent adjustments to the parameters hnsw:construction_ef and hnsw:M still produced [Result 3].

P.S. There are currently approximately 32,501 records.

```python

import os
import logging

from langchain_community.vectorstores import Chroma
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

logger = logging.getLogger(__name__)

# get_model_configuration() is a helper defined elsewhere
gpt_emb_config = get_model_configuration('text-embedding-ada-002')
embeddings = AzureOpenAIEmbeddings(
    deployment = gpt_emb_config['deployment_name'],
    model = gpt_emb_config['model'],
    openai_api_key = gpt_emb_config['api_key'],
    azure_endpoint = gpt_emb_config['api_base'],
    openai_api_type = gpt_emb_config['openai_type'],
    openai_api_version = gpt_emb_config['api_version']
)

def chromadb_load():
    #vectorstore = Chroma("DOC_FILE", embeddings, collection_metadata={"hnsw:search_ef":1000}, persist_directory=os.getenv("CHROMADB_PATH"))
    vectorstore = Chroma("DOC_FILE", embeddings, persist_directory=os.getenv("CHROMADB_PATH"))
    return vectorstore
def answer_question(question="who am I?", useremail="", filter_group_flag=True, tag_list={}, rag_date=""):
    #init vectordb & retriever
    vectorstore = chromadb_load()

    def querydb_node(state):
        question = state["question"]
        useremail = state["useremail"]

        # tag_exist and where_condition are built elsewhere from tag_list
        if tag_exist:
            docs = vectorstore.similarity_search_with_relevance_scores(
                query=question,
                k=int(os.getenv("RAG_TOP_N")),
                score_threshold=float(os.getenv("RAG_THRESHOLD")),
                filter=where_condition,
            )
        else:
            docs = vectorstore.similarity_search_with_relevance_scores(
                query=question,
                k=int(os.getenv("RAG_TOP_N")),
                score_threshold=float(os.getenv("RAG_THRESHOLD")),
            )
        logger.warning(f'docs:{docs}')

#RAG_THRESHOLD=0.8
#RAG_TOP_N="30"

```
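For context on the scores in the logs below: with `hnsw:space` set to cosine, LangChain's Chroma wrapper converts the raw cosine distance d into a relevance score of 1 - d, and `score_threshold` then drops anything below the cutoff before returning at most k results. A pure-Python sketch of that post-processing (the candidate distances here are invented for illustration):

```python
# Sketch of the relevance-score post-processing LangChain applies on
# top of Chroma results when the collection uses cosine distance.
# The candidate list below is invented for illustration.

def cosine_relevance(distance: float) -> float:
    # LangChain's cosine relevance function: 1.0 - distance
    return 1.0 - distance

def apply_threshold(candidates, k, score_threshold):
    # candidates: (document, cosine_distance) pairs, best first
    scored = [(doc, cosine_relevance(d)) for doc, d in candidates[:k]]
    return [(doc, s) for doc, s in scored if s >= score_threshold]

candidates = [
    ("The content of document B", 0.1905),  # relevance ~0.8095
    ("2.", 0.1919),                         # relevance ~0.8081
    ("unrelated chunk", 0.45),              # relevance 0.55, dropped
]

print(apply_threshold(candidates, k=30, score_threshold=0.8))
```

With RAG_THRESHOLD=0.8, everything returned will therefore sit in a narrow band just above 0.8, which matches the logs below.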

Versions

chromadb==0.4.22
python==3.10.12
Ubuntu 22.04.3 LTS

Relevant log output

[Result1] Did not filter the retrieved document information
docs:[(Document(page_content='The content of document B', metadata={'createuser': 'bot'}), 0.8094891309738159), (Document(page_content='2.', metadata={'createuser': 'bot'}), 0.8081222176551819)]

[Result2] Filtered document information(createuser='android')
docs:[(Document(page_content='The content of document A', metadata={'createuser': 'android'}), 0.864965558052063), (Document(page_content='1', metadata={'createuser': 'android'}), 0.8357427716255188), (Document(page_content='xx', metadata={'createuser': 'android'}), 0.8196576237678528), (Document(page_content='xx', metadata={'createuser': 'android'}), 0.8077755570411682)]

[Result3] Documents found by adding collection_metadata information when creating the vector store
docs:[(Document(page_content='The content of document C', metadata={'createuser': 'android'}), 0.806135558052063), (Document(page_content='1', metadata={'createuser': 'android'}), 0.8056345818827968), (Document(page_content='xx', metadata={'createuser': 'android'}), 0.8050076896580458), (Document(page_content='xx', metadata={'createuser': 'android'}), 0.80441291333186)]

atroyn commented 2 weeks ago

Hi @r17652001

Could you please provide information about what filter you used, and how you created the vector store in each case? I see that you're accessing Chroma through the Langchain API, which may load the Chroma index and pass parameters like ef_search differently than we might expect.

A minimal example for each case would be useful to help us debug.

r17652001 commented 2 weeks ago

My filter conditions are as follows (using the hypothetical field name 'createuser' as an example): {'createuser': {'$in': ['android', 'bot']}}
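For reference, `$in` matches any record whose field value appears in the given list. A pure-Python sketch of the matching semantics (the records below are made up for illustration):

```python
# Sketch of Chroma's `$in` where-filter semantics in plain Python.
# Supports only the {'field': {'$in': [...]}} form used above.

def matches_in(metadata: dict, where: dict) -> bool:
    for field, cond in where.items():
        if metadata.get(field) not in cond["$in"]:
            return False
    return True

records = [
    {"createuser": "android"},
    {"createuser": "bot"},
    {"createuser": "ios"},
]

where = {"createuser": {"$in": ["android", "bot"]}}
kept = [r for r in records if matches_in(r, where)]
print(kept)
```

With the filter applied, only the matching subset of records competes in the nearest-neighbor search, which is why the "correct" document wins consistently in [Result 2].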

The method for creating vector data is as follows. Does this help you?

```python

import os
import logging

import chromadb
from chromadb.utils import embedding_functions

logger = logging.getLogger(__name__)

gpt_emb_config = get_model_configuration('text-embedding-ada-002')
def chromadb_add(metadata, text, total_count=1):
    logger.info('===== Add from vector database =====')

    chroma_client = chromadb.PersistentClient(path=os.getenv("CHROMADB_PATH"))
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key = gpt_emb_config['api_key'],
        api_base = gpt_emb_config['api_base'],
        api_type = gpt_emb_config['openai_type'],
        api_version = gpt_emb_config['api_version'],
        deployment_id = gpt_emb_config['deployment_name']
    )
    collection = chroma_client.get_or_create_collection(
        name="IMS_FILE",
        metadata={"hnsw:space": "cosine"},
        embedding_function=openai_ef)

    # get_chromadb_id() is a helper defined elsewhere; the last four
    # characters are a zero-padded running number
    base_id = get_chromadb_id()
    ids = []
    for i in range(total_count):
        ids.append(f"{base_id[:-4]}{int(base_id[-4:]) + i:04}")

    collection.add(
        documents=text,
        metadatas=metadata,
        ids=ids
    )

```