Open r17652001 opened 3 weeks ago
Hi @r17652001
Could you please share which filter you used and how you created the vector store in each case? I see that you're accessing Chroma through the LangChain API, which may load the Chroma index and pass parameters such as ef_search differently than we might expect.
A minimal example for each case would be useful to help us debug.
My filter conditions are as follows (using the hypothetical field name 'createuser' as an example): {'createuser': {'$in': ['android', 'bot']}}
I create the vector data as follows. Does this help?
`
import logging
import os

import chromadb
from chromadb.utils import embedding_functions

# Assumption: the original logger is a standard logging.Logger.
logger = logging.getLogger(__name__)

# get_model_configuration() and get_chromadb_id() are project-specific
# helpers that are not shown in this snippet.
gpt_emb_config = get_model_configuration('text-embedding-ada-002')


def chromadb_add(metadata, text, total_count=1):
    logger.info('===== Add from vector database =====')
    chroma_client = chromadb.PersistentClient(path=os.getenv("CHROMADB_PATH"))
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=gpt_emb_config['api_key'],
        api_base=gpt_emb_config['api_base'],
        api_type=gpt_emb_config['openai_type'],
        api_version=gpt_emb_config['api_version'],
        deployment_id=gpt_emb_config['deployment_name'],
    )
    collection = chroma_client.get_or_create_collection(
        name="IMS_FILE",
        metadata={"hnsw:space": "cosine"},
        embedding_function=openai_ef)
    # Build sequential IDs by incrementing the 4-digit suffix of the base ID.
    base_id = get_chromadb_id()
    ids = []
    for i in range(total_count):
        ids.append(f"{base_id[:-4]}{int(base_id[-4:]) + i:04}")
    collection.add(
        documents=text,
        metadatas=metadata,
        ids=ids
    )
`
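For reference, the ID scheme in the snippet above can be exercised on its own. This is only a minimal sketch: it assumes `get_chromadb_id()` returns a string whose last four characters are a zero-padded counter, and both `make_ids` and the sample value `"IMS20240007"` are hypothetical names used for illustration.

```python
def make_ids(base_id: str, total_count: int = 1) -> list[str]:
    """Return total_count sequential IDs, starting at base_id itself."""
    # Split the ID into a fixed prefix and a numeric 4-character suffix.
    prefix, counter = base_id[:-4], int(base_id[-4:])
    # ':04' pads to at least four digits; larger counters simply grow wider.
    return [f"{prefix}{counter + i:04}" for i in range(total_count)]


ids = make_ids("IMS20240007", 3)  # hypothetical base ID
print(ids)  # ['IMS20240007', 'IMS20240008', 'IMS20240009']
```

Note that `:04` is a minimum width, so a suffix that rolls past 9999 produces a five-digit counter rather than wrapping. When `documents`, `metadatas`, and `ids` are passed to `collection.add` as lists, they need to have the same length.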
What happened?
I've run into a few questions recently while testing document retrieval.
When I don't add any filter conditions, the query returns the content of Document B [Result 1], which has a lower score and isn't the document I consider correct. Occasionally, in about 1-2 out of 10 queries, it does find the correct document. After adding filter conditions, it found the correct document 100% of the time during testing [Result 2]. Is there any way to ensure that the correct document is found without adding filter conditions?
When I added an extra parameter ("hnsw:search_ef": 1000) while creating the vector store, the documents returned were as in [Result 3]. Even after I removed this parameter, the results stayed the same (tested about 20-30 times). Later, I restored the backup of chroma.sqlite3, and the results reverted to the original [Result 1]. Does adding this parameter affect the content stored in the vector store? Subsequent adjustments to hnsw:construction_ef and hnsw:M also still produced [Result 3].
P.S. There are currently approximately 32,501 records.
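One way to narrow down the first question is to compare the index results against an exact, brute-force ranking. The sketch below is not Chroma code. It is a plain-Python baseline under these assumptions: the embeddings and metadata have already been exported (for example with `collection.get(include=["embeddings", "metadatas"])`), the collection uses cosine distance, which is 1 minus cosine similarity, and no embedding is the zero vector. The function names and the toy 2-D vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (vectors must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_top_k(query, records, k=1, allowed_users=None):
    """Rank every record by exact cosine distance, with no approximation.

    records: list of (id, embedding, metadata) tuples.
    allowed_users: optional set mirroring {'createuser': {'$in': [...]}}.
    """
    candidates = [
        (cosine_distance(query, emb), rec_id)
        for rec_id, emb, meta in records
        if allowed_users is None or meta.get("createuser") in allowed_users
    ]
    # Smallest distance first; ties are broken by ID.
    return sorted(candidates)[:k]


# Toy data standing in for exported embeddings and metadata.
records = [
    ("doc_a", [1.0, 0.0], {"createuser": "android"}),
    ("doc_b", [0.6, 0.8], {"createuser": "web"}),
    ("doc_c", [0.0, 1.0], {"createuser": "bot"}),
]
query = [0.5, 0.85]

print(exact_top_k(query, records, k=1))
print(exact_top_k(query, records, k=1, allowed_users={"android", "bot"}))
```

If the exact ranking without a filter also puts Document B first, the unfiltered result reflects the embeddings themselves rather than the approximate index. If the exact ranking puts the expected document first, the approximate search settings are the more likely cause. At roughly 32,501 records a pure-Python loop will be slow, so a vectorized library would be a reasonable substitute.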
Versions
chromadb==0.4.22, python==3.10.12, Ubuntu 22.04.3 LTS
Relevant log output