chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: querying a persisted index returns an empty array #640

Closed dankolesnikov closed 1 year ago

dankolesnikov commented 1 year ago

What happened?

The following example uses langchain to successfully load documents into chroma and to successfully persist the data. Querying works as expected.

However, when we restart the notebook and attempt to query again without ingesting data and instead reading the persisted directory, we get [] when querying both using the langchain wrapper's method and chromadb's client (accessed from langchain wrapper). I've concluded that there is either a deep bug in chromadb or I am doing something wrong. Please help!

Code to reproduce

In the notebook paste:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document  # needed for the Document(...) calls below
import json

assembly_ai_output_file = "data_auto_chapters.json"

# Read the JSON file
with open(assembly_ai_output_file, "r") as json_file:
    json_data = json.load(json_file)

utterances = json_data['utterances']
# Strip the bulky "words" field from each utterance
data = [{key: value for key, value in x.items() if key != "words"} for x in utterances]
transcription_id = json_data['id']
docs = [Document(page_content=x['text'],
                 metadata={"transcription_id": transcription_id, "confidence": x['confidence'],
                           "start": x['start'], "end": x['end'], "speaker": x['speaker']})
        for x in data]

embedding = OpenAIEmbeddings()

persist_directory = 'db'
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory, collection_name="condense_demo")
vectordb.persist()

query = "what does the speaker say about raytheon?"

retrieved_docs = vectordb.similarity_search(query) # filter={"speaker": "B"}
retrieved_docs # data comes back

Now let's restart the notebook and run the following code, which I believe should work.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embedding = OpenAIEmbeddings()

vectordb = Chroma(persist_directory="db", embedding_function=embedding, collection_name="condense_demo")
query = "what does the speaker say about raytheon?"

retrieved_docs = vectordb.similarity_search(query) # filter={"speaker": "B"}
retrieved_docs # returns []

To make sure it's not a langchain issue, let's get the chroma client from the Chroma class and query the data at a lower level.

vectordb._client.list_collections() # looks good
vectordb._client.get_collection("condense_demo").peek() # yes there is data there!
query_vector = embedding.embed_query(query) # use langchain implementation of openai embedding algorithm
vectordb._client.get_collection("condense_demo").query(query_texts=[query], n_results=4) 
# returns {'ids': [[]],
# 'embeddings': None,
# 'documents': [[]],
# 'metadatas': [[]],
# 'distances': [[]]}

Am I doing something wrong? To make it easier to reproduce, I am attaching the JSON that is fed into the pipeline as a zip file (GitHub doesn't allow .json attachments), but I believe you can reproduce this with any document (my assumption).

data_auto_chapters.json.zip

Versions

Python 3.8.8, langchain 0.0.184, chromadb 0.3.25

Relevant log output

No response

jeffchuber commented 1 year ago

@dankolesnikov I just tried this example with the code and dataset you provided and it all seems to work great. This might be a weird notebook thing? Can you try this in python scripts? I tested it in 2 different notebooks - worked great!

Kefan-pauline commented 1 year ago

Faced the same issue.

I created an index using langchain in a notebook on one server, then zipped it, downloaded it, uploaded it to another server, unzipped it, and used it in a notebook there.
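The zip → transfer → unzip loop described here can be scripted with Python's stdlib. A minimal sketch (the directory name `langchainindex` comes from this thread; the placeholder parquet file just stands in for the real persisted contents):

```python
import os
import shutil

# Stand-in for the persisted index directory; the placeholder file
# represents the real chroma parquet contents.
os.makedirs("langchainindex", exist_ok=True)
with open("langchainindex/chroma-collections.parquet", "wb") as f:
    f.write(b"")

# Server 1: archive the whole persist directory, parquet files and all.
archive = shutil.make_archive("langchainindex", "zip",
                              root_dir=".", base_dir="langchainindex")

# ...download/upload the archive... then, on server 2:
shutil.unpack_archive(archive, "restored")
```

Archiving the whole directory keeps the index files and the parquet files together, which matters because chroma expects them side by side.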

Previously, I got NoIndexException while querying on the second server. I have repeated the index creation lots of times and finally got one version that works.

Now I upgraded the chromadb to the latest version, and got an empty array on the second server. I will try to repeat the index creation and hope for one version that will work...

For the index creation, I use:

from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator(embedding=embed_model, text_splitter=splitter, vectorstore_kwargs={"persist_directory": 'langchainindex'}).from_documents(documents_list)
index.vectorstore.persist()

On the second server, I use:

from utils import VectorstoreIndexCreator
index = VectorstoreIndexCreator(embedding=embed_model).from_persistent_index('langchainindex')
index.vectorstore.persist()

Note that I have changed VectorstoreIndexCreator a bit:

class VectorstoreIndexCreator(BaseModel):

    # Existing code ....

    def from_persistent_index(self, path: str) -> VectorStoreIndexWrapper:
        """Load a vectorstore index from a persistent index."""
        vectorstore = self.vectorstore_cls(persist_directory=path, embedding_function=self.embedding)
        return VectorStoreIndexWrapper(vectorstore=vectorstore)
jeffchuber commented 1 year ago

@Kefan-pauline can you provide a minimal reproducible example? it is hard to figure out what is happening here 🤔

dankolesnikov commented 1 year ago

@Kefan-pauline why would you call .persist() when reading from an existing index since it has been already persisted? Could it lead to duplicate data?

@jeffchuber It is not a notebook issue as I initially ran into this bug in the python script. My workflow is:

  1. Create and persist the index in the notebook.
  2. Copy the db folder that contains index and its data that was created in step 1 and paste in python server.
  3. Run the code to query that index.

Jeff, when you reproduced my example - did you restart the kernel clearing all output cells before running the second piece of code I provided? Things are expected to work if you didn't restart the kernel.

Kefan-pauline commented 1 year ago

I just got one version that works. It was either chance, or because I did things too fast previously; leaving some time between the different steps might help.

@dankolesnikov I think .persist() is no longer required when reading, as of the latest version. Now when I restart the kernel, chroma-collections.parquet and chroma-embeddings.parquet get updated automatically, which was not the case before.
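A quick way to verify whether a persist actually rewrote those files is to compare modification times. A minimal sketch, using the two file names mentioned in this thread (the demo directory is just a stand-in for a real persist_directory):

```python
import os
import tempfile
import time

PARQUET_FILES = ["chroma-collections.parquet", "chroma-embeddings.parquet"]

def persisted_since(persist_dir, t0):
    """Map each parquet file present in persist_dir to whether it was
    modified after timestamp t0 (seconds since the epoch)."""
    out = {}
    for name in PARQUET_FILES:
        path = os.path.join(persist_dir, name)
        if os.path.exists(path):
            out[name] = os.path.getmtime(path) > t0
    return out

# Demo with a scratch directory standing in for the persist_directory:
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "chroma-embeddings.parquet"), "wb").close()
report = persisted_since(demo_dir, time.time() - 60)
```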

lexsf commented 1 year ago

I'm getting this as well, @jeffchuber. As long as the process is running, I can create a chroma client and read the persisted data. If the process ends and I try to re-read the data, it reads a collection with a count of 1. I'm on . . . Windows . . . which I know, long story.

Also, I'm not using LangChain, I'm using raw Chroma.

jeffchuber commented 1 year ago

@lexsf @Kefan-pauline @dankolesnikov 😢

I'd really like to help here but I can't reproduce it.

I created this repo

https://github.com/jeffchuber/chroma_debugging

The folder one uses load_data.ipynb to load in a data set; use.py and use.ipynb demonstrate how to load that data back.

Then I also copied and pasted the folder into two and created a flask app (python main.py), and it also seems to work...

dankolesnikov commented 1 year ago

@jeffchuber I will run through your repo right now and will report back on whether it works for me or not. Quick question: in your code below you specify chroma_db_impl="duckdb+parquet"; is that important? With langchain, users never touch that parameter.

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./data",
))
dankolesnikov commented 1 year ago

@jeffchuber When I ran the first notebook in your repo, I got the following error when running the last cell; I also saw it previously. Have you seen that before? Am I missing something?

[screenshot: error output from the last cell]

dankolesnikov commented 1 year ago

@jeffchuber I think I am able to narrow the bug / inconsistency. I've recorded a detailed loom, you can see the bug in all its glory starting 3:05: https://www.loom.com/share/5b7307cdbe244ea7868092989d2172a3

I had a hard time triggering it, but I think I found the pattern. If you watch the loom video in full, you can see the pattern below in action.

Works fine:

  1. Create and persist data in chroma.
  2. Restart the kernel.
  3. Read from the persisted folder. All good.

Doesn't work:

  1. Create and persist data in chroma.
  2. Delete the folder with the persisted data, without restarting the kernel.
  3. Recreate the folder.
  4. Restart the kernel (if you want).
  5. Attempt to read from the persisted folder: you will get [].

From then on, any time you restart the kernel you can't read from the folder; you will receive [] when querying.

dankolesnikov commented 1 year ago

@jeffchuber do you have any hunches? I have a few, but tbh this still doesn't explain why things didn't work for me the very first time.

jeffchuber commented 1 year ago

@dankolesnikov I think this is the problem

https://github.com/chroma-core/chroma/blob/main/chromadb/db/duckdb.py#L412

I think the atExit magic for saving gets called inconsistently on process exit, especially in the case of many clients being generated.

Let me try what you said too...

HammadB commented 1 year ago

I suspect this bug is due to langchain implicitly creating many different chroma clients under the hood, which will not work. The chroma client should be treated as a singleton, and we should separately add some validation for this.

What is happening is that these separate clients are stomping on each other atExit.

Langchain supports passing in a client OR the settings/persist directory. Can people experiencing this issue instead pass the client into Chroma.from_documents and similar functions? This should resolve your issue. Thanks!

Apologies for the confusion.
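The "treat the client as a singleton" advice above can be enforced in application code by memoizing client construction. A minimal sketch of the pattern (`FakeChromaClient` is a hypothetical stand-in; in real code you would construct an actual `chromadb.Client` there and pass the shared instance as the client into `Chroma.from_documents` and similar functions):

```python
from functools import lru_cache

class FakeChromaClient:
    """Stand-in for chromadb.Client (hypothetical); in real code,
    construct the actual client with its settings here instead."""
    def __init__(self, persist_directory):
        self.persist_directory = persist_directory

@lru_cache(maxsize=None)
def get_client(persist_directory="db"):
    # Memoized: every caller asking for the same directory shares one
    # instance, so exit-time persistence hooks register only once and
    # separate clients can't stomp on each other's writes.
    return FakeChromaClient(persist_directory)
```

With this shape, every part of the application calls `get_client(...)` instead of constructing its own client (directly or implicitly through a langchain helper).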

dankolesnikov commented 1 year ago

That makes sense, guys! Thank you. Please advise whether a langchain ticket should be created, or if there is an existing one? I think we can close this one. @jeffchuber I think this is an important area to fix, because people use langchain a lot to interface with vector dbs, and this results in folks churning from Chroma.

Does Chroma team own the development and maintenance of the Chroma implementation in Langchain?

jeffchuber commented 1 year ago

Just chatted with Harrison about a few things earlier today! We own the Chroma implementation (with a ton of community help), but this is going to be hard for us to enforce... it seems the real "offending" methods are from_documents and from_texts, since those are the footgun here.

I'll need to think more about this.

@dankolesnikov any ideas?

dankolesnikov commented 1 year ago

I will noodle on this more and can propose a solution, happy to help contribute on this front, I am familiar with the Chroma implementation in langchain partially due to the work on Auto Evaluator: https://autoevaluator.langchain.com/

jeffchuber commented 1 year ago

@dankolesnikov that would be really great! 🎉

satishmaddula commented 1 year ago

@dankolesnikov @jeffchuber I am also facing the same issue. I am using chromadb with langchain; is there any temporary fix you guys would suggest for this?

sbslee commented 1 year ago

Hello team. Any updates on this? Another langchain user with the same issue. Thanks for your hard work!

jeffchuber commented 1 year ago

@satishmaddula @sbslee does this help? https://docs.trychroma.com/troubleshooting#your-index-resets-back-to-just-a-few-number-of-records

sbslee commented 1 year ago

@jeffchuber,

Thanks for the reply. The link does give me some insight as to why this problem occurs. I'm new to both langchain and chromadb so I will need to study more, but can I ask a follow-up question?

This problem seems to be Windows-specific (another user in this thread also appears to be a Windows user): I'm a Mac user and this has never happened to me, but it always happens to my coworker, who uses Windows. The problem also seems somewhat random, because it's really difficult to reproduce: once you create a vector store, the chatbot runs perfectly fine for a while, but then all of a sudden it stops retrieving requested data, my coworker is forced to recreate the vector store, and the cycle goes on. I know it's a pain to address an issue that is not reproducible, which is why I haven't raised any issues in this repo (I just told my coworker to keep recreating the vector store), but I just came across this post and thought I would chime in. (My coworker and) I would greatly appreciate it if you could address these concerns.

To give you more specific context of our langchain + chromadb implementation, I will share my code:

https://github.com/sbslee/kanu/blob/d1ae859b81413c19dae446a2d7b872926205c56e/kanu/docgpt.py#L125-L135

Basically, I'm building a GUI for various chatbots, and one of them, called DocGPT, allows you to chat with your documents. Users have a choice to either create a new database or use an existing database. Everything works fine for me, but whenever my coworker tries to use an existing database, 10% of the time the chatbot goes blind to the database and keeps complaining that the requested data is not there, even though it gave a perfect answer a minute ago. When this happens, the chatbot is gone, because no matter how many times my coworker tries to connect to the database, the requested data is not there. Of course, I suspected that maybe some data were overwritten, as the provided link suggests, but when I look at the index and parquet files (chroma-collections.parquet and chroma-embeddings.parquet) on my coworker's computer, the last-modification timestamp is unchanged, indicating to me that no data were overwritten.

jeffchuber commented 1 year ago

@sbslee that same small snippet of code works for you, but not your coworker on windows?

sbslee commented 1 year ago

@jeffchuber, there is a little more nuance, but basically, yes. Note again that the issue doesn't happen 100% of the time even for my coworker (roughly 10%). We are testing to use a different vector store and see if the same problem occurs.

sbslee commented 1 year ago

@jeffchuber,

I stand corrected. Previously, I mentioned:

Of course, I suspected that maybe some data were overwritten as the provided link suggests but when I look at the index and parquet files (chroma-collections.parquet and chroma-embeddings.parquet) of my coworker's computer the time mark of last modification is unchanged indicating to me that no data were overwritten.

This is not true! I made my coworker monitor carefully the contents of the chroma-collections.parquet and chroma-embeddings.parquet files. We found that when the chatbot was no longer able to retrieve requested data (reminding you, this does not happen 100% of the time but when it does happen the chatbot is gone for good), some of the contents of the chroma-embeddings.parquet file were indeed deleted. We are still investigating what triggers this data deletion event, but I agree with you 100% this issue is tightly associated with the link you suggested. I will give you an update when we discover what prompts this issue.

jeffchuber commented 1 year ago

@sbslee we are releasing an updated version of chroma next wednesday that will persist on every write and will totally avoid this weird runtime behavior. stay tuned!

sbslee commented 1 year ago

We are still investigating what triggers this data deletion event

I can confirm that this was triggered when my coworker tried to re-connect to the database while the temp parquet files were being processed (some of his documents were huge, and it took some time to process these files). The reason it appeared random at first was that we didn't realize these temp files existed (they only exist, well, temporarily, so we never spotted them). But once he started waiting for these temp files to clear before re-connecting to the database, the problem went away. Hope this helps future users.
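The "wait for the temp files to clear before re-connecting" workaround can be automated with a small polling loop. A sketch; the `"*.tmp"` pattern is an assumption, so match it to whatever temp files your chroma version actually writes alongside the parquet files:

```python
import glob
import os
import time

def wait_for_temp_files(persist_dir, pattern="*.tmp", timeout=300.0, poll=1.0):
    """Block until no files matching `pattern` remain in persist_dir,
    or until `timeout` seconds pass. Returns True once the directory
    is clear, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if not glob.glob(os.path.join(persist_dir, pattern)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Calling this before reconstructing the vector store gives the previous writer a chance to finish flushing, which is exactly the race described here.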

jeffchuber commented 1 year ago

@sbslee oh this is great to know! our release later today (announcing monday) switches to incremental persisting and will avoid this

jeffchuber commented 1 year ago

Closing this as it is stale. Please let me know if anything else pops up here and we can re-open it.

bzr1 commented 5 months ago

This happened again on my end, @jeffchuber. Somehow, when I run the code below to try to load a saved vector store, it creates a new folder. (By that I mean: the chroma folder used to have 2 items, the chroma.sqlite3 file and a folder containing header.bin, index_metadata.pickle, and such; loading creates an empty folder with a name similar to that folder.) There is nothing in the empty folder, and even if I delete the new folder, it still does not work.

from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

local_model_path = '/content/drive/MyDrive/AIOPS_RAG_Utils/distiluse-base-multilingual-cased-v1'
embedding_function = SentenceTransformerEmbeddings(model_name=local_model_path)
vectorstore = Chroma(persist_directory="/content/drive/MyDrive/AIOPS_RAG_Utils/chroma_db_2", embedding_function=embedding_function)
retriever = vectorstore.as_retriever()

docs = retriever.get_relevant_documents('how  is xx')
docs
dosatos commented 5 months ago

@bzr1 make sure you've set the collection_name correctly
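One way to find out which collection names actually live in a persisted store (and so which `collection_name` to pass) is to read them straight out of chroma.sqlite3 with the stdlib. A sketch; it assumes the newer chroma layout where a `collections` table holds the names, which is worth verifying against your chromadb version:

```python
import os
import sqlite3

def list_collection_names(persist_directory):
    """Return the collection names stored in chroma.sqlite3.

    Assumes a `collections` table with a `name` column; check this
    against your chromadb version's actual schema before relying on it.
    """
    db_path = os.path.join(persist_directory, "chroma.sqlite3")
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT name FROM collections").fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If the name you pass to `Chroma(...)` isn't in this list, langchain will quietly create a new, empty collection, which matches the empty-folder symptom described above.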