dankolesnikov closed this issue 1 year ago
@dankolesnikov I just tried this example with the code and dataset you provided and it all seems to work great. This might be a weird notebook thing? Can you try this in python scripts? I tested it in 2 different notebooks - worked great!
Faced the same issue.
I created an index using langchain in a notebook on one server, then zipped, downloaded, uploaded it to another server, unzipped it, and used it in a notebook there.
Previously, I got `NoIndexException` while querying on the second server. I repeated the index creation many times and finally got one version that works.
Now I have upgraded chromadb to the latest version, and I get an empty array on the second server. I will try repeating the index creation and hope for one version that works...
For the index creation, I use:

```python
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator(
    embedding=embed_model,
    text_splitter=splitter,
    vectorstore_kwargs={"persist_directory": "langchainindex"},
).from_documents(documents_list)
index.vectorstore.persist()
```
On the second server, I use:

```python
from utils import VectorstoreIndexCreator

index = VectorstoreIndexCreator(embedding=embed_model).from_persistent_index('langchainindex')
index.vectorstore.persist()
```
Note that I have changed `VectorstoreIndexCreator` a bit:

```python
class VectorstoreIndexCreator(BaseModel):
    # Existing code ....

    def from_persistent_index(self, path: str) -> VectorStoreIndexWrapper:
        """Load a vectorstore index from a persistent index."""
        vectorstore = self.vectorstore_cls(
            persist_directory=path, embedding_function=self.embedding
        )
        return VectorStoreIndexWrapper(vectorstore=vectorstore)
```
@Kefan-pauline can you provide a minimal reproducible example? it is hard to figure out what is happening here 🤔
@Kefan-pauline why would you call `.persist()` when reading from an existing index, since it has already been persisted? Could it lead to duplicate data?
@jeffchuber It is not a notebook issue, as I initially ran into this bug in a python script. My workflow is: copy the `db` folder that contains the index and its data created in step 1, and paste it into the python server.

Jeff, when you reproduced my example, did you restart the kernel (clearing all output cells) before running the second piece of code I provided? Things are expected to work if you didn't restart the kernel.
I just got one version that works. It was either chance or because I did things too fast previously; leaving some time between the different steps might help.
@dankolesnikov I think `.persist()` is no longer required when reading, as of the latest version. Now when I restart the kernel, `chroma-collections.parquet` and `chroma-embeddings.parquet` get updated automatically, which was not the case before.
I'm getting this as well @jeffchuber. As long as the process is running, I can create a chroma client and read the persisted data. If the process ends and I try to re-read the data, then it reads a collection of count 1. I'm on . . . Windows . . . which I know, long story.
Also, I'm not using LangChain, I'm using raw Chroma.
@lexsf @Kefan-pauline @dankolesnikov 😢
I'd really like to help here but I can't reproduce it.
I created this repo: https://github.com/jeffchuber/chroma_debugging

The folder `one` uses `load_data.ipynb` to load in a data set. I demonstrate in `use.py` and `use.ipynb` how to load that. Then I also copied and pasted the folder into `two` and created a flask app (`python main.py`), and it also seems to work...
@jeffchuber I will run through your repo right now and will report back on whether it works for me or not.
Quick question: in your code below you specify `chroma_db_impl="duckdb+parquet"`, is that important? With langchain, users never touch that parameter.

```python
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./data",
))
```
@jeffchuber When I ran the first notebook in your repo, I got the following error when running the last cell; I also saw it previously. Have you seen it before? Am I missing something?
@jeffchuber I think I was able to narrow down the bug / inconsistency. I've recorded a detailed loom; you can see the bug in all its glory starting at 3:05: https://www.loom.com/share/5b7307cdbe244ea7868092989d2172a3

I had a hard time triggering it, but I think I found the pattern. If you watch the loom video in full, you can see the pattern below in action.
Works fine:

1. create and persist data in chroma
2. restart kernel
3. read from the persisted folder; all good

Doesn't work:

1. create and persist data in chroma
2. delete the folder with persisted data without restarting the kernel
3. recreate the folder
4. restart the kernel (if you want)
5. attempt to read from the persisted folder; you will get `[]`

From then on, anytime you restart the kernel you can't read from the folder; you will receive `[]` when querying.
@jeffchuber do you have any hunches? I have a few, but tbh this still doesn't explain why things didn't work for me the very first time.
@dankolesnikov I think this is the problem: https://github.com/chroma-core/chroma/blob/main/chromadb/db/duckdb.py#L412

I think the `atexit` magic for saving gets inconsistently called on process exit, especially in the case of many clients being generated.
Let me try what you said too...
I suspect this bug is due to langchain implicitly creating many different chroma clients under the hood, which will not work. The chroma client should be treated as a singleton, and we should separately add some validation for this.
What is happening is that these separate clients are stomping on each other in their `atexit` handlers.
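A toy simulation (plain Python, no chromadb) of that stomping: two clients that each snapshot their full in-memory state to the same location, with the later writer silently discarding the earlier one's data. The `ToyClient` class below is purely illustrative, not chroma's actual code.

```python
import json
import os
import tempfile

class ToyClient:
    """Stand-in for a persistent client that dumps its whole
    in-memory state to disk when it shuts down (the way an
    atexit-registered save hook would)."""

    def __init__(self, path):
        self.path = path
        self.records = []
        if os.path.exists(path):
            with open(path) as f:
                self.records = json.load(f)

    def add(self, record):
        self.records.append(record)

    def persist(self):
        # Writes the ENTIRE in-memory snapshot, not a delta.
        with open(self.path, "w") as f:
            json.dump(self.records, f)

path = os.path.join(tempfile.mkdtemp(), "db.json")

a = ToyClient(path)  # first client: store is empty
b = ToyClient(path)  # second client: also loads an empty store

a.add("doc-1")
a.persist()          # disk now holds ["doc-1"]
b.persist()          # the second client's empty snapshot overwrites it

with open(path) as f:
    print(json.load(f))  # -> []
```

The second client never saw the first client's write, so whichever exit hook runs last wins, and the data "disappears" exactly as described above.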
Langchain supports passing in a client OR the settings/persist directory. Can people experiencing this issue instead pass the client into Chroma.from_documents and similar functions? This should resolve your issue. Thanks!
Apologies for the confusion.
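One way to avoid accidentally spawning extra clients is to funnel all construction through a cached factory. A minimal sketch of the singleton pattern, where the hypothetical `make_client` stands in for the real `chromadb.Client(Settings(...))` call:

```python
from functools import lru_cache

def make_client(persist_directory: str) -> dict:
    # Placeholder for chromadb.Client(Settings(persist_directory=...)).
    return {"persist_directory": persist_directory, "records": []}

@lru_cache(maxsize=None)
def get_client(persist_directory: str) -> dict:
    # Callers asking for the same directory always get the same object,
    # so only one client ever persists to that directory at exit.
    return make_client(persist_directory)

c1 = get_client("./data")
c2 = get_client("./data")
print(c1 is c2)  # -> True
```

With a shared client like this, passing it into the langchain helpers means no second client is ever created behind your back.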
That makes sense, guys! Thank you. Please advise whether a langchain ticket should be created or if there is an existing one? I think we can close this one. @jeffchuber I think this is an important area to fix, because people use langchain a lot to interface with vector dbs, and this results in folks churning from chroma.
Does Chroma team own the development and maintenance of the Chroma implementation in Langchain?
Just chatted with Harrison about a few things earlier today! We own the Chroma implementation (with a ton of community help), but this is going to be hard for us to enforce... it seems the real "offending" methods are `from_documents` and `from_texts`, since that is the footgun here.
I'll need to think more about this.
@dankolesnikov any ideas?
I will noodle on this more and can propose a solution; happy to help contribute on this front. I am familiar with the Chroma implementation in langchain, partially due to the work on Auto Evaluator: https://autoevaluator.langchain.com/
@dankolesnikov that would be really great! 🎉
@dankolesnikov @jeffchuber I am also facing the same issue. I am using chromadb with langchain. Is there any temporary fix you guys would suggest for this?
Hello team. Any updates on this? Another langchain user with the same issue. Thanks for your hard work!
@satishmaddula @sbslee does this help? https://docs.trychroma.com/troubleshooting#your-index-resets-back-to-just-a-few-number-of-records
@jeffchuber,
Thanks for the reply. The link does give me some insight as to why this problem occurs. I'm new to both langchain and chromadb so I will need to study more, but can I ask a follow-up question?
This problem seems to be Windows-specific (another user in this thread also appears to be a Windows user), because I'm a Mac user and this has never happened to me, but it always happens to my coworker who uses Windows. The problem also seems somewhat random because it's really difficult to reproduce: once you create a vector store, the chatbot runs perfectly fine for a while, but then all of a sudden it stops retrieving requested data, my coworker is forced to recreate the vector store, and the cycle goes on. I know it's a pain to address an issue that is not reproducible, and that's why I haven't raised any issues in this repo (I just told my coworker to keep recreating the vector store), but I just came across this post and thought I would chime in. (My coworker and) I would greatly appreciate it if you could address these concerns.
To give you more specific context of our langchain + chromadb implementation, I will share my code:
Basically, I'm building a GUI for various chatbots, and one of them is called DocGPT, which allows you to chat with your documents. Users have a choice to either create a new database or use an existing database. Everything works fine for me, but whenever my coworker tries to use an existing database, 10% of the time the chatbot goes blind to the database and keeps complaining that the requested data is not there, even though it gave a perfect answer a minute ago. When this happens, the chatbot is gone for good, because no matter how many times my coworker tries to connect to the database, the requested data is not there. Of course, I suspected that maybe some data were overwritten as the provided link suggests, but when I looked at the index and parquet files (chroma-collections.parquet and chroma-embeddings.parquet) on my coworker's computer, the last-modification timestamps were unchanged, indicating to me that no data were overwritten.
@sbslee that same small snippet of code works for you, but not your coworker on windows?
@jeffchuber, there is a little more nuance, but basically, yes. Note again that the issue doesn't happen 100% of the time even for my coworker (roughly 10%). We are testing to use a different vector store and see if the same problem occurs.
@jeffchuber,
I stand corrected. Previously, I mentioned:
> Of course, I suspected that maybe some data were overwritten as the provided link suggests but when I look at the index and parquet files (chroma-collections.parquet and chroma-embeddings.parquet) of my coworker's computer the time mark of last modification is unchanged indicating to me that no data were overwritten.
This is not true! I had my coworker carefully monitor the contents of the `chroma-collections.parquet` and `chroma-embeddings.parquet` files. We found that when the chatbot was no longer able to retrieve requested data (reminding you, this does not happen 100% of the time, but when it does happen the chatbot is gone for good), some of the contents of the `chroma-embeddings.parquet` file were indeed deleted. We are still investigating what triggers this data deletion event, but I agree with you 100% that this issue is tightly associated with the link you suggested. I will give you an update when we discover what prompts this issue.
@sbslee we are releasing an updated version of chroma next wednesday that will persist on every write and will totally eliminate this weird runtime behavior. stay tuned!
> We are still investigating what triggers this data deletion event
I can confirm that this was triggered when my coworker tried to re-connect to the database while the temp parquet files were being processed (some of his documents were huge and it took some time to process these files). The reason it appeared random at first was because we didn't realize the existence of these temp files (they only exist, well, temporarily so we never spotted them). But when he started waiting for these temp files to clear up before re-connecting to the database, the problem is now gone. Hope this helps future users.
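For anyone hitting the same race, the "wait for the temp files to clear" workaround can be automated with a small polling helper. This is a sketch; the `"*.tmp"` glob pattern is an assumption, so inspect your persist directory during ingestion to see what the temporary files are actually named.

```python
import glob
import os
import time

def wait_for_temp_files(persist_directory, pattern="*.tmp",
                        timeout=60.0, poll=0.5):
    """Block until no files matching `pattern` remain in the directory,
    or until `timeout` seconds have passed.

    Returns True if the directory is clear, False on timeout.
    NOTE: the default pattern is a guess; check your persist directory
    while documents are being processed for the real temp-file names.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        leftovers = glob.glob(os.path.join(persist_directory, pattern))
        if not leftovers:
            return True
        time.sleep(poll)
    return False
```

Calling this before re-connecting to the database would mimic what the coworker did by hand: only reopen the store once ingestion has fully flushed to disk.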
@sbslee oh this is great to know! our release later today (announcing monday) switches to incremental persisting and will avoid this
Closing this as it is stale. Please let me know if anything else pops up here and we can re-open it.
this happened again on my end now @jeffchuber. Somehow, when I run the code below to try to load a saved vector store, it creates a new folder. (By that I mean: the chroma folder used to have 2 items, one being the `chroma.sqlite3` file, the other a folder containing `header.bin`, `index_metadata.pickle`, and such; it now creates an empty folder with a name similar to that folder.) There is nothing in the new folder, and even if I delete it, loading still does not work.
```python
local_model_path = '/content/drive/MyDrive/AIOPS_RAG_Utils/distiluse-base-multilingual-cased-v1'
embedding_function = SentenceTransformerEmbeddings(model_name=local_model_path)
vectorstore = Chroma(
    persist_directory="/content/drive/MyDrive/AIOPS_RAG_Utils/chroma_db_2",
    embedding_function=embedding_function,
)
retriever = vectorstore.as_retriever()
docs = retriever.get_relevant_documents('how is xx')
docs
```
@bzr1 make sure you've set the `collection_name` correctly
What happened?
The following example uses langchain to successfully load documents into chroma and to successfully persist the data. Querying works as expected.
However, when we restart the notebook and attempt to query again, without ingesting data and instead reading from the persisted directory, we get [] when querying, both using the langchain wrapper's method and chromadb's client (accessed from the langchain wrapper). I've concluded that there is either a deep bug in chromadb or I am doing something wrong. Please help!
Code to reproduce
In the notebook paste:
Now let's restart the notebook and run the following code, which I believe should work.
To make sure it's not a langchain issue, let's get the chroma client from the Chroma class and query the data at a lower level.
Am I doing something wrong? To make this easier to reproduce, I am attaching the JSON that is fed into the data as a zip file (GitHub doesn't allow json attachments), but I believe you can reproduce it using any document (my assumption).
data_auto_chapters.json.zip
Versions
Python 3.8.8
langchain 0.0.184
chroma 0.3.25
Relevant log output
No response