Open huanggefan opened 1 year ago
@huanggefan why are you removing index_metadata.pickle?
When index_metadata.pickle is removed, it works; if index_metadata.pickle exists, it raises a RuntimeError.
@huanggefan I see. Our tests haven't been able to pick this up, and we do test this quite a bit.
Here are a few ideas from ChatGPT...
The error traceback you've provided suggests that the program is having trouble opening a file, specifically a header file. This could be due to a number of reasons:
File Doesn't Exist: The file it's trying to open might not exist. Verify that the file is indeed present at the expected location.
Incorrect Path: If the file does exist, make sure the program is looking for it in the right place. The path might be relative or absolute. If it's relative, it's relative to the working directory when you started the program.
File Permissions: The program may not have sufficient permissions to open the file. Check the permissions on the file and ensure that the user running the program has the necessary permissions to read the file.
File is Being Used by Another Process: The file could be locked or being used by another process. Make sure no other processes are using the file when you're trying to run your program.
To resolve this issue, try to manually access the file path the software is trying to use. If you can access it, and the file is there, check if the file is locked or if the process has appropriate permissions to access it. Also, ensure that the file is not currently being used by another process. If all these are fine, the problem might be with the software itself - perhaps it's not handling paths correctly or it's incorrectly formulating the path to the file.
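A minimal sketch of those checks in Python (check_index_file is a made-up helper name, and the path is the segment directory mentioned in this issue):

import os

def check_index_file(path):
    # Walk through the failure modes listed above: missing file or wrong
    # working directory, then read permissions.
    if not os.path.exists(path):
        print(f"{path} does not exist (missing file or wrong working directory)")
    elif not os.access(path, os.R_OK):
        print(f"{path} exists but is not readable (check permissions)")
    else:
        print(f"{path} exists and is readable, {os.path.getsize(path)} bytes")

check_index_file("64597cda-24ba-4d7a-8fc4-b96f1fc098d9/header.bin")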
This error indicates the index did not initialize or save.
Does the file 64597cda-24ba-4d7a-8fc4-b96f1fc098d9/header.bin exist?
By deleting files as you have, you may have corrupted your db and need to restart from scratch.
I run it locally:
chroma_client = chromadb.PersistentClient(path=str(self.collection_dir))
chroma_collection = chroma_client.get_or_create_collection(name=self.name)
chroma_collection.upsert(ids=[i_d], embeddings=[chunk_embedding], documents=[document_chunk], metadatas=[metadata])
...
chroma_client.stop()
There is only index_metadata.pickle, no header.bin.
@HammadB
Hmmm, that does not make sense. What version of chroma-hnswlib do you have installed?
I found chroma_hnswlib at env\Lib\site-packages\chroma_hnswlib-0.7.1.dist-info:
chroma_hnswlib 0.7.1
@HammadB
I tried chroma_hnswlib 0.7.2, but I get the same error: RuntimeError: Cannot open header file @HammadB
@huanggefan, does this happen when you use a clean directory? The implication of this error is that somehow the underlying vector index has not been created, but since you have deleted files, the index is in a halfway state. When the folder exists, we assume the index has been created and initialized properly.
I am trying to understand if the error is because of your partially constructed index or because of some deeper issue. Can you share the script you are using to reproduce the issue?
I found the key to the problem
@HammadB
@huanggefan - interesting. I will take a look at your code. But to me this implies something strange with the ability to create and use the files on SSD/HDD vs other disk types. Have not seen that issue before, many people are running on SSD/HDD just fine. Do you have any insight into your specific platform and why this might be the case?
I set collection_dir to a RamDisk and ran the code for a while. After that, I found a clean folder on the SSD, cleared collection_dir, and re-imported the data. Surprisingly, everything is working fine now.
This problem is quite strange now. I'll try restarting the computer a few times and check it out after I finish work.
@HammadB
Current versions: chroma-hnswlib 0.7.2, chromadb 0.4.7
When I noticed that the disk was no longer writing data, meaning that files like index_metadata.pickle were no longer changing, I encountered this error. I'm not sure if this is related to the issue.
Exception occurred invoking consumer for subscription acc30d643e2541a7bc0eebf1caec76e2to topic persistent://default/default/1ea6b644-e293-4cdf-ae8c-35e709d9ad8d Index with capacity 100 and 100 current entries cannot add 1 records
@HammadB @jeffchuber
I experienced this same phenomenon when using Japanese directory names. I solved this problem by using English directory names.
I think we should atomically rename a new index_metadata into place instead of overwriting it. I can only suspect that the file got corrupted.
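For what it's worth, a minimal sketch of that write-then-rename pattern (this is not Chroma's actual persistence code, just an illustration of the idea; atomic_write_pickle is a made-up helper):

import os
import pickle
import tempfile

def atomic_write_pickle(obj, target_path):
    # Write to a temp file in the same directory, flush it to disk, then
    # atomically replace the target so readers never see a half-written
    # index_metadata.pickle.
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, target_path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.remove(tmp_path)
        raise

atomic_write_pickle({"example": "metadata"}, "index_metadata.pickle")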
I am having the same problem. I run my code on Google Colab.
import os
import chromadb
from chromadb.utils import embedding_functions

# Chroma Location
chroma_location = "chroma_24_nov_2023"
vectorDBlocation = os.path.join("/content/drive/My Drive/", chroma_location)
client = chromadb.PersistentClient(path=vectorDBlocation)

# Embedder (processing_device is set elsewhere in the notebook, e.g. "cuda" or "cpu")
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2", device=processing_device)

# collection_name
collection_name = 'ABC_Collection'
distance = 'cosine'

# Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = client.get_or_create_collection(name=collection_name,
                                             embedding_function=sentence_transformer_ef,
                                             metadata={"hnsw:space": distance})

# Use this function to add data to the collection
def add_to_collection(ids, text_chunks, extended_meta_data):
    collection.add(
        ids=ids,
        documents=text_chunks,
        metadatas=extended_meta_data
    )
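For example, a call with placeholder data looks like this (the ids, chunks, and metadata are made up for illustration):

add_to_collection(
    ids=["doc-1-chunk-0", "doc-1-chunk-1"],
    text_chunks=["First chunk of text.", "Second chunk of text."],
    extended_meta_data=[{"source": "example.pdf"}, {"source": "example.pdf"}]
)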
The code was running fine the first time data was added to the collection. When I restarted the kernel and tried to use the system the next morning, I started getting the same error.
For example, collection.peek() returned the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-d0cc5020d39d> in <cell line: 1>()
----> 1 collection.peek()
10 frames
/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py in peek(self, limit)
235 GetResult: A GetResult object containing the results.
236 """
--> 237 return self._client._peek(self.id, limit)
238
239 def query(
/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
125 global tracer, granularity
126 if trace_granularity < granularity:
--> 127 return f(*args, **kwargs)
128 if not tracer:
129 return f(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/chromadb/api/segment.py in _peek(self, collection_id, n)
745 def _peek(self, collection_id: UUID, n: int = 10) -> GetResult:
746 add_attributes_to_current_span({"collection_id": str(collection_id)})
--> 747 return self._get(collection_id, limit=n) # type: ignore
748
749 @override
/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
125 global tracer, granularity
126 if trace_granularity < granularity:
--> 127 return f(*args, **kwargs)
128 if not tracer:
129 return f(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/chromadb/api/segment.py in _get(self, collection_id, ids, where, sort, limit, offset, page, page_size, where_document, include)
508 if "embeddings" in include:
509 vector_ids = [r["id"] for r in records]
--> 510 vector_segment = self._manager.get_segment(collection_id, VectorReader)
511 vectors = vector_segment.get_vectors(ids=vector_ids)
512
/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
125 global tracer, granularity
126 if trace_granularity < granularity:
--> 127 return f(*args, **kwargs)
128 if not tracer:
129 return f(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/manager/local.py in get_segment(self, collection_id, type)
157 # creates the instance.
158 with self._lock:
--> 159 instance = self._instance(self._segment_cache[collection_id][scope])
160 return cast(S, instance)
161
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/manager/local.py in _instance(self, segment)
186 if segment["id"] not in self._instances:
187 cls = self._cls(segment)
--> 188 instance = cls(self._system, segment)
189 instance.start()
190 self._instances[segment["id"]] = instance
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in __init__(self, system, segment)
117 if len(self._id_to_label) > 0:
118 self._dimensionality = cast(int, self._dimensionality)
--> 119 self._init_index(self._dimensionality)
120 else:
121 self._persist_data = PersistentData(
/usr/local/lib/python3.10/dist-packages/chromadb/telemetry/opentelemetry/__init__.py in wrapper(*args, **kwargs)
125 global tracer, granularity
126 if trace_granularity < granularity:
--> 127 return f(*args, **kwargs)
128 if not tracer:
129 return f(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in _init_index(self, dimensionality)
162 # Check if index exists and load it if it does
163 if self._index_exists():
--> 164 index.load_index(
165 self._get_storage_folder(),
166 is_persistent_index=True,
RuntimeError: Cannot open header file
And when I look at the "chroma_24_nov_2023" directory, I only see "chroma.sqlite3" in the main folder and a single file called "index_metadata.pickle" in the "8c313532-435c-45a4-97dc-46b92f45e71a" directory.
But strangely, when I deleted the "8c313532-435c-45a4-97dc-46b92f45e71a" directory and ran the code again, it works. I get a new directory called "8c313532-435c-45a4-97dc-46b92f45e71a" (the exact same name as before), but everything works now! This is a serious problem if we want to use this code in production! Any fix?
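A rough sketch of how to spot that half-written state programmatically, assuming the layout described above (a per-collection UUID directory that should contain header.bin alongside index_metadata.pickle):

import os

persist_dir = "/content/drive/My Drive/chroma_24_nov_2023"  # the path passed to PersistentClient

# Flag segment directories that have index_metadata.pickle but no header.bin,
# i.e. the state described in this thread.
for name in os.listdir(persist_dir):
    segment_dir = os.path.join(persist_dir, name)
    if not os.path.isdir(segment_dir):
        continue
    files = set(os.listdir(segment_dir))
    if "index_metadata.pickle" in files and "header.bin" not in files:
        print(f"Suspicious segment dir (metadata but no header.bin): {segment_dir}")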
I have the same problem. Is there no solution?
I also encountered the same problem: there is only one index_metadata.pickle file in the folder. After running a search I get the error "cannot open header file", and afterwards I cannot add files to or search documents in this collection. Has anyone solved it?
In my opinion, it's because of the 1000-record limit in the database. If you change the number 1000 to 10000 in all of the .py files, you may solve the problem.
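If the limit being referred to is the HNSW batch/sync threshold, editing the library source shouldn't be necessary: recent chromadb versions accept these values as collection metadata. Whether the version you run supports the hnsw:batch_size / hnsw:sync_threshold keys is something to check in its docs, so treat this as a sketch rather than a verified fix for 0.4.x:

import chromadb

client = chromadb.PersistentClient(path="chroma_db")  # placeholder path
# Assumption: the installed chromadb version accepts these HNSW tuning keys.
collection = client.get_or_create_collection(
    name="ABC_Collection",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:batch_size": 100,        # in-memory batch size before items are added to the HNSW index
        "hnsw:sync_threshold": 10000,  # number of items after which the index is flushed to disk
    }
)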
Thank you, I found the reason: the folder needs to be named in English, otherwise this error occurs. @okamura-0422
You can delete the index_metadata.pickle file and everything will work!
import os

index_metadata_path = PATHDB + '/a4f2f538-d3b2-4906-81b9-de5e77c40d9e/index_metadata.pickle'
if os.path.exists(index_metadata_path):
    os.remove(index_metadata_path)
    print(f"File {index_metadata_path} was deleted.")
You need to rename the folders to English-language names before the index_metadata.pickle file is automatically created by the database on first access. In short, first delete the file, and then rename the whole path so that all the folders have English names. In that case, when index_metadata.pickle is created the next time the database is accessed, everything will already be correct and you won't need to delete it again. The original cause of the errors is Russian-language folder names in the path to the database.
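A tiny sketch of the pre-flight check this implies, assuming you only want to verify that the persist path is plain ASCII before the database creates index_metadata.pickle (assert_ascii_path is a made-up helper):

import os

def assert_ascii_path(path):
    # Fail fast if any folder in the path contains non-ASCII characters
    # (Russian or Japanese folder names are the trigger reported in this thread).
    absolute = os.path.abspath(path)
    if not absolute.isascii():
        raise ValueError(f"Persist path contains non-ASCII characters: {absolute!r}")

assert_ascii_path("chroma_db")  # call this before chromadb.PersistentClient(path=...)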
/usr/local/lib/python3.10/dist-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py in _init_index(self, dimensionality)
206 # Check if index exists and load it if it does
207 if self._index_exists():
--> 208 index.load_index(
209 self._get_storage_folder(),
210 is_persistent_index=True,
RuntimeError: Cannot open data_level0 file
The first time I make the collection, everything works just fine. But I cannot load the db again; the file is just not being created. I don't have non-English characters in the path. I'm going to downgrade the chromadb version and see if .persist() can force-create the file.
Update: the old version doesn't support id strings containing the ' character, so it took a while. But it seems forcing the persist saves everything to my drive as expected.
It makes sense, because constant rewrites could end up corrupting the filesystem.
What happened?
1. init data
chroma_client = chromadb.PersistentClient(path=str(collection_dir))
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)
chroma_collection.upsert(ids=[id], embeddings=[embedding], documents=[document], metadatas=[metadata])
2. query, and a RuntimeError is raised
collection.query(query_embeddings=query_embeddings, n_results=100)
3. remove index_metadata.pickle
rm 64597cda-24ba-4d7a-8fc4-b96f1fc098d9/index_metadata.pickle
4. query again: no RuntimeError
5. restart the program and query: RuntimeError again
6. remove index_metadata.pickle, restart the program, and query: no RuntimeError
So, when index_metadata.pickle exists, a RuntimeError is raised.
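Putting steps 1-6 together, the repro looks roughly like this (the path, id, and embedding values are placeholders; the UUID-named segment directory under collection_dir differs on every run):

import chromadb

collection_dir = "chroma_repro"   # placeholder path
collection_name = "test"

# 1. init data
chroma_client = chromadb.PersistentClient(path=str(collection_dir))
chroma_collection = chroma_client.get_or_create_collection(name=collection_name)
chroma_collection.upsert(
    ids=["doc-0"],
    embeddings=[[0.1, 0.2, 0.3]],
    documents=["example document"],
    metadatas=[{"source": "placeholder"}],
)

# 2. query -> raises RuntimeError: Cannot open header file
chroma_collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=100)

# 3-6. workaround observed above: remove index_metadata.pickle from the
# UUID-named segment directory under collection_dir and restart; the query
# then succeeds, and the error returns once the file exists again.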
Versions
chromadb 0.4.2, Python 3.11.4, Windows 11
Relevant log output